Re: [RFC PATCH v2 3/3] x86/mm/tlb: Avoid deferring PTI flushes on shootdown
> On Aug 27, 2019, at 4:07 PM, Andy Lutomirski wrote: > > On Fri, Aug 23, 2019 at 11:13 PM Nadav Amit wrote: >> When a shootdown is initiated, the initiating CPU has cycles to burn as >> it waits for the responding CPUs to receive the IPI and acknowledge it. >> In these cycles it is better to flush the user page-tables using >> INVPCID, instead of deferring the TLB flush. >> >> The best way to figure out whether there are cycles to burn is arguably >> to expose from the SMP layer when an acknowledgment is received. >> However, this would break some abstractions. >> >> Instead, use a simpler solution: the initiating CPU of a TLB shootdown >> would not defer PTI flushes. It is not always a win, relatively to >> deferring user page-table flushes, but it prevents performance >> regression. >> >> Signed-off-by: Nadav Amit >> --- >> arch/x86/include/asm/tlbflush.h | 1 + >> arch/x86/mm/tlb.c | 10 +- >> 2 files changed, 10 insertions(+), 1 deletion(-) >> >> diff --git a/arch/x86/include/asm/tlbflush.h >> b/arch/x86/include/asm/tlbflush.h >> index da56aa3ccd07..066b3804f876 100644 >> --- a/arch/x86/include/asm/tlbflush.h >> +++ b/arch/x86/include/asm/tlbflush.h >> @@ -573,6 +573,7 @@ struct flush_tlb_info { >>unsigned intinitiating_cpu; >>u8 stride_shift; >>u8 freed_tables; >> + u8 shootdown; > > I find the name "shootdown" to be confusing. How about "more_than_one_cpu”? I think the current semantic is more of “includes remote cpus”. How about calling it “local_only”, and negating its value?
Re: [RFC PATCH v2 2/3] x86/mm/tlb: Defer PTI flushes
> On Aug 27, 2019, at 4:13 PM, Andy Lutomirski wrote: > > On Fri, Aug 23, 2019 at 11:13 PM Nadav Amit wrote: >> INVPCID is considerably slower than INVLPG of a single PTE. Using it to >> flush the user page-tables when PTI is enabled therefore introduces >> significant overhead. >> >> Instead, unless page-tables are released, it is possible to defer the >> flushing of the user page-tables until the time the code returns to >> userspace. These page tables are not in use, so deferring them is not a >> security hazard. > > I agree and, in fact, I argued against ever using INVPCID in the > original PTI code. > > However, I don't see what freeing page tables has to do with this. If > the CPU can actually do speculative page walks based on the contents > of non-current-PCID TLB entries, then we have major problems, since we > don't actively flush the TLB for non-running mms at all. That was not my concern. > > I suppose that, if we free a page table, then we can't activate the > PCID by writing to CR3 before flushing things. But we can still defer > the flush and just set the flush bit when we write to CR3. This was my concern. I can change the behavior so the code would flush the whole TLB instead. I just tried not to change the existing behavior too much.
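The freed-page-table concern above boils down to a CR3 bit: with PCID, a CR3 write with bit 63 set keeps the TLB entries for that PCID, so after freeing page tables that bit must stay clear. A minimal userspace sketch of that decision (the helper name and layout are illustrative, not the kernel's actual code):

```c
#include <stdint.h>

#define CR3_NOFLUSH (1ULL << 63)  /* PCID "no flush" bit on CR3 writes */

/* Hypothetical helper: build the CR3 value used when returning to
 * userspace.  If page tables were freed while the flush was deferred,
 * leave NOFLUSH clear so the CR3 write itself discards the stale
 * entries for this PCID; otherwise keep deferring via INVLPG later. */
static uint64_t build_user_cr3(uint64_t pgd_pa, uint16_t pcid, int freed_tables)
{
    uint64_t cr3 = pgd_pa | pcid;

    if (!freed_tables)
        cr3 |= CR3_NOFLUSH;  /* safe to defer the flush */
    return cr3;
}
```

This matches the "defer the flush and just set the flush bit when we write to CR3" suggestion: the flush bit is simply the absence of NOFLUSH.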
Re: [RFC PATCH 0/3] x86/mm/tlb: Defer TLB flushes with PTI
> On Aug 27, 2019, at 4:18 PM, Andy Lutomirski wrote: > > On Fri, Aug 23, 2019 at 11:07 PM Nadav Amit wrote: >> INVPCID is considerably slower than INVLPG of a single PTE, but it is >> currently used to flush PTEs in the user page-table when PTI is used. >> >> Instead, it is possible to defer TLB flushes until after the user >> page-tables are loaded. Preventing speculation over the TLB flushes >> should keep the whole thing safe. In some cases, deferring TLB flushes >> in such a way can result in more full TLB flushes, but arguably this >> behavior is oftentimes beneficial. > > I have a somewhat horrible suggestion. > > Would it make sense to refactor this so that it works for user *and* > kernel tables? In particular, if we flush a *kernel* mapping (vfree, > vunmap, set_memory_ro, etc), we shouldn't need to send an IPI to a > task that is running user code to flush most kernel mappings or even > to free kernel pagetables. The same trick could be done if we treat > idle like user mode for this purpose. > > In code, this could mostly consist of changing all the "user" data > structures involved to something like struct deferred_flush_info and > having one for user and one for kernel. > > I think this is horrible because it will enable certain workloads to > work considerably faster with PTI on than with PTI off, and that would > be a barely excusable moral failing. :-p > > For what it's worth, other than register clobber issues, the whole > "switch CR3 for PTI" logic ought to be doable in C. I don't know a > priori whether that would end up being an improvement. I implemented (and have not yet sent) another TLB-deferring mechanism. It is intended for user mappings and not kernel ones, but I think your suggestion shares a similar underlying rationale, and therefore similar challenges and solutions. Let me rephrase what you say to ensure we are on the same page. The basic idea is to use context-tracking to check whether each CPU is in kernel or user mode. 
Accordingly, TLB flushes can be deferred, but I don’t see that this solution is limited to PTI. There are 2 possible reasons, to my understanding, that you limit the discussion to PTI:

1. PTI provides clear boundaries when user and kernel mappings are used. I am not sure that privilege-levels (and SMAP) do not do the same.

2. CR3 switching already imposes a memory barrier, which eliminates most of the cost of implementing such a scheme, which requires something similar to:

	write new context (kernel/user)
	mb();
	if (need_flush)
		flush;

I do agree that PTI addresses (2), but there is another problem. A reasonable implementation would store in a per-CPU state whether each CPU is in user/kernel mode, and the TLB shootdown initiator CPU would check that state to decide whether an IPI is needed. This means that pretty much every TLB shootdown would incur a cache miss per target CPU. This might cause performance regressions, at least in some cases.

Admittedly, I did implement something similar (not sent) for user mappings: defer all TLB flushes and shootdowns if the CPUs are known to be in kernel mode. But I limited myself to certain cases, specifically “long” syscalls that are already likely to cause a TLB flush (e.g., msync()). I am not sure that tracking each CPU entry/exit would be a good idea.

I will give some more thought to kernel mapping invalidations, which I did not think about enough. I tried to send what I considered “saner” and cleaner patches first. I still have the patches I mentioned here, the async-flushes, and another patch to avoid a local TLB flush on CoW by instead accessing the data.
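The scheme being discussed can be modeled in a few lines of C (all names hypothetical, and the real thing would have to close the race where the target switches mode concurrently with the initiator's check): the target publishes its context with a barrier-implying atomic store and honors any deferred flush on each switch, while the initiator's read of the published state is exactly the per-target cache miss mentioned above.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 4
enum cpu_ctx { CTX_KERNEL, CTX_USER };

static _Atomic int  cpu_context[NR_CPUS];    /* published mode per CPU */
static _Atomic bool deferred_flush[NR_CPUS]; /* user flush pending     */

/* Target side: "write new context; mb(); if (need_flush) flush".
 * The seq_cst store plays the role of the mb() above. */
static void switch_context(int cpu, enum cpu_ctx ctx)
{
    atomic_store(&cpu_context[cpu], ctx);
    if (atomic_exchange(&deferred_flush[cpu], false)) {
        /* a local TLB flush would be performed here */
    }
}

/* Initiator side: for a user-mapping flush, IPI only CPUs currently
 * in user mode; for the rest, just mark the flush as deferred.  The
 * atomic_load is the per-target cache miss discussed above. */
static bool needs_ipi_for_user_flush(int cpu)
{
    if (atomic_load(&cpu_context[cpu]) == CTX_USER)
        return true;
    atomic_store(&deferred_flush[cpu], true);
    return false;
}
```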
Re: [RFC PATCH v2 2/3] x86/mm/tlb: Defer PTI flushes
> On Aug 27, 2019, at 11:28 AM, Dave Hansen wrote: > > On 8/23/19 3:52 PM, Nadav Amit wrote: >> INVPCID is considerably slower than INVLPG of a single PTE. Using it to >> flush the user page-tables when PTI is enabled therefore introduces >> significant overhead. > > I'm not sure this is worth all the churn, especially in the entry code. > For large flushes (> tlb_single_page_flush_ceiling), we don't do > INVPCIDs in the first place.

It is possible to jump from flush_tlb_func() into the trampoline page, instead of flushing the TLB in the entry code. However, it induces higher overhead (switching CR3s), so it will only be useful if multiple TLB entries are flushed at once. It also prevents exploiting opportunities to promote individual entry flushes into a full TLB flush when multiple flushes are issued or when a context switch takes place before returning to userspace.

There are cases/workloads that flush multiple (but not too many) TLB entries on every syscall, for instance issuing msync() or running the Apache webserver. So I am not sure that tlb_single_page_flush_ceiling saves the day. Besides, you may want to recalibrate (lower) tlb_single_page_flush_ceiling when PTI is used.

> I'd really want to understand what the heck is going on that makes > INVPCID so slow, first.

INVPCID-single is slow (even more than 133 cycles slower than INVLPG that you mentioned; I don’t have the numbers in front of me). I thought that this was a known fact, although, obviously, it does not make much sense.
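For reference, the ceiling policy under discussion reduces to something like the following (a simplified model, not the kernel's exact code; 33 was the upstream default around this time, and lowering it for PTI is the recalibration suggested above):

```c
/* Simplified model of the x86 flush-size policy: small ranges are
 * flushed entry by entry, larger ones with one full TLB flush. */
static unsigned long tlb_single_page_flush_ceiling = 33;

static int use_full_flush(unsigned long start, unsigned long end,
                          unsigned int stride_shift)
{
    unsigned long nr_pages = (end - start) >> stride_shift;

    /* With PTI, each per-entry flush may also pay for the user
     * page-table copy, which argues for a lower ceiling. */
    return nr_pages > tlb_single_page_flush_ceiling;
}
```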
Re: [PATCH] vmw_balloon: Fix offline page marking with compaction
> On Aug 21, 2019, at 1:30 AM, David Hildenbrand wrote: > > On 20.08.19 18:01, Nadav Amit wrote: >> The compaction code already marks pages as offline when it enqueues >> pages in the ballooned page list, and removes the mapping when the pages >> are removed from the list. VMware balloon also updates the flags, >> instead of letting the balloon-compaction logic handle it, which causes >> the assertion VM_BUG_ON_PAGE(!PageOffline(page)) to fire, when >> __ClearPageOffline is called the second time. This causes the following >> crash. >> >> [ 487.104520] kernel BUG at include/linux/page-flags.h:749! >> [ 487.106364] invalid opcode: [#1] SMP DEBUG_PAGEALLOC PTI >> [ 487.107681] CPU: 7 PID: 1106 Comm: kworker/7:3 Not tainted >> 5.3.0-rc5balloon #227 >> [ 487.109196] Hardware name: VMware, Inc. VMware Virtual Platform/440BX >> Desktop Reference Platform, BIOS 6.00 12/12/2018 >> [ 487.111452] Workqueue: events_freezable vmballoon_work [vmw_balloon] >> [ 487.112779] RIP: 0010:vmballoon_release_page_list+0xaa/0x100 [vmw_balloon] >> [ 487.114200] Code: fe 48 c1 e7 06 4c 01 c7 8b 47 30 41 89 c1 41 81 e1 00 >> 01 00 f0 41 81 f9 00 00 00 f0 74 d3 48 c7 c6 08 a1 a1 c0 e8 06 0d e7 ea <0f> >> 0b 44 89 f6 4c 89 c7 e8 49 9c e9 ea 49 8d 75 08 49 8b 45 08 4d >> [ 487.118033] RSP: 0018:b82f012bbc98 EFLAGS: 00010246 >> [ 487.119135] RAX: 0037 RBX: 0001 RCX: >> 0006 >> [ 487.120601] RDX: RSI: RDI: >> 9a85b6bd7620 >> [ 487.122071] RBP: b82f012bbcc0 R08: 0001 R09: >> >> [ 487.123536] R10: R11: R12: >> b82f012bbd00 >> [ 487.125002] R13: e97f4598d9c0 R14: R15: >> b82f012bbd34 >> [ 487.126463] FS: () GS:9a85b6bc() >> knlGS: >> [ 487.128110] CS: 0010 DS: ES: CR0: 80050033 >> [ 487.129316] CR2: 7ffe6e413ea0 CR3: 000230b18001 CR4: >> 003606e0 >> [ 487.130812] DR0: DR1: DR2: >> >> [ 487.132283] DR3: DR6: fffe0ff0 DR7: >> 0400 >> [ 487.133749] Call Trace: >> [ 487.134333] vmballoon_deflate+0x22c/0x390 [vmw_balloon] >> [ 487.135468] vmballoon_work+0x6e7/0x913 [vmw_balloon] >> [ 487.136711] ? 
process_one_work+0x21a/0x5e0 >> [ 487.138581] process_one_work+0x298/0x5e0 >> [ 487.139926] ? vmballoon_migratepage+0x310/0x310 [vmw_balloon] >> [ 487.141610] ? process_one_work+0x298/0x5e0 >> [ 487.143053] worker_thread+0x41/0x400 >> [ 487.144389] kthread+0x12b/0x150 >> [ 487.145582] ? process_one_work+0x5e0/0x5e0 >> [ 487.146937] ? kthread_create_on_node+0x60/0x60 >> [ 487.148637] ret_from_fork+0x3a/0x50 >> >> Fix it by updating the PageOffline indication only when a 2MB page is >> enqueued and dequeued. The 4KB pages will be handled correctly by the >> balloon compaction logic. >> >> Fixes: 83a8afa72e9c ("vmw_balloon: Compaction support") >> Cc: David Hildenbrand >> Reported-by: Thomas Hellstrom >> Signed-off-by: Nadav Amit >> --- >> drivers/misc/vmw_balloon.c | 10 -- >> 1 file changed, 8 insertions(+), 2 deletions(-) >> >> diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c >> index 8840299420e0..5e6be1527571 100644 >> --- a/drivers/misc/vmw_balloon.c >> +++ b/drivers/misc/vmw_balloon.c >> @@ -691,7 +691,6 @@ static int vmballoon_alloc_page_list(struct vmballoon *b, >> } >> >> if (page) { >> -vmballoon_mark_page_offline(page, ctl->page_size); >> /* Success. Add the page to the list and continue. */ >> list_add(&page->lru, &ctl->pages); >> continue; >> @@ -930,7 +929,6 @@ static void vmballoon_release_page_list(struct list_head >> *page_list, >> >> list_for_each_entry_safe(page, tmp, page_list, lru) { >> list_del(&page->lru); >> -vmballoon_mark_page_online(page, page_size); >> __free_pages(page, vmballoon_page_order(page_size)); >> } >> >> @@ -1005,6 +1003,7 @@ static void vmballoon_enqueue_page_list(struct >> vmballoon *b, >> enum vmballoon_page_size_type page_size) >> { >> unsigned long flags; >> +struct page *page; >> >> if (page_
Re: [RFC PATCH v2 1/3] x86/mm/tlb: Change __flush_tlb_one_user interface
> On Aug 26, 2019, at 12:51 AM, Juergen Gross wrote: > > On 24.08.19 00:52, Nadav Amit wrote: >> __flush_tlb_one_user() currently flushes a single entry, and flushes it >> both in the kernel and user page-tables, when PTI is enabled. >> Change __flush_tlb_one_user() and related interfaces into >> __flush_tlb_range() that flushes a range and does not flush the user >> page-table. >> This refactoring is needed for the next patch, but regardless makes >> sense and has several advantages. First, only Xen-PV, which does not >> use PTI, implements the paravirtual interface of flush_tlb_one_user() so >> nothing is broken by separating the user and kernel page-table flushes, >> and the interface is more intuitive. >> Second, INVLPG can flush unrelated mappings, and it is also a >> serializing instruction. It is better to have a tight loop that flushes >> the entries. >> Third, currently __flush_tlb_one_kernel() also flushes the user >> page-tables, which is not needed. This allows to avoid this redundant >> flush. >> Cc: Boris Ostrovsky >> Cc: Juergen Gross >> Cc: Stefano Stabellini >> Cc: xen-de...@lists.xenproject.org >> Signed-off-by: Nadav Amit >> --- >> arch/x86/include/asm/paravirt.h | 5 ++-- >> arch/x86/include/asm/paravirt_types.h | 3 ++- >> arch/x86/include/asm/tlbflush.h | 24 + >> arch/x86/kernel/paravirt.c| 7 ++--- >> arch/x86/mm/tlb.c | 39 ++- >> arch/x86/xen/mmu_pv.c | 21 +-- >> 6 files changed, 62 insertions(+), 37 deletions(-) > > ... 
> >> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c >> index 48f7c7eb4dbc..ed68657f5e77 100644 >> --- a/arch/x86/xen/mmu_pv.c >> +++ b/arch/x86/xen/mmu_pv.c >> @@ -1325,22 +1325,27 @@ static noinline void xen_flush_tlb(void) >> preempt_enable(); >> } >> -static void xen_flush_tlb_one_user(unsigned long addr) >> +static void xen_flush_tlb_range(unsigned long start, unsigned long end, >> +u8 stride_shift) >> { >> struct mmuext_op *op; >> struct multicall_space mcs; >> - >> -trace_xen_mmu_flush_tlb_one_user(addr); >> +unsigned long addr; >> preempt_disable(); >> mcs = xen_mc_entry(sizeof(*op)); >> op = mcs.args; >> -op->cmd = MMUEXT_INVLPG_LOCAL; >> -op->arg1.linear_addr = addr & PAGE_MASK; >> -MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF); >> - xen_mc_issue(PARAVIRT_LAZY_MMU); >> +for (addr = start; addr < end; addr += 1ul << stride_shift) { >> +trace_xen_mmu_flush_tlb_one_user(addr); >> + >> +op->cmd = MMUEXT_INVLPG_LOCAL; >> +op->arg1.linear_addr = addr & PAGE_MASK; >> +MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF); >> + >> +xen_mc_issue(PARAVIRT_LAZY_MMU); >> +} > > For this kind of usage (a loop) you should: > > - replace the call of xen_mc_entry() with xen_mc_batch() > - use xen_extend_mmuext_op() for each loop iteration > - call xen_mc_issue() after the loop > > Additionally I'd like you to replace trace_xen_mmu_flush_tlb_one_user() > with trace_xen_mmu_flush_tlb_range() taking all three parameters and > keep it where it was (out of the loop). > > The paravirt parts seem to be okay. Thanks for the quick response. I don’t think the preempt_disable/enable() is needed in this case, but I kept them. Does the following match what you had in mind? 
--- arch/x86/xen/mmu_pv.c | 25 ++--- include/trace/events/xen.h | 18 -- 2 files changed, 26 insertions(+), 17 deletions(-) diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c index 48f7c7eb4dbc..faa01591df36 100644 --- a/arch/x86/xen/mmu_pv.c +++ b/arch/x86/xen/mmu_pv.c @@ -1325,20 +1325,23 @@ static noinline void xen_flush_tlb(void) preempt_enable(); } -static void xen_flush_tlb_one_user(unsigned long addr) +static void xen_flush_tlb_range(unsigned long start, unsigned long end, + u8 stride_shift) { - struct mmuext_op *op; - struct multicall_space mcs; - - trace_xen_mmu_flush_tlb_one_user(addr); + struct mmuext_op op; + unsigned long addr; preempt_disable(); - mcs = xen_mc_entry(sizeof(*op)); - op = mcs.args; - op->cmd = MMUEXT_INVLPG_LOCAL; - op->arg1.linear_addr = addr & PAGE_MASK; - MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF); + xen_mc_batch(); + op.cmd = MMUEXT_INVLPG_LOC
Re: [PATCH v4 1/9] smp: Run functions concurrently in smp_call_function_many()
> On Aug 23, 2019, at 3:41 PM, Nadav Amit wrote: > > Currently, on_each_cpu() and similar functions do not exploit the > potential of concurrency: the function is first executed remotely and > only then it is executed locally. Functions such as TLB flush can take > considerable time, so this provides an opportunity for performance > optimization. > > To do so, introduce __smp_call_function_many(), which allows the callers > to provide local and remote functions that should be executed, and run > them concurrently. Keep smp_call_function_many() semantic as it is today > for backward compatibility: the called function is not executed in this > case locally. > > __smp_call_function_many() does not use the optimized version for a > single remote target that smp_call_function_single() implements. For > synchronous function call, smp_call_function_single() keeps a > call_single_data (which is used for synchronization) on the stack. > Interestingly, it seems that not using this optimization provides > greater performance improvements (greater speedup with a single remote > target than with multiple ones). Presumably, holding data structures > that are intended for synchronization on the stack can introduce > overheads due to TLB misses and false-sharing when the stack is used for > other purposes. > > Adding support to run the functions concurrently required to remove a > micro-optimization in on_each_cpu() that disabled/enabled IRQs instead > of saving/restoring them. The benefit of running the local and remote > code concurrently is expected to be greater. 
> > Reviewed-by: Dave Hansen > Cc: Peter Zijlstra > Cc: Rik van Riel > Cc: Thomas Gleixner > Cc: Andy Lutomirski > Cc: Josh Poimboeuf > Signed-off-by: Nadav Amit > --- > include/linux/smp.h | 34 --- > kernel/smp.c| 138 +--- > 2 files changed, 92 insertions(+), 80 deletions(-) > > diff --git a/include/linux/smp.h b/include/linux/smp.h > index 6fc856c9eda5..d18d54199635 100644 > --- a/include/linux/smp.h > +++ b/include/linux/smp.h > @@ -32,11 +32,6 @@ extern unsigned int total_cpus; > int smp_call_function_single(int cpuid, smp_call_func_t func, void *info, >int wait); > > -/* > - * Call a function on all processors > - */ > -void on_each_cpu(smp_call_func_t func, void *info, int wait); > - > /* > * Call a function on processors specified by mask, which might include > * the local one. > @@ -44,6 +39,17 @@ void on_each_cpu(smp_call_func_t func, void *info, int > wait); > void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func, > void *info, bool wait); > > +/* > + * Call a function on all processors. May be used during early boot while > + * early_boot_irqs_disabled is set. > + */ > +static inline void on_each_cpu(smp_call_func_t func, void *info, int wait) > +{ > + preempt_disable(); > + on_each_cpu_mask(cpu_online_mask, func, info, wait); > + preempt_enable(); > +} Err.. I made this change the last minute before sending, and apparently forgot to build, since it does not build. Let me know if there is anything else with this version, though.
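The overlap the commit message describes can be illustrated with a userspace analogue (pthreads standing in for the IPI machinery; all names here are invented for the sketch): kick off the remote work first, run the local function while it is in flight, then wait — instead of running the two strictly back to back.

```c
#include <pthread.h>
#include <stdatomic.h>

struct call { void (*func)(void *); void *info; };

static _Atomic int n_done;
static void count_done(void *info) { (void)info; n_done++; }

static void *remote_trampoline(void *arg)
{
    struct call *c = arg;
    c->func(c->info);
    return NULL;
}

/* Analogue of __smp_call_function_many() with a local function: the
 * local work overlaps the "IPI latency" of the remote call instead of
 * only starting once the remote side is done. */
static void call_concurrent(struct call *remote, struct call *local)
{
    pthread_t t;

    pthread_create(&t, NULL, remote_trampoline, remote);
    local->func(local->info);  /* runs while the remote call is in flight */
    pthread_join(t, NULL);     /* the "wait" of a synchronous call */
}
```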
Re: [PATCH] x86/mm: Do not split_large_page() for set_kernel_text_rw()
> On Aug 26, 2019, at 8:56 AM, Steven Rostedt wrote: > > On Mon, 26 Aug 2019 15:41:24 +0000 > Nadav Amit wrote: > >>> Anyway, I believe Nadav has some patches that converts ftrace to use >>> the shadow page modification trick somewhere. >> >> For the record - here is my previous patch: >> https://lkml.org/lkml/2018/12/5/211 > > FYI, when referencing older patches, please use lkml.kernel.org or > lore.kernel.org, lkml.org is slow and obsolete. > > ie. http://lkml.kernel.org/r/20181205013408.47725-9-na...@vmware.com Will do so next time.
Re: [PATCH] x86/mm: Do not split_large_page() for set_kernel_text_rw()
> On Aug 26, 2019, at 4:33 AM, Steven Rostedt wrote: > > On Fri, 23 Aug 2019 11:36:37 +0200 > Peter Zijlstra wrote: > >> On Thu, Aug 22, 2019 at 10:23:35PM -0700, Song Liu wrote: >>> As 4k pages check was removed from cpa [1], set_kernel_text_rw() leads to >>> split_large_page() for all kernel text pages. This means a single kprobe >>> will put all kernel text in 4k pages: >>> >>> root@ ~# grep 8100- /sys/kernel/debug/page_tables/kernel >>> 0x8100-0x8240 20M roPSE x pmd >>> >>> root@ ~# echo ONE_KPROBE >> /sys/kernel/debug/tracing/kprobe_events >>> root@ ~# echo 1 > /sys/kernel/debug/tracing/events/kprobes/enable >>> >>> root@ ~# grep 8100- /sys/kernel/debug/page_tables/kernel >>> 0x8100-0x8240 20M ro x pte >>> >>> To fix this issue, introduce CPA_FLIP_TEXT_RW to bypass "Text RO" check >>> in static_protections(). >>> >>> Two helper functions set_text_rw() and set_text_ro() are added to flip >>> _PAGE_RW bit for kernel text. >>> >>> [1] commit 585948f4f695 ("x86/mm/cpa: Avoid the 4k pages check completely") >>> >> >> ARGH; so this is because ftrace flips the whole kernel range to RW and >> back for giggles? I'm thinking _that_ is a bug, it's a clear W^X >> violation. > > Since ftrace did this way before text_poke existed and way before > anybody cared (back in 2007), it's not really a bug. > > Anyway, I believe Nadav has some patches that converts ftrace to use > the shadow page modification trick somewhere. For the record - here is my previous patch: https://lkml.org/lkml/2018/12/5/211
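The “shadow page modification trick” referred to above can be demonstrated in userspace (a tmpfile standing in for the kernel's temporary alias mapping; the function name is invented): instead of making the read-only view writable — the W^X violation — the same backing page is mapped a second time, writably, and patched through that alias.

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Patch a read-only view of a page through a second, writable mapping
 * of the same backing memory; the RO view itself never becomes RW. */
static char *patch_through_alias(const char *text)
{
    FILE *f = tmpfile();               /* anonymous backing "page"   */
    int fd = fileno(f);
    char *ro, *rw;

    ftruncate(fd, 4096);
    ro = mmap(NULL, 4096, PROT_READ,  MAP_SHARED, fd, 0);
    rw = mmap(NULL, 4096, PROT_WRITE, MAP_SHARED, fd, 0);

    strcpy(rw, text);                  /* write via the shadow alias  */
    munmap(rw, 4096);                  /* ...then drop the alias      */
    return ro;                         /* RO view observes the bytes  */
}
```

This is the same idea text_poke() uses for kernel text, with a temporary kernel mapping playing the role of the writable alias.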
[RFC PATCH v2 1/3] x86/mm/tlb: Change __flush_tlb_one_user interface
__flush_tlb_one_user() currently flushes a single entry, and flushes it both in the kernel and user page-tables, when PTI is enabled. Change __flush_tlb_one_user() and related interfaces into __flush_tlb_range() that flushes a range and does not flush the user page-table. This refactoring is needed for the next patch, but regardless makes sense and has several advantages. First, only Xen-PV, which does not use PTI, implements the paravirtual interface of flush_tlb_one_user() so nothing is broken by separating the user and kernel page-table flushes, and the interface is more intuitive. Second, INVLPG can flush unrelated mappings, and it is also a serializing instruction. It is better to have a tight loop that flushes the entries. Third, currently __flush_tlb_one_kernel() also flushes the user page-tables, which is not needed. This allows to avoid this redundant flush. Cc: Boris Ostrovsky Cc: Juergen Gross Cc: Stefano Stabellini Cc: xen-de...@lists.xenproject.org Signed-off-by: Nadav Amit --- arch/x86/include/asm/paravirt.h | 5 ++-- arch/x86/include/asm/paravirt_types.h | 3 ++- arch/x86/include/asm/tlbflush.h | 24 + arch/x86/kernel/paravirt.c| 7 ++--- arch/x86/mm/tlb.c | 39 ++- arch/x86/xen/mmu_pv.c | 21 +-- 6 files changed, 62 insertions(+), 37 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index bc4829c9b3f9..7347328eacd3 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -57,9 +57,10 @@ static inline void __flush_tlb_global(void) PVOP_VCALL0(mmu.flush_tlb_kernel); } -static inline void __flush_tlb_one_user(unsigned long addr) +static inline void __flush_tlb_range(unsigned long start, unsigned long end, +u8 stride_shift) { - PVOP_VCALL1(mmu.flush_tlb_one_user, addr); + PVOP_VCALL3(mmu.flush_tlb_range, start, end, stride_shift); } static inline void flush_tlb_multi(const struct cpumask *cpumask, diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h 
index 63fa751344bf..a87a5f236251 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -205,7 +205,8 @@ struct pv_mmu_ops { /* TLB operations */ void (*flush_tlb_user)(void); void (*flush_tlb_kernel)(void); - void (*flush_tlb_one_user)(unsigned long addr); + void (*flush_tlb_range)(unsigned long start, unsigned long end, + u8 stride_shift); void (*flush_tlb_multi)(const struct cpumask *cpus, const struct flush_tlb_info *info); diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 1f88ea410ff3..421bc82504e2 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -145,7 +145,7 @@ static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid) #else #define __flush_tlb() __native_flush_tlb() #define __flush_tlb_global() __native_flush_tlb_global() -#define __flush_tlb_one_user(addr) __native_flush_tlb_one_user(addr) +#define __flush_tlb_range(start, end, stride_shift) __native_flush_tlb_range(start, end, stride_shift) #endif struct tlb_context { @@ -454,23 +454,13 @@ static inline void __native_flush_tlb_global(void) /* * flush one page in the user mapping */ -static inline void __native_flush_tlb_one_user(unsigned long addr) +static inline void __native_flush_tlb_range(unsigned long start, + unsigned long end, u8 stride_shift) { - u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid); + unsigned long addr; - asm volatile("invlpg (%0)" ::"r" (addr) : "memory"); - - if (!static_cpu_has(X86_FEATURE_PTI)) - return; - - /* -* Some platforms #GP if we call invpcid(type=1/2) before CR4.PCIDE=1. -* Just use invalidate_user_asid() in case we are called early. 
-*/ - if (!this_cpu_has(X86_FEATURE_INVPCID_SINGLE)) - invalidate_user_asid(loaded_mm_asid); - else - invpcid_flush_one(user_pcid(loaded_mm_asid), addr); + for (addr = start; addr < end; addr += 1ul << stride_shift) + asm volatile("invlpg (%0)" ::"r" (addr) : "memory"); } /* @@ -512,7 +502,7 @@ static inline void __flush_tlb_one_kernel(unsigned long addr) * kernel address space and for its usermode counterpart, but it does * not flush it for other address spaces. */ - __flush_tlb_one_user(addr); + __flush_tlb_range(addr, addr + PAGE_SIZE, PAGE_SHIFT); if (!static_cpu_has(X86_FEATURE_PTI)) return; diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index 5520a04c
[RFC PATCH v2 3/3] x86/mm/tlb: Avoid deferring PTI flushes on shootdown
When a shootdown is initiated, the initiating CPU has cycles to burn as it waits for the responding CPUs to receive the IPI and acknowledge it. In these cycles it is better to flush the user page-tables using INVPCID, instead of deferring the TLB flush. The best way to figure out whether there are cycles to burn is arguably to expose from the SMP layer when an acknowledgment is received. However, this would break some abstractions. Instead, use a simpler solution: the initiating CPU of a TLB shootdown would not defer PTI flushes. It is not always a win, relative to deferring user page-table flushes, but it prevents performance regression. Signed-off-by: Nadav Amit --- arch/x86/include/asm/tlbflush.h | 1 + arch/x86/mm/tlb.c | 10 +- 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index da56aa3ccd07..066b3804f876 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -573,6 +573,7 @@ struct flush_tlb_info { unsigned int initiating_cpu; u8 stride_shift; u8 freed_tables; + u8 shootdown; }; #define local_flush_tlb() __flush_tlb() diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 31260c55d597..ba50430275d4 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -592,8 +592,13 @@ static void flush_tlb_user_pt_range(u16 asid, const struct flush_tlb_info *f) /* * We can defer flushes as long as page-tables were not freed. +* +* However, if there is a shootdown the initiating CPU has cycles to +* spare, while it waits for the other cores to respond. In this case, +* deferring the flushing can cause overheads, so avoid it. 
*/ - if (IS_ENABLED(CONFIG_X86_64) && !f->freed_tables) { + if (IS_ENABLED(CONFIG_X86_64) && !f->freed_tables && + (!f->shootdown || f->initiating_cpu != smp_processor_id())) { flush_user_tlb_deferred(asid, start, end, stride_shift); return; } @@ -861,6 +866,7 @@ static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm, info->freed_tables = freed_tables; info->new_tlb_gen = new_tlb_gen; info->initiating_cpu= smp_processor_id(); + info->shootdown = false; return info; } @@ -903,6 +909,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, * flush_tlb_func_local() directly in this case. */ if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) { + info->shootdown = true; flush_tlb_multi(mm_cpumask(mm), info); } else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) { lockdep_assert_irqs_enabled(); @@ -970,6 +977,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch) * flush_tlb_func_local() directly in this case. */ if (cpumask_any_but(>cpumask, cpu) < nr_cpu_ids) { + info->shootdown = true; flush_tlb_multi(>cpumask, info); } else if (cpumask_test_cpu(cpu, >cpumask)) { lockdep_assert_irqs_enabled(); -- 2.17.1
[RFC PATCH v2 2/3] x86/mm/tlb: Defer PTI flushes
INVPCID is considerably slower than INVLPG of a single PTE. Using it to flush the user page-tables when PTI is enabled therefore introduces significant overhead. Instead, unless page-tables are released, it is possible to defer the flushing of the user page-tables until the time the code returns to userspace. These page tables are not in use, so deferring the flushes is not a security hazard. When CR3 is loaded, as part of returning to userspace, use INVLPG to flush the relevant PTEs. Use LFENCE to prevent speculative executions that skip INVLPG. There are some caveats, which sometimes require a full TLB flush of the user page-tables. There are some (uncommon) code-paths that reload CR3 in which there is no stack. If a context-switch happens and there are pending flushes, tracking which TLB flushes are later needed is complicated and expensive. If there are multiple TLB flushes of different ranges before the kernel returns to userspace, the overhead of tracking them can exceed the benefit. In these cases, perform a full TLB flush. It is possible to avoid them in some cases, but the benefit in doing so is questionable. 
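The "multiple flushes" caveat translates into bookkeeping along these lines — a simplified model of the user_flush_{start,end,stride_shift} per-CPU fields the patch adds; the struct and function names here are illustrative:

```c
#include <stdint.h>

#define TLB_FLUSH_ALL (~0UL)

/* One pending user page-table flush per CPU. */
struct deferred_flush {
    unsigned long start, end;   /* end == 0: nothing pending */
    uint8_t stride_shift;
};

/* Record a deferred flush.  A second, different flush arriving before
 * the return to userspace is not tracked separately: the record is
 * widened to a full TLB flush, since tracking individual ranges would
 * cost more than it saves. */
static void defer_user_flush(struct deferred_flush *d, unsigned long start,
                             unsigned long end, uint8_t stride_shift)
{
    if (d->end == 0) {
        d->start = start;
        d->end = end;
        d->stride_shift = stride_shift;
    } else if (d->start != start || d->end != end ||
               d->stride_shift != stride_shift) {
        d->start = 0;
        d->end = TLB_FLUSH_ALL;   /* promote to a full flush */
    }
}
```

The assembly in the patch then walks the recorded range with INVLPG on the return path, or does a full flush when the record says TLB_FLUSH_ALL.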
Signed-off-by: Nadav Amit --- arch/x86/entry/calling.h| 52 ++-- arch/x86/include/asm/tlbflush.h | 30 +++--- arch/x86/kernel/asm-offsets.c | 3 ++ arch/x86/mm/tlb.c | 70 + 4 files changed, 147 insertions(+), 8 deletions(-) diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h index 515c0ceeb4a3..a4d46416853d 100644 --- a/arch/x86/entry/calling.h +++ b/arch/x86/entry/calling.h @@ -6,6 +6,7 @@ #include #include #include +#include /* @@ -205,7 +206,16 @@ For 32-bit we have the following conventions - kernel is built with #define THIS_CPU_user_pcid_flush_mask \ PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_pcid_flush_mask -.macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req +#define THIS_CPU_user_flush_start \ + PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_flush_start + +#define THIS_CPU_user_flush_end\ + PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_flush_end + +#define THIS_CPU_user_flush_stride_shift \ + PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_flush_stride_shift + +.macro SWITCH_TO_USER_CR3 scratch_reg:req scratch_reg2:req has_stack:req ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI mov %cr3, \scratch_reg @@ -221,9 +231,41 @@ For 32-bit we have the following conventions - kernel is built with /* Flush needed, clear the bit */ btr \scratch_reg, THIS_CPU_user_pcid_flush_mask +.if \has_stack + cmpq$(TLB_FLUSH_ALL), THIS_CPU_user_flush_end + jnz .Lpartial_flush_\@ +.Ldo_full_flush_\@: +.endif movq\scratch_reg2, \scratch_reg jmp .Lwrcr3_pcid_\@ - +.if \has_stack +.Lpartial_flush_\@: + /* Prepare CR3 with PGD of user, and no flush set */ + orq $(PTI_USER_PGTABLE_AND_PCID_MASK), \scratch_reg2 + SET_NOFLUSH_BIT \scratch_reg2 + pushq %rsi + pushq %rbx + pushq %rcx + movbTHIS_CPU_user_flush_stride_shift, %cl + movq$1, %rbx + shl %cl, %rbx + movqTHIS_CPU_user_flush_start, %rsi + movqTHIS_CPU_user_flush_end, %rcx + /* Load the new cr3 and flush */ + mov \scratch_reg2, %cr3 +.Lflush_loop_\@: + invlpg (%rsi) + addq%rbx, %rsi + cmpq%rsi, %rcx + ja 
.Lflush_loop_\@ + /* Prevent speculatively skipping flushes */ + lfence + + popq%rcx + popq%rbx + popq%rsi + jmp .Lend_\@ +.endif .Lnoflush_\@: movq\scratch_reg2, \scratch_reg SET_NOFLUSH_BIT \scratch_reg @@ -239,9 +281,13 @@ For 32-bit we have the following conventions - kernel is built with .Lend_\@: .endm +.macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req + SWITCH_TO_USER_CR3 scratch_reg=\scratch_reg scratch_reg2=%rax has_stack=0 +.endm + .macro SWITCH_TO_USER_CR3_STACKscratch_reg:req pushq %rax - SWITCH_TO_USER_CR3_NOSTACK scratch_reg=\scratch_reg scratch_reg2=%rax + SWITCH_TO_USER_CR3 scratch_reg=\scratch_reg scratch_reg2=%rax has_stack=1 popq%rax .endm diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 421bc82504e2..da56aa3ccd07 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -2,6 +2,10 @@ #ifndef _ASM_X86_TLBFLUSH_H #define _ASM_X86_TLBFLUSH_H +#define TLB_FLUSH_ALL -1UL + +#ifndef __ASSEMBLY__ + #include #include @@ -222,6 +226,10 @@ struct tlb_state { * context 0. */ struct tlb_context ctxs[TLB_NR_DYN_ASIDS]; + + unsigned long user_flush_start; + unsigned long user_flush_end; + unsigned long user_flush_stride_shift; }; DECLARE_PER_CPU_ALIGNED(struct tlb
[RFC PATCH v2 0/3] x86/mm/tlb: Defer TLB flushes with PTI
INVPCID is considerably slower than INVLPG of a single PTE, but it is currently used to flush PTEs in the user page-table when PTI is used.

Instead, it is possible to defer TLB flushes until after the user page-tables are loaded. Preventing speculation over the TLB flushes should keep the whole thing safe. In some cases, deferring TLB flushes in such a way can result in more full TLB flushes, but arguably this behavior is oftentimes beneficial.

These patches are based and evaluated on top of the concurrent TLB-flushes v4 patch-set.

I will provide more results later, but it might be easier to look at the time an isolated TLB flush takes. These numbers are from Skylake, showing the number of cycles that running madvise(DONTNEED), which results in local TLB flushes, takes:

n_pages   concurrent   +deferred-pti   change
-------   ----------   -------------   ------
1         2119         1986            -6.7%
10        6791         5417            -20%

Please let me know if I missed something that affects security or performance.

[ Yes, I know there is another pending RFC for async TLB flushes, but I
  think it might be easier to merge this one first ]

RFC v1 -> RFC v2:
  * Wrong patches were sent before

Nadav Amit (3):
  x86/mm/tlb: Change __flush_tlb_one_user interface
  x86/mm/tlb: Defer PTI flushes
  x86/mm/tlb: Avoid deferring PTI flushes on shootdown

 arch/x86/entry/calling.h              | 52 +++--
 arch/x86/include/asm/paravirt.h       | 5 +-
 arch/x86/include/asm/paravirt_types.h | 3 +-
 arch/x86/include/asm/tlbflush.h       | 55 +++-
 arch/x86/kernel/asm-offsets.c         | 3 +
 arch/x86/kernel/paravirt.c            | 7 +-
 arch/x86/mm/tlb.c                     | 117 --
 arch/x86/xen/mmu_pv.c                 | 21 +++--
 8 files changed, 218 insertions(+), 45 deletions(-)

--
2.17.1
Re: [RFC PATCH 0/3] x86/mm/tlb: Defer TLB flushes with PTI
Sorry, I made a mistake and included the wrong patches. I will send RFC v2 in few minutes. > On Aug 23, 2019, at 3:46 PM, Nadav Amit wrote: > > INVPCID is considerably slower than INVLPG of a single PTE, but it is > currently used to flush PTEs in the user page-table when PTI is used. > > Instead, it is possible to defer TLB flushes until after the user > page-tables are loaded. Preventing speculation over the TLB flushes > should keep the whole thing safe. In some cases, deferring TLB flushes > in such a way can result in more full TLB flushes, but arguably this > behavior is oftentimes beneficial. > > These patches are based and evaluated on top of the concurrent > TLB-flushes v4 patch-set. > > I will provide more results later, but it might be easier to look at the > time an isolated TLB flush takes. These numbers are from skylake, > showing the number of cycles that running madvise(DONTNEED) which > results in local TLB flushes takes: > > n_pages concurrent +deferred-pti change > --- -- - -- > 1 21191986-6.7% > 1067915417 -20% > > Please let me know if I missed something that affects security or > performance. > > [ Yes, I know there is another pending RFC for async TLB flushes, but I > think it might be easier to merge this one first ] > > Nadav Amit (3): > x86/mm/tlb: Defer PTI flushes > x86/mm/tlb: Avoid deferring PTI flushes on shootdown > x86/mm/tlb: Use lockdep irq assertions > > arch/x86/entry/calling.h| 52 +++-- > arch/x86/include/asm/tlbflush.h | 31 ++-- > arch/x86/kernel/asm-offsets.c | 3 ++ > arch/x86/mm/tlb.c | 83 +++-- > 4 files changed, 158 insertions(+), 11 deletions(-) > > -- > 2.17.1
[RFC PATCH 1/3] x86/mm/tlb: Defer PTI flushes
INVPCID is considerably slower than INVLPG of a single PTE. Using it to flush the user page-tables when PTI is enabled therefore introduces significant overhead.

Instead, unless page-tables are released, it is possible to defer the flushing of the user page-tables until the time the code returns to userspace. These page tables are not in use, so deferring their flushes is not a security hazard. When CR3 is loaded, as part of returning to userspace, use INVLPG to flush the relevant PTEs. Use LFENCE to prevent speculative executions that skip INVLPG.

There are some caveats, which sometimes require a full TLB flush of the user page-tables. There are some (uncommon) code-paths that reload CR3 in which there is no stack. If a context-switch happens and there are pending flushes, tracking which TLB flushes are later needed is complicated and expensive. If there are multiple TLB flushes of different ranges before the kernel returns to userspace, the overhead of tracking them can exceed the benefit.

In these cases, perform a full TLB flush. It is possible to avoid them in some cases, but the benefit in doing so is questionable.
Signed-off-by: Nadav Amit --- arch/x86/entry/calling.h| 52 ++-- arch/x86/include/asm/tlbflush.h | 30 +++--- arch/x86/kernel/asm-offsets.c | 3 ++ arch/x86/mm/tlb.c | 70 + 4 files changed, 147 insertions(+), 8 deletions(-) diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h index 515c0ceeb4a3..a4d46416853d 100644 --- a/arch/x86/entry/calling.h +++ b/arch/x86/entry/calling.h @@ -6,6 +6,7 @@ #include #include #include +#include /* @@ -205,7 +206,16 @@ For 32-bit we have the following conventions - kernel is built with #define THIS_CPU_user_pcid_flush_mask \ PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_pcid_flush_mask -.macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req +#define THIS_CPU_user_flush_start \ + PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_flush_start + +#define THIS_CPU_user_flush_end\ + PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_flush_end + +#define THIS_CPU_user_flush_stride_shift \ + PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_flush_stride_shift + +.macro SWITCH_TO_USER_CR3 scratch_reg:req scratch_reg2:req has_stack:req ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI mov %cr3, \scratch_reg @@ -221,9 +231,41 @@ For 32-bit we have the following conventions - kernel is built with /* Flush needed, clear the bit */ btr \scratch_reg, THIS_CPU_user_pcid_flush_mask +.if \has_stack + cmpq$(TLB_FLUSH_ALL), THIS_CPU_user_flush_end + jnz .Lpartial_flush_\@ +.Ldo_full_flush_\@: +.endif movq\scratch_reg2, \scratch_reg jmp .Lwrcr3_pcid_\@ - +.if \has_stack +.Lpartial_flush_\@: + /* Prepare CR3 with PGD of user, and no flush set */ + orq $(PTI_USER_PGTABLE_AND_PCID_MASK), \scratch_reg2 + SET_NOFLUSH_BIT \scratch_reg2 + pushq %rsi + pushq %rbx + pushq %rcx + movbTHIS_CPU_user_flush_stride_shift, %cl + movq$1, %rbx + shl %cl, %rbx + movqTHIS_CPU_user_flush_start, %rsi + movqTHIS_CPU_user_flush_end, %rcx + /* Load the new cr3 and flush */ + mov \scratch_reg2, %cr3 +.Lflush_loop_\@: + invlpg (%rsi) + addq%rbx, %rsi + cmpq%rsi, %rcx + ja 
.Lflush_loop_\@ + /* Prevent speculatively skipping flushes */ + lfence + + popq%rcx + popq%rbx + popq%rsi + jmp .Lend_\@ +.endif .Lnoflush_\@: movq\scratch_reg2, \scratch_reg SET_NOFLUSH_BIT \scratch_reg @@ -239,9 +281,13 @@ For 32-bit we have the following conventions - kernel is built with .Lend_\@: .endm +.macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req + SWITCH_TO_USER_CR3 scratch_reg=\scratch_reg scratch_reg2=%rax has_stack=0 +.endm + .macro SWITCH_TO_USER_CR3_STACKscratch_reg:req pushq %rax - SWITCH_TO_USER_CR3_NOSTACK scratch_reg=\scratch_reg scratch_reg2=%rax + SWITCH_TO_USER_CR3 scratch_reg=\scratch_reg scratch_reg2=%rax has_stack=1 popq%rax .endm diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 421bc82504e2..da56aa3ccd07 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -2,6 +2,10 @@ #ifndef _ASM_X86_TLBFLUSH_H #define _ASM_X86_TLBFLUSH_H +#define TLB_FLUSH_ALL -1UL + +#ifndef __ASSEMBLY__ + #include #include @@ -222,6 +226,10 @@ struct tlb_state { * context 0. */ struct tlb_context ctxs[TLB_NR_DYN_ASIDS]; + + unsigned long user_flush_start; + unsigned long user_flush_end; + unsigned long user_flush_stride_shift; }; DECLARE_PER_CPU_ALIGNED(struct tlb
[RFC PATCH 2/3] x86/mm/tlb: Avoid deferring PTI flushes on shootdown
When a shootdown is initiated, the initiating CPU has cycles to burn as it waits for the responding CPUs to receive the IPI and acknowledge it. In these cycles it is better to flush the user page-tables using INVPCID, instead of deferring the TLB flush.

The best way to figure out whether there are cycles to burn is arguably to expose from the SMP layer when an acknowledgment is received. However, this would break some abstractions.

Instead, use a simpler solution: the initiating CPU of a TLB shootdown would not defer PTI flushes. It is not always a win, relative to deferring user page-table flushes, but it prevents performance regression.

Signed-off-by: Nadav Amit
---
 arch/x86/include/asm/tlbflush.h | 1 +
 arch/x86/mm/tlb.c | 10 +-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index da56aa3ccd07..066b3804f876 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -573,6 +573,7 @@ struct flush_tlb_info { unsigned int initiating_cpu; u8 stride_shift; u8 freed_tables; + u8 shootdown; }; #define local_flush_tlb() __flush_tlb() diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 31260c55d597..ba50430275d4 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -592,8 +592,13 @@ static void flush_tlb_user_pt_range(u16 asid, const struct flush_tlb_info *f) /* * We can defer flushes as long as page-tables were not freed. +* +* However, if there is a shootdown the initiating CPU has cycles to +* spare, while it waits for the other cores to respond. In this case, +* deferring the flushing can cause overheads, so avoid it. 
*/ - if (IS_ENABLED(CONFIG_X86_64) && !f->freed_tables) { + if (IS_ENABLED(CONFIG_X86_64) && !f->freed_tables && + (!f->shootdown || f->initiating_cpu != smp_processor_id())) { flush_user_tlb_deferred(asid, start, end, stride_shift); return; } @@ -861,6 +866,7 @@ static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm, info->freed_tables = freed_tables; info->new_tlb_gen = new_tlb_gen; info->initiating_cpu= smp_processor_id(); + info->shootdown = false; return info; } @@ -903,6 +909,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, * flush_tlb_func_local() directly in this case. */ if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) { + info->shootdown = true; flush_tlb_multi(mm_cpumask(mm), info); } else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) { lockdep_assert_irqs_enabled(); @@ -970,6 +977,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch) * flush_tlb_func_local() directly in this case. */ if (cpumask_any_but(>cpumask, cpu) < nr_cpu_ids) { + info->shootdown = true; flush_tlb_multi(>cpumask, info); } else if (cpumask_test_cpu(cpu, >cpumask)) { lockdep_assert_irqs_enabled(); -- 2.17.1
[RFC PATCH 0/3] x86/mm/tlb: Defer TLB flushes with PTI
INVPCID is considerably slower than INVLPG of a single PTE, but it is currently used to flush PTEs in the user page-table when PTI is used.

Instead, it is possible to defer TLB flushes until after the user page-tables are loaded. Preventing speculation over the TLB flushes should keep the whole thing safe. In some cases, deferring TLB flushes in such a way can result in more full TLB flushes, but arguably this behavior is oftentimes beneficial.

These patches are based and evaluated on top of the concurrent TLB-flushes v4 patch-set.

I will provide more results later, but it might be easier to look at the time an isolated TLB flush takes. These numbers are from Skylake, showing the number of cycles that running madvise(DONTNEED), which results in local TLB flushes, takes:

n_pages   concurrent   +deferred-pti   change
-------   ----------   -------------   ------
1         2119         1986            -6.7%
10        6791         5417            -20%

Please let me know if I missed something that affects security or performance.

[ Yes, I know there is another pending RFC for async TLB flushes, but I
  think it might be easier to merge this one first ]

Nadav Amit (3):
  x86/mm/tlb: Defer PTI flushes
  x86/mm/tlb: Avoid deferring PTI flushes on shootdown
  x86/mm/tlb: Use lockdep irq assertions

 arch/x86/entry/calling.h        | 52 +++--
 arch/x86/include/asm/tlbflush.h | 31 ++--
 arch/x86/kernel/asm-offsets.c   | 3 ++
 arch/x86/mm/tlb.c               | 83 +++--
 4 files changed, 158 insertions(+), 11 deletions(-)

--
2.17.1
[RFC PATCH 3/3] x86/mm/tlb: Use lockdep irq assertions
The assertions that check whether IRQs are disabled currently depend on different debug features. Instead, use lockdep_assert_irqs_disabled(), which is standard, enabled by the same debug feature, and provides more information upon failure.

Signed-off-by: Nadav Amit
---
 arch/x86/mm/tlb.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index ba50430275d4..6f4ce02e2c5b 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -293,8 +293,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, */ /* We don't want flush_tlb_func() to run concurrently with us. */ - if (IS_ENABLED(CONFIG_PROVE_LOCKING)) - WARN_ON_ONCE(!irqs_disabled()); + lockdep_assert_irqs_disabled(); /* * Verify that CR3 is what we think it is. This will catch @@ -643,7 +642,7 @@ static void flush_tlb_func(void *info) unsigned long nr_invalidate = 0; /* This code cannot presently handle being reentered. */ - VM_WARN_ON(!irqs_disabled()); + lockdep_assert_irqs_disabled(); if (!local) { inc_irq_stat(irq_tlb_count); -- 2.17.1
[PATCH 3/7] x86/percpu: Use C for percpu accesses when possible
The percpu code mostly uses inline assembly. Using segment qualifiers allows to use C code instead, which enables the compiler to perform various optimizations (e.g., CSE). For example, in __schedule() the following two instructions: mov%gs:0x7e5f1eff(%rip),%edx# 0x10350 movslq %edx,%rdx Turn with this patch into: movslq %gs:0x7e5f2e6e(%rip),%rax# 0x10350 In addition, operations that have no guarantee against concurrent interrupts or preemption, such as __this_cpu_cmpxchg() can be further optimized by the compiler when they are implemented in C, for example in call_timer_fn(). Signed-off-by: Nadav Amit --- arch/x86/include/asm/percpu.h | 115 +++--- 1 file changed, 105 insertions(+), 10 deletions(-) diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h index 1fe348884477..13987f9bc82f 100644 --- a/arch/x86/include/asm/percpu.h +++ b/arch/x86/include/asm/percpu.h @@ -439,13 +439,88 @@ do { \ */ #define this_cpu_read_stable(var) percpu_stable_op("mov", var) +#if USE_X86_SEG_SUPPORT + +#define __raw_cpu_read(qual, pcp) \ +({ \ + *(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp)); \ +}) + +#define __raw_cpu_write(qual, pcp, val) \ + do {\ + *(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp)) = (val); \ + } while (0) + +/* + * Performance-wise, C operations are only more efficient than their inline + * assembly counterparts for non-volatile variables (__this_*) and for volatile + * loads and stores. 
+ * + * Since we do not use assembly, we are free to define 64-bit operations + * on 32-bit architecture + */ +#define __raw_cpu_add(pcp, val)do { __my_cpu_var(pcp) += (val); } while (0) +#define __raw_cpu_and(pcp, val)do { __my_cpu_var(pcp) &= (val); } while (0) +#define __raw_cpu_or(pcp, val) do { __my_cpu_var(pcp) |= (val); } while (0) +#define __raw_cpu_add_return(pcp, val) ({ __my_cpu_var(pcp) += (val); }) + +#define __raw_cpu_xchg(pcp, val) \ +({ \ + typeof(pcp) pxo_ret__ = __my_cpu_var(pcp); \ + \ + __my_cpu_var(pcp) = (val); \ + pxo_ret__; \ +}) + +#define __raw_cpu_cmpxchg(pcp, oval, nval) \ +({ \ + __my_cpu_type(pcp) *__p = __my_cpu_ptr(&(pcp)); \ + \ + typeof(pcp) __ret = *__p; \ + \ + if (__ret == (oval))\ + *__p = nval;\ + __ret; \ +}) + +#define raw_cpu_read_1(pcp)__raw_cpu_read(, pcp) +#define raw_cpu_read_2(pcp)__raw_cpu_read(, pcp) +#define raw_cpu_read_4(pcp)__raw_cpu_read(, pcp) +#define raw_cpu_write_1(pcp, val) __raw_cpu_write(, pcp, val) +#define raw_cpu_write_2(pcp, val) __raw_cpu_write(, pcp, val) +#define raw_cpu_write_4(pcp, val) __raw_cpu_write(, pcp, val) +#define raw_cpu_add_1(pcp, val)__raw_cpu_add(pcp, val) +#define raw_cpu_add_2(pcp, val)__raw_cpu_add(pcp, val) +#define raw_cpu_add_4(pcp, val)__raw_cpu_add(pcp, val) +#define raw_cpu_and_1(pcp, val)__raw_cpu_and(pcp, val) +#define raw_cpu_and_2(pcp, val)__raw_cpu_and(pcp, val) +#define raw_cpu_and_4(pcp, val)__raw_cpu_and(pcp, val) +#define raw_cpu_or_1(pcp, val) __raw_cpu_or(pcp, val) +#define raw_cpu_or_2(pcp, val) __raw_cpu_or(pcp, val) +#define raw_cpu_or_4(pcp, val) __raw_cpu_or(pcp, val) +#define raw_cpu_xchg_1(pcp, val) __raw_cpu_xchg(pcp, val) +#define raw_cpu_xchg_2(pcp, val) __raw_cpu_xchg(pcp, val) +#define raw_cpu_xchg_4(pcp, val) __raw_cpu_xchg(pcp, val) +#define raw_cpu_add_return_1(pcp, val) __raw_cpu_add_return(pcp, val) +#define raw_cpu_add_return_2(pcp, val) __raw_cpu_add_return(pcp, val) +#define raw_cpu_add_return_4(pcp, val) __raw_cpu_add_return(pcp, val) +#define 
raw_cpu_add_return_8(pcp, val) __raw_cpu_add_return(pcp, val) +#define raw_cpu_cmpxchg_1(pcp, oval, nval) __raw_cp
[PATCH 5/7] percpu: Assume preemption is disabled on per_cpu_ptr()
When per_cpu_ptr() is used, the caller should have preemption disabled, as otherwise the pointer is meaningless. If the user wants an "unstable" pointer, raw_cpu_ptr() should be called instead.

Add an assertion to check that preemption is indeed disabled, and distinguish between the two cases to allow further, per-arch optimizations.

Signed-off-by: Nadav Amit
---
 include/asm-generic/percpu.h | 12
 include/linux/percpu-defs.h | 33 -
 2 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/percpu.h b/include/asm-generic/percpu.h index c2de013b2cf4..7853605f4210 100644 --- a/include/asm-generic/percpu.h +++ b/include/asm-generic/percpu.h @@ -36,6 +36,14 @@ extern unsigned long __per_cpu_offset[NR_CPUS]; #define my_cpu_offset __my_cpu_offset #endif +/* + * Determine the offset of the current active processor when preemption is + * disabled. Can be overridden by arch code. +*/ +#ifndef __raw_my_cpu_offset +#define __raw_my_cpu_offset __my_cpu_offset +#endif + /* * Arch may define arch_raw_cpu_ptr() to provide more efficient address * translations for raw_cpu_ptr(). 
@@ -44,6 +52,10 @@ extern unsigned long __per_cpu_offset[NR_CPUS]; #define arch_raw_cpu_ptr(ptr) SHIFT_PERCPU_PTR(ptr, __my_cpu_offset) #endif +#ifndef arch_raw_cpu_ptr_preemptable +#define arch_raw_cpu_ptr_preemptable(ptr) SHIFT_PERCPU_PTR(ptr, __raw_my_cpu_offset) +#endif + #ifdef CONFIG_HAVE_SETUP_PER_CPU_AREA extern void setup_per_cpu_areas(void); #endif diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h index a6fabd865211..13afca8a37e7 100644 --- a/include/linux/percpu-defs.h +++ b/include/linux/percpu-defs.h @@ -237,20 +237,51 @@ do { \ SHIFT_PERCPU_PTR((ptr), per_cpu_offset((cpu))); \ }) +#ifndef arch_raw_cpu_ptr_preemption_disabled +#define arch_raw_cpu_ptr_preemption_disabled(ptr) \ + arch_raw_cpu_ptr(ptr) +#endif + +#define raw_cpu_ptr_preemption_disabled(ptr) \ +({ \ + __verify_pcpu_ptr(ptr); \ + arch_raw_cpu_ptr_preemption_disabled(ptr); \ +}) + +/* + * If preemption is enabled, we need to read the pointer atomically on + * raw_cpu_ptr(). However if it is disabled, we can use the + * raw_cpu_ptr_nopreempt(), which is potentially more efficient. Similarly, we + * can use the preemption-disabled version if the kernel is non-preemptable or + * if voluntary preemption is used. + */ +#ifdef CONFIG_PREEMPT + #define raw_cpu_ptr(ptr) \ ({ \ __verify_pcpu_ptr(ptr); \ arch_raw_cpu_ptr(ptr); \ }) +#else + +#define raw_cpu_ptr(ptr) raw_cpu_ptr_preemption_disabled(ptr) + +#endif + #ifdef CONFIG_DEBUG_PREEMPT +/* + * Unlike other this_cpu_* operations, this_cpu_ptr() requires that preemption + * will be disabled. In contrast, raw_cpu_ptr() does not require that. + */ #define this_cpu_ptr(ptr) \ ({ \ + __this_cpu_preempt_check("ptr");\ __verify_pcpu_ptr(ptr); \ SHIFT_PERCPU_PTR(ptr, my_cpu_offset); \ }) #else -#define this_cpu_ptr(ptr) raw_cpu_ptr(ptr) +#define this_cpu_ptr(ptr) raw_cpu_ptr_preemption_disabled(ptr) #endif #else /* CONFIG_SMP */ -- 2.17.1
[PATCH 1/7] compiler: Report x86 segment support
GCC v6+ supports x86 segment qualifiers (__seg_gs and __seg_fs). Define COMPILER_HAS_X86_SEG_SUPPORT when it is supported. Signed-off-by: Nadav Amit --- include/linux/compiler-gcc.h | 4 1 file changed, 4 insertions(+) diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h index d7ee4c6bad48..5967590a18c6 100644 --- a/include/linux/compiler-gcc.h +++ b/include/linux/compiler-gcc.h @@ -149,6 +149,10 @@ #define COMPILER_HAS_GENERIC_BUILTIN_OVERFLOW 1 #endif +#if GCC_VERSION >= 6 +#define COMPILER_HAS_X86_SEG_SUPPORT 1 +#endif + /* * Turn individual warnings and errors on and off locally, depending * on version. -- 2.17.1
[PATCH 4/7] x86: Fix possible caching of current_task
this_cpu_read_stable() is allowed and supposed to cache and return the same value, specifically for current_task. It actually does not cache current_task very well, which hinders possible invalid caching when the task is switched in __switch_to(). Fix the possible caching by avoiding the use of current in __switch_to()'s dynamic extent. Signed-off-by: Nadav Amit --- arch/x86/include/asm/fpu/internal.h| 7 --- arch/x86/include/asm/resctrl_sched.h | 14 +++--- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 4 ++-- arch/x86/kernel/process_32.c | 4 ++-- arch/x86/kernel/process_64.c | 4 ++-- 5 files changed, 17 insertions(+), 16 deletions(-) diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h index 4c95c365058a..b537788600fe 100644 --- a/arch/x86/include/asm/fpu/internal.h +++ b/arch/x86/include/asm/fpu/internal.h @@ -588,9 +588,10 @@ static inline void switch_fpu_prepare(struct fpu *old_fpu, int cpu) /* * Load PKRU from the FPU context if available. Delay loading of the - * complete FPU state until the return to userland. + * complete FPU state until the return to userland. Avoid reading current during + * switch. */ -static inline void switch_fpu_finish(struct fpu *new_fpu) +static inline void switch_fpu_finish(struct task_struct *task, struct fpu *new_fpu) { u32 pkru_val = init_pkru_value; struct pkru_state *pk; @@ -598,7 +599,7 @@ static inline void switch_fpu_finish(struct fpu *new_fpu) if (!static_cpu_has(X86_FEATURE_FPU)) return; - set_thread_flag(TIF_NEED_FPU_LOAD); + set_ti_thread_flag(task_thread_info(task), TIF_NEED_FPU_LOAD); if (!cpu_feature_enabled(X86_FEATURE_OSPKE)) return; diff --git a/arch/x86/include/asm/resctrl_sched.h b/arch/x86/include/asm/resctrl_sched.h index f6b7fe2833cc..9a00d9df9d02 100644 --- a/arch/x86/include/asm/resctrl_sched.h +++ b/arch/x86/include/asm/resctrl_sched.h @@ -51,7 +51,7 @@ DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key); * simple as possible. * Must be called with preemption disabled. 
*/ -static void __resctrl_sched_in(void) +static void __resctrl_sched_in(struct task_struct *task) { struct resctrl_pqr_state *state = this_cpu_ptr(_state); u32 closid = state->default_closid; @@ -62,13 +62,13 @@ static void __resctrl_sched_in(void) * Else use the closid/rmid assigned to this cpu. */ if (static_branch_likely(_alloc_enable_key)) { - if (current->closid) + if (task->closid) closid = current->closid; } if (static_branch_likely(_mon_enable_key)) { - if (current->rmid) - rmid = current->rmid; + if (task->rmid) + rmid = task->rmid; } if (closid != state->cur_closid || rmid != state->cur_rmid) { @@ -78,15 +78,15 @@ static void __resctrl_sched_in(void) } } -static inline void resctrl_sched_in(void) +static inline void resctrl_sched_in(struct task_struct *task) { if (static_branch_likely(_enable_key)) - __resctrl_sched_in(); + __resctrl_sched_in(task); } #else -static inline void resctrl_sched_in(void) {} +static inline void resctrl_sched_in(struct task_struct *task) {} #endif /* CONFIG_X86_CPU_RESCTRL */ diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index a46dee8e78db..5fcf56cbf438 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -311,7 +311,7 @@ static void update_cpu_closid_rmid(void *info) * executing task might have its own closid selected. Just reuse * the context switch code. 
*/ - resctrl_sched_in(); + resctrl_sched_in(current); } /* @@ -536,7 +536,7 @@ static void move_myself(struct callback_head *head) preempt_disable(); /* update PQR_ASSOC MSR to make resource group go into effect */ - resctrl_sched_in(); + resctrl_sched_in(current); preempt_enable(); kfree(callback); diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c index b8ceec4974fe..699a4c95ab13 100644 --- a/arch/x86/kernel/process_32.c +++ b/arch/x86/kernel/process_32.c @@ -292,10 +292,10 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) this_cpu_write(current_task, next_p); - switch_fpu_finish(next_fpu); + switch_fpu_finish(next_p, next_fpu); /* Load the Intel cache allocation PQR MSR. */ - resctrl_sched_in(); + resctrl_sched_in(next_p); return prev_p; } diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c index af64519b2695..bb811808936d 100644 --- a/arch/x86/kernel/process_64.c +++ b/arch/x86/kernel/process_64.c @@ -565,7 +565,7 @@ __switch
[PATCH 7/7] x86/current: Aggressive caching of current
The current_task is supposed to be constant in each thread and therefore does not need to be reread. There is already an attempt to cache it using inline assembly, using this_cpu_read_stable(), which hides the dependency on the read memory address. However, this caching is not working very well. For example, sync_mm_rss() still reads current_task twice for no reason.

Allow more aggressive caching by aliasing current_task to a constant const_current_task and reading from the constant copy. Doing so requires the compiler to support x86 segment qualifiers. Hide const_current_task in a different compilation unit to prevent the compiler from assuming that the value is constant during compilation.

Signed-off-by: Nadav Amit
---
 arch/x86/include/asm/current.h | 30 ++
 arch/x86/kernel/cpu/Makefile | 1 +
 arch/x86/kernel/cpu/common.c | 7 +--
 arch/x86/kernel/cpu/current.c | 16
 include/linux/compiler.h | 2 +-
 5 files changed, 49 insertions(+), 7 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/current.c

diff --git a/arch/x86/include/asm/current.h b/arch/x86/include/asm/current.h index 3e204e6140b5..7f093e81a647 100644 --- a/arch/x86/include/asm/current.h +++ b/arch/x86/include/asm/current.h @@ -10,11 +10,41 @@ struct task_struct; DECLARE_PER_CPU(struct task_struct *, current_task); +#if USE_X86_SEG_SUPPORT + +/* + * Hold a constant alias for current_task, which would allow to avoid caching of + * current task. + * + * We must mark const_current_task with the segment qualifiers, as otherwise gcc + * would do redundant reads of const_current_task. + */ +DECLARE_PER_CPU(struct task_struct * const __percpu_seg_override, const_current_task); + +static __always_inline struct task_struct *get_current(void) +{ + + /* +* GCC is missing functionality of removing segment qualifiers, which +* messes with per-cpu infrastructure that holds local copies. Use +* __raw_cpu_read to avoid holding any copy. 
+*/ + return __raw_cpu_read(, const_current_task); +} + +#else /* USE_X86_SEG_SUPPORT */ + +/* + * Without segment qualifier support, the per-cpu infrastrucutre is not + * suitable for reading constants, so use this_cpu_read_stable() in this case. + */ static __always_inline struct task_struct *get_current(void) { return this_cpu_read_stable(current_task); } +#endif /* USE_X86_SEG_SUPPORT */ + #define current get_current() #endif /* __ASSEMBLY__ */ diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile index d7a1e5a9331c..d816f03a37d7 100644 --- a/arch/x86/kernel/cpu/Makefile +++ b/arch/x86/kernel/cpu/Makefile @@ -19,6 +19,7 @@ CFLAGS_common.o := $(nostackp) obj-y := cacheinfo.o scattered.o topology.o obj-y += common.o +obj-y += current.o obj-y += rdrand.o obj-y += match.o obj-y += bugs.o diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 5cc2d51cc25e..5f7c9ee57802 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -1619,13 +1619,8 @@ DEFINE_PER_CPU_FIRST(struct fixed_percpu_data, EXPORT_PER_CPU_SYMBOL_GPL(fixed_percpu_data); /* - * The following percpu variables are hot. Align current_task to - * cacheline size such that they fall in the same cacheline. + * The following percpu variables are hot. */ -DEFINE_PER_CPU(struct task_struct *, current_task) cacheline_aligned = - _task; -EXPORT_PER_CPU_SYMBOL(current_task); - DEFINE_PER_CPU(struct irq_stack *, hardirq_stack_ptr); DEFINE_PER_CPU(unsigned int, irq_count) __visible = -1; diff --git a/arch/x86/kernel/cpu/current.c b/arch/x86/kernel/cpu/current.c new file mode 100644 index ..3238c6e34984 --- /dev/null +++ b/arch/x86/kernel/cpu/current.c @@ -0,0 +1,16 @@ +#include +#include + +/* + * Align current_task to cacheline size such that they fall in the same + * cacheline. 
+ */ +DEFINE_PER_CPU(struct task_struct *, current_task) cacheline_aligned = + _task; +EXPORT_PER_CPU_SYMBOL(current_task); + +#if USE_X86_SEG_SUPPORT +DECLARE_PER_CPU(struct task_struct * const __percpu_seg_override, const_current_task) + __attribute__((alias("current_task"))); +EXPORT_PER_CPU_SYMBOL(const_current_task); +#endif diff --git a/include/linux/compiler.h b/include/linux/compiler.h index f0fd5636fddb..1b6ee9ab6373 100644 --- a/include/linux/compiler.h +++ b/include/linux/compiler.h @@ -299,7 +299,7 @@ unsigned long read_word_at_a_time(const void *addr) */ #define __ADDRESSABLE(sym) \ static void * __section(".discard.addressable") __used \ - __PASTE(__addressable_##sym, __LINE__) = (void *) + __PASTE(__addressable_##sym, __LINE__) = (void *)(uintptr_t) /** * offset_to_ptr - convert a relative memory offset to an absolute pointer -- 2.17.1
[PATCH 0/7] x86/percpu: Use segment qualifiers
GCC 6+ supports segment qualifiers. Using them enables several optimizations:

1. Avoid unnecessary instructions when an operation is carried out on a per-cpu value that is read or written, and instead allow the compiler to emit instructions that access the per-cpu value directly.

2. Make this_cpu_ptr() more efficient and allow its value to be cached, since preemption must be disabled when this_cpu_ptr() is used.

3. Provide a better alternative for this_cpu_read_stable() that caches values more efficiently, using an alias attribute to a const variable.

4. Allow the compiler to perform other optimizations (e.g., CSE).

5. Use RIP-relative addressing in per_cpu_read_stable(), which makes it PIE-ready.

"size" and Peter's compare do not seem to show the impact of the code-size reduction correctly. Summing the code size according to nm on defconfig shows a minor reduction from 11451310 to 11451310 (0.09%).

RFC->v1:
  * Fixing i386 build bug
  * Moving chunk to the right place [Peter]

Nadav Amit (7):
  compiler: Report x86 segment support
  x86/percpu: Use compiler segment prefix qualifier
  x86/percpu: Use C for percpu accesses when possible
  x86: Fix possible caching of current_task
  percpu: Assume preemption is disabled on per_cpu_ptr()
  x86/percpu: Optimized arch_raw_cpu_ptr()
  x86/current: Aggressive caching of current

 arch/x86/include/asm/current.h         | 30 +++
 arch/x86/include/asm/fpu/internal.h    | 7 +-
 arch/x86/include/asm/percpu.h          | 293 +++--
 arch/x86/include/asm/preempt.h         | 3 +-
 arch/x86/include/asm/resctrl_sched.h   | 14 +-
 arch/x86/kernel/cpu/Makefile           | 1 +
 arch/x86/kernel/cpu/common.c           | 7 +-
 arch/x86/kernel/cpu/current.c          | 16 ++
 arch/x86/kernel/cpu/resctrl/rdtgroup.c | 4 +-
 arch/x86/kernel/process_32.c           | 4 +-
 arch/x86/kernel/process_64.c           | 4 +-
 include/asm-generic/percpu.h           | 12 +
 include/linux/compiler-gcc.h           | 4 +
 include/linux/compiler.h               | 2 +-
 include/linux/percpu-defs.h            | 33 ++-
 15 files changed, 346 insertions(+), 88 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/current.c

--
2.17.1
[PATCH 2/7] x86/percpu: Use compiler segment prefix qualifier
Using a segment prefix qualifier is cleaner than using a segment prefix in the inline assembly, and provides the compiler with more information, telling it that __seg_gs:[addr] is different than [addr] when it analyzes data dependencies. It also enables various optimizations that will be implemented in the next patches. Use segment prefix qualifiers when they are supported. Unfortunately, gcc does not provide a way to remove segment qualifiers, which is needed to use typeof() to create local instances of the per-cpu variable. For this reason, do not use the segment qualifier for per-cpu variables, and do casting using the segment qualifier instead. Signed-off-by: Nadav Amit --- arch/x86/include/asm/percpu.h | 153 ++--- arch/x86/include/asm/preempt.h | 3 +- 2 files changed, 104 insertions(+), 52 deletions(-) diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h index 2278797c769d..1fe348884477 100644 --- a/arch/x86/include/asm/percpu.h +++ b/arch/x86/include/asm/percpu.h @@ -45,8 +45,33 @@ #include #include +#if defined(COMPILER_HAS_X86_SEG_SUPPORT) && IS_ENABLED(CONFIG_SMP) +#define USE_X86_SEG_SUPPORT1 +#else +#define USE_X86_SEG_SUPPORT0 +#endif + #ifdef CONFIG_SMP + +#if USE_X86_SEG_SUPPORT + +#ifdef CONFIG_X86_64 +#define __percpu_seg_override __seg_gs +#else +#define __percpu_seg_override __seg_fs +#endif + +#define __percpu_prefix"" +#define __force_percpu_prefix "%%"__stringify(__percpu_seg)":" + +#else /* USE_X86_SEG_SUPPORT */ + +#define __percpu_seg_override #define __percpu_prefix"%%"__stringify(__percpu_seg)":" +#define __force_percpu_prefix __percpu_prefix + +#endif /* USE_X86_SEG_SUPPORT */ + #define __my_cpu_offsetthis_cpu_read(this_cpu_off) /* @@ -58,14 +83,21 @@ unsigned long tcp_ptr__;\ asm volatile("add " __percpu_arg(1) ", %0" \ : "=r" (tcp_ptr__) \ -: "m" (this_cpu_off), "0" (ptr)); \ +: "m" (__my_cpu_var(this_cpu_off)),\ + "0" (ptr)); \ (typeof(*(ptr)) __kernel __force *)tcp_ptr__; \ }) -#else +#else /* CONFIG_SMP */ 
+#define __percpu_seg_override #define __percpu_prefix"" -#endif +#define __force_percpu_prefix "" +#endif /* CONFIG_SMP */ +#define __my_cpu_type(var) typeof(var) __percpu_seg_override +#define __my_cpu_ptr(ptr) (__my_cpu_type(*ptr) *)(uintptr_t)(ptr) +#define __my_cpu_var(var) (*__my_cpu_ptr()) #define __percpu_arg(x)__percpu_prefix "%" #x +#define __force_percpu_arg(x) __force_percpu_prefix "%" #x /* * Initialized pointers to per-cpu variables needed for the boot @@ -98,22 +130,22 @@ do { \ switch (sizeof(var)) { \ case 1: \ asm qual (op "b %1,"__percpu_arg(0) \ - : "+m" (var)\ + : "+m" (__my_cpu_var(var)) \ : "qi" ((pto_T__)(val))); \ break; \ case 2: \ asm qual (op "w %1,"__percpu_arg(0) \ - : "+m" (var)\ + : "+m" (__my_cpu_var(var)) \ : "ri" ((pto_T__)(val))); \ break; \ case 4: \ asm qual (op "l %1,"__percpu_arg(0) \ - : "+m" (var)\ + : "+m" (__my_cpu_var(var)) \ : "ri" ((pto_T__)(val))); \ break; \ case 8: \ asm qual (op "q %1,"__percpu_arg(0) \ - : "+m" (var)\ + : "+m" (__my_cpu_var(var)) \ : "re" ((pto_T__)(val))); \ break; \ default: __bad_percpu_size(); \ @@ -138,71 +170,79 @@ do { \ switch (sizeof(var)) { \ case 1: \
[PATCH 6/7] x86/percpu: Optimized arch_raw_cpu_ptr()
Implementing arch_raw_cpu_ptr() in C allows the compiler to perform better optimizations, such as using an appropriate base to compute the address instead of a separate add instruction. The benefit of this computation is relevant only when compiler segment qualifiers are used. This method cannot be used when the address size is greater than the maximum operand size, as is the case when building vdso32.

Distinguish between the two cases in which preemption is disabled (as happens when this_cpu_ptr() is used) and enabled (when raw_cpu_ptr() is used). This allows further optimizations; for instance, in rcu_dynticks_eqs_exit() the following code:

  mov    $0x2bbc0,%rax
  add    %gs:0x7ef07570(%rip),%rax        # 0x10358
  lock xadd %edx,0xd8(%rax)

turns with this patch into:

  mov    %gs:0x7ef08aa5(%rip),%rax        # 0x10358
  lock xadd %edx,0x2bc58(%rax)

Signed-off-by: Nadav Amit
---
 arch/x86/include/asm/percpu.h | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 13987f9bc82f..3d82df27d45c 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -73,20 +73,41 @@
 #endif /* USE_X86_SEG_SUPPORT */
 #define __my_cpu_offset	this_cpu_read(this_cpu_off)
+#define __raw_my_cpu_offset	__this_cpu_read(this_cpu_off)
+#define __my_cpu_ptr(ptr)	(__my_cpu_type(*ptr) *)(uintptr_t)(ptr)
+#if USE_X86_SEG_SUPPORT && !defined(BUILD_VDSO32) && defined(CONFIG_X86_64)
+/*
+ * Efficient implementation for cases in which the compiler supports C segments.
+ * Allows the compiler to perform additional optimizations that can save more
+ * instructions.
+ *
+ * This optimized version can only be used if the pointer size equals the native
+ * operand size, which does not happen when vdso32 is used.
+ */ +#define __arch_raw_cpu_ptr_qual(qual, ptr) \ +({ \ + (qual typeof(*(ptr)) __kernel __force *)((uintptr_t)(ptr) + \ + __my_cpu_offset); \ +}) +#else /* USE_X86_SEG_SUPPORT && !defined(BUILD_VDSO32) && defined(CONFIG_X86_64) */ /* * Compared to the generic __my_cpu_offset version, the following * saves one instruction and avoids clobbering a temp register. */ -#define arch_raw_cpu_ptr(ptr) \ +#define __arch_raw_cpu_ptr_qual(qual, ptr) \ ({ \ unsigned long tcp_ptr__;\ - asm volatile("add " __percpu_arg(1) ", %0" \ + asm qual ("add " __percpu_arg(1) ", %0" \ : "=r" (tcp_ptr__) \ : "m" (__my_cpu_var(this_cpu_off)),\ "0" (ptr)); \ (typeof(*(ptr)) __kernel __force *)tcp_ptr__; \ }) +#endif /* USE_X86_SEG_SUPPORT && !defined(BUILD_VDSO32) && defined(CONFIG_X86_64) */ + +#define arch_raw_cpu_ptr(ptr) __arch_raw_cpu_ptr_qual(volatile, ptr) +#define arch_raw_cpu_ptr_preemption_disabled(ptr) __arch_raw_cpu_ptr_qual( , ptr) #else /* CONFIG_SMP */ #define __percpu_seg_override #define __percpu_prefix"" -- 2.17.1
[PATCH v4 2/9] x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote()
Unifying these two functions allows them to be used by the updated SMP infrastructure. To do so, remove the reason argument from flush_tlb_func_local(), add a member to struct flush_tlb_info that records which CPU initiated the flush, and act accordingly. Optimize the size of struct flush_tlb_info while we are at it.

Unfortunately, this prevents us from using a constant flush_tlb_info for arch_tlbbatch_flush(), but at a later stage we will inline flush_tlb_info into the IPI data, so eventually it should have no impact.

Reviewed-by: Dave Hansen
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Thomas Gleixner
Cc: Andy Lutomirski
Cc: Josh Poimboeuf
Signed-off-by: Nadav Amit
---
 arch/x86/include/asm/tlbflush.h | 5 +-
 arch/x86/mm/tlb.c | 85 +++--
 2 files changed, 41 insertions(+), 49 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 6f66d841262d..2f6e9be163ae 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -559,8 +559,9 @@ struct flush_tlb_info {
 	unsigned long start;
 	unsigned long end;
 	u64 new_tlb_gen;
-	unsigned int stride_shift;
-	bool freed_tables;
+	unsigned int initiating_cpu;
+	u8 stride_shift;
+	u8 freed_tables;
 };
 #define local_flush_tlb() __flush_tlb()

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index e6a9edc5baaf..2674f55ed9a1 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -292,7 +292,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	 * NB: leave_mm() calls us with prev == NULL and tsk == NULL.
 	 */
-	/* We don't want flush_tlb_func_* to run concurrently with us. */
+	/* We don't want flush_tlb_func() to run concurrently with us.
*/ if (IS_ENABLED(CONFIG_PROVE_LOCKING)) WARN_ON_ONCE(!irqs_disabled()); @@ -512,14 +512,13 @@ void initialize_tlbstate_and_flush(void) } /* - * flush_tlb_func_common()'s memory ordering requirement is that any + * flush_tlb_func()'s memory ordering requirement is that any * TLB fills that happen after we flush the TLB are ordered after we * read active_mm's tlb_gen. We don't need any explicit barriers * because all x86 flush operations are serializing and the * atomic64_read operation won't be reordered by the compiler. */ -static void flush_tlb_func_common(const struct flush_tlb_info *f, - bool local, enum tlb_flush_reason reason) +static void flush_tlb_func(void *info) { /* * We have three different tlb_gen values in here. They are: @@ -530,14 +529,26 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f, * - f->new_tlb_gen: the generation that the requester of the flush * wants us to catch up to. */ + const struct flush_tlb_info *f = info; struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm); u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid); u64 mm_tlb_gen = atomic64_read(_mm->context.tlb_gen); u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen); + bool local = smp_processor_id() == f->initiating_cpu; + unsigned long nr_invalidate = 0; /* This code cannot presently handle being reentered. */ VM_WARN_ON(!irqs_disabled()); + if (!local) { + inc_irq_stat(irq_tlb_count); + count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED); + + /* Can only happen on remote CPUs */ + if (f->mm && f->mm != loaded_mm) + return; + } + if (unlikely(loaded_mm == _mm)) return; @@ -565,8 +576,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f, * be handled can catch us all the way up, leaving no work for * the second flush. 
*/ - trace_tlb_flush(reason, 0); - return; + goto done; } WARN_ON_ONCE(local_tlb_gen > mm_tlb_gen); @@ -613,46 +623,34 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f, f->new_tlb_gen == local_tlb_gen + 1 && f->new_tlb_gen == mm_tlb_gen) { /* Partial flush */ - unsigned long nr_invalidate = (f->end - f->start) >> f->stride_shift; unsigned long addr = f->start; + nr_invalidate = (f->end - f->start) >> f->stride_shift; + while (addr < f->end) { __flush_tlb_one_user(addr); addr += 1UL << f->stride_shift; }
[PATCH v4 5/9] x86/mm/tlb: Privatize cpu_tlbstate
cpu_tlbstate is mostly private and only the variable is_lazy is shared. This causes some false-sharing when TLB flushes are performed. Break cpu_tlbstate into cpu_tlbstate and cpu_tlbstate_shared, and mark each one accordingly.

Cc: Andy Lutomirski
Cc: Peter Zijlstra
Reviewed-by: Dave Hansen
Signed-off-by: Nadav Amit
---
 arch/x86/include/asm/tlbflush.h | 39 ++---
 arch/x86/mm/init.c | 2 +-
 arch/x86/mm/tlb.c | 15 -
 3 files changed, 31 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 559195f79c2f..1f88ea410ff3 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -178,23 +178,6 @@ struct tlb_state {
 	u16 loaded_mm_asid;
 	u16 next_asid;
-	/*
-	 * We can be in one of several states:
-	 *
-	 * - Actively using an mm. Our CPU's bit will be set in
-	 *   mm_cpumask(loaded_mm) and is_lazy == false;
-	 *
-	 * - Not using a real mm. loaded_mm == &init_mm. Our CPU's bit
-	 *   will not be set in mm_cpumask(&init_mm) and is_lazy == false.
-	 *
-	 * - Lazily using a real mm. loaded_mm != &init_mm, our bit
-	 *   is set in mm_cpumask(loaded_mm), but is_lazy == true.
-	 *   We're heuristically guessing that the CR3 load we
-	 *   skipped more than makes up for the overhead added by
-	 *   lazy mode.
-	 */
-	bool is_lazy;
-
 	/*
 	 * If set we changed the page tables in such a way that we
 	 * needed an invalidation of all contexts (aka. PCIDs / ASIDs).
@@ -240,7 +223,27 @@ struct tlb_state {
 	 */
 	struct tlb_context ctxs[TLB_NR_DYN_ASIDS];
 };
-DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);
+DECLARE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate);
+
+struct tlb_state_shared {
+	/*
+	 * We can be in one of several states:
+	 *
+	 * - Actively using an mm. Our CPU's bit will be set in
+	 *   mm_cpumask(loaded_mm) and is_lazy == false;
+	 *
+	 * - Not using a real mm. loaded_mm == &init_mm. Our CPU's bit
+	 *   will not be set in mm_cpumask(&init_mm) and is_lazy == false.
+	 *
+	 * - Lazily using a real mm.
loaded_mm != _mm, our bit +*is set in mm_cpumask(loaded_mm), but is_lazy == true. +*We're heuristically guessing that the CR3 load we +*skipped more than makes up for the overhead added by +*lazy mode. +*/ + bool is_lazy; +}; +DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared); /* * Blindly accessing user memory from NMI context can be dangerous diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c index fd10d91a6115..34027f36a944 100644 --- a/arch/x86/mm/init.c +++ b/arch/x86/mm/init.c @@ -951,7 +951,7 @@ void __init zone_sizes_init(void) free_area_init_nodes(max_zone_pfns); } -__visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = { +__visible DEFINE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate) = { .loaded_mm = _mm, .next_asid = 1, .cr4 = ~0UL,/* fail hard if we screw up cr4 shadow initialization */ diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 5376a5447bd0..24c9839e3d9b 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -145,7 +145,7 @@ void leave_mm(int cpu) return; /* Warn if we're not lazy. 
*/ - WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy)); + WARN_ON(!this_cpu_read(cpu_tlbstate_shared.is_lazy)); switch_mm(NULL, _mm, NULL); } @@ -277,7 +277,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, { struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm); u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid); - bool was_lazy = this_cpu_read(cpu_tlbstate.is_lazy); + bool was_lazy = this_cpu_read(cpu_tlbstate_shared.is_lazy); unsigned cpu = smp_processor_id(); u64 next_tlb_gen; bool need_flush; @@ -322,7 +322,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, __flush_tlb_all(); } #endif - this_cpu_write(cpu_tlbstate.is_lazy, false); + this_cpu_write(cpu_tlbstate_shared.is_lazy, false); /* * The membarrier system call requires a full memory barrier and @@ -463,7 +463,7 @@ void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) if (this_cpu_read(cpu_tlbstate.loaded_mm) == _mm) return; - this_cpu_write(cpu_tlbstate.is_lazy, true); + this_cpu_write(cpu_tlbstate_shared.is_lazy, true); } /* @@ -555,7 +555,7 @@ static void flush_tlb_func(void *info) VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) != loaded_mm->context.ctx
[PATCH v4 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently
To improve TLB shootdown performance, flush the remote and local TLBs concurrently. Introduce flush_tlb_multi() that does so. Introduce paravirtual versions of flush_tlb_multi() for KVM, Xen and Hyper-V (the Xen and Hyper-V versions are only compile-tested).

While the updated SMP infrastructure is capable of running a function on a single local core, it is not optimized for this case. The multiple function calls and the indirect branch introduce some overhead, and might make local TLB flushes slower than they were before the recent changes. Before calling the SMP infrastructure, check whether only a local TLB flush is needed, to restore the lost performance in this common case. This requires checking mm_cpumask() one more time, but unless this mask is updated very frequently, it should not impact performance negatively.

Cc: "K. Y. Srinivasan"
Cc: Haiyang Zhang
Cc: Stephen Hemminger
Cc: Sasha Levin
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Borislav Petkov
Cc: x...@kernel.org
Cc: Juergen Gross
Cc: Paolo Bonzini
Cc: Andy Lutomirski
Cc: Peter Zijlstra
Cc: Boris Ostrovsky
Cc: linux-hyp...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: k...@vger.kernel.org
Cc: xen-de...@lists.xenproject.org
Reviewed-by: Michael Kelley # Hyper-V parts
Reviewed-by: Juergen Gross # Xen and paravirt parts
Reviewed-by: Dave Hansen
Signed-off-by: Nadav Amit
---
 arch/x86/hyperv/mmu.c | 10 +++---
 arch/x86/include/asm/paravirt.h | 6 ++--
 arch/x86/include/asm/paravirt_types.h | 4 +--
 arch/x86/include/asm/tlbflush.h | 8 ++---
 arch/x86/include/asm/trace/hyperv.h | 2 +-
 arch/x86/kernel/kvm.c | 11 +--
 arch/x86/kernel/paravirt.c | 2 +-
 arch/x86/mm/tlb.c | 45 ++-
 arch/x86/xen/mmu_pv.c | 11 +++
 include/trace/events/xen.h | 2 +-
 10 files changed, 61 insertions(+), 40 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index e65d7fe6489f..8740d8b21db3 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -50,8 +50,8 @@ static inline int
fill_gva_list(u64 gva_list[], int offset, return gva_n - offset; } -static void hyperv_flush_tlb_others(const struct cpumask *cpus, - const struct flush_tlb_info *info) +static void hyperv_flush_tlb_multi(const struct cpumask *cpus, + const struct flush_tlb_info *info) { int cpu, vcpu, gva_n, max_gvas; struct hv_tlb_flush **flush_pcpu; @@ -59,7 +59,7 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus, u64 status = U64_MAX; unsigned long flags; - trace_hyperv_mmu_flush_tlb_others(cpus, info); + trace_hyperv_mmu_flush_tlb_multi(cpus, info); if (!hv_hypercall_pg) goto do_native; @@ -156,7 +156,7 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus, if (!(status & HV_HYPERCALL_RESULT_MASK)) return; do_native: - native_flush_tlb_others(cpus, info); + native_flush_tlb_multi(cpus, info); } static u64 hyperv_flush_tlb_others_ex(const struct cpumask *cpus, @@ -231,6 +231,6 @@ void hyperv_setup_mmu_ops(void) return; pr_info("Using hypercall for remote TLB flush\n"); - pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others; + pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi; pv_ops.mmu.tlb_remove_table = tlb_remove_table; } diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 69089d46f128..bc4829c9b3f9 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -62,10 +62,10 @@ static inline void __flush_tlb_one_user(unsigned long addr) PVOP_VCALL1(mmu.flush_tlb_one_user, addr); } -static inline void flush_tlb_others(const struct cpumask *cpumask, - const struct flush_tlb_info *info) +static inline void flush_tlb_multi(const struct cpumask *cpumask, + const struct flush_tlb_info *info) { - PVOP_VCALL2(mmu.flush_tlb_others, cpumask, info); + PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info); } static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table) diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 
70b654f3ffe5..63fa751344bf 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -206,8 +206,8 @@ struct pv_mmu_ops { void (*flush_tlb_user)(void); void (*flush_tlb_kernel)(void); void (*flush_tlb_one_user)(unsigned long addr); - void (*flush_tlb_others)(const struct cpumask *cpus, -const struct flush_tlb_info *info); + void (*flush_tlb_multi)(co
[PATCH v4 9/9] x86/mm/tlb: Remove unnecessary uses of the inline keyword
The compiler is smart enough without these hints. Cc: Andy Lutomirski Cc: Peter Zijlstra Suggested-by: Dave Hansen Reviewed-by: Dave Hansen Signed-off-by: Nadav Amit --- arch/x86/mm/tlb.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 3dca146edcf1..0addc6e84126 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -189,7 +189,7 @@ static void sync_current_stack_to_mm(struct mm_struct *mm) } } -static inline unsigned long mm_mangle_tif_spec_ib(struct task_struct *next) +static unsigned long mm_mangle_tif_spec_ib(struct task_struct *next) { unsigned long next_tif = task_thread_info(next)->flags; unsigned long ibpb = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_IBPB; @@ -741,7 +741,7 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct flush_tlb_info, flush_tlb_info); static DEFINE_PER_CPU(unsigned int, flush_tlb_info_idx); #endif -static inline struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm, +static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm, unsigned long start, unsigned long end, unsigned int stride_shift, bool freed_tables, u64 new_tlb_gen) @@ -768,7 +768,7 @@ static inline struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm, return info; } -static inline void put_flush_tlb_info(void) +static void put_flush_tlb_info(void) { #ifdef CONFIG_DEBUG_VM /* Complete reentrency prevention checks */ -- 2.17.1
[PATCH v4 1/9] smp: Run functions concurrently in smp_call_function_many()
Currently, on_each_cpu() and similar functions do not exploit the potential of concurrency: the function is first executed remotely and only then locally. Functions such as a TLB flush can take considerable time, so this provides an opportunity for performance optimization.

To do so, introduce __smp_call_function_many(), which allows the callers to provide local and remote functions that should be executed, and runs them concurrently. Keep the smp_call_function_many() semantics as they are today for backward compatibility: in that case the called function is not executed locally.

__smp_call_function_many() does not use the optimized path for a single remote target that smp_call_function_single() implements. For a synchronous function call, smp_call_function_single() keeps a call_single_data (which is used for synchronization) on the stack. Interestingly, it seems that not using this optimization provides greater performance improvements (greater speedup with a single remote target than with multiple ones). Presumably, holding data structures that are intended for synchronization on the stack can introduce overheads due to TLB misses and false-sharing when the stack is used for other purposes.

Adding support for running the functions concurrently required removing a micro-optimization in on_each_cpu() that disabled/enabled IRQs instead of saving/restoring them. The benefit of running the local and remote code concurrently is expected to be greater.
Reviewed-by: Dave Hansen
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Thomas Gleixner
Cc: Andy Lutomirski
Cc: Josh Poimboeuf
Signed-off-by: Nadav Amit
---
 include/linux/smp.h | 34 ---
 kernel/smp.c | 138 +---
 2 files changed, 92 insertions(+), 80 deletions(-)

diff --git a/include/linux/smp.h b/include/linux/smp.h
index 6fc856c9eda5..d18d54199635 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -32,11 +32,6 @@ extern unsigned int total_cpus;
 int smp_call_function_single(int cpuid, smp_call_func_t func, void *info, int wait);
-/*
- * Call a function on all processors
- */
-void on_each_cpu(smp_call_func_t func, void *info, int wait);
-
 /*
  * Call a function on processors specified by mask, which might include
  * the local one.
@@ -44,6 +39,17 @@ void on_each_cpu(smp_call_func_t func, void *info, int wait);
 void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func, void *info, bool wait);
+/*
+ * Call a function on all processors. May be used during early boot while
+ * early_boot_irqs_disabled is set.
+ */
+static inline void on_each_cpu(smp_call_func_t func, void *info, int wait)
+{
+	preempt_disable();
+	on_each_cpu_mask(cpu_online_mask, func, info, wait);
+	preempt_enable();
+}
+
 /*
  * Call a function on each processor for which the supplied function
  * cond_func returns a positive value. This may include the local
@@ -102,8 +108,22 @@ extern void smp_cpus_done(unsigned int max_cpus);
  * Call a function on all other processors
  */
 void smp_call_function(smp_call_func_t func, void *info, int wait);
-void smp_call_function_many(const struct cpumask *mask,
-	smp_call_func_t func, void *info, bool wait);
+
+/*
+ * Flags to be used as the scf_flags argument of __smp_call_function_many().
+ */ +#define SCF_WAIT (1U << 0) /* Wait until function execution completed */ +#define SCF_RUN_LOCAL (1U << 1) /* Run also locally if local cpu is set in cpumask */ + +void __smp_call_function_many(const struct cpumask *mask, + smp_call_func_t func, void *info, + unsigned int scf_flags); + +static inline void smp_call_function_many(const struct cpumask *mask, + smp_call_func_t func, void *info, bool wait) +{ + __smp_call_function_many(mask, func, info, wait ? SCF_WAIT : 0); +} int smp_call_function_any(const struct cpumask *mask, smp_call_func_t func, void *info, int wait); diff --git a/kernel/smp.c b/kernel/smp.c index 7dbcb402c2fc..9faec688b34b 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -396,24 +396,28 @@ int smp_call_function_any(const struct cpumask *mask, EXPORT_SYMBOL_GPL(smp_call_function_any); /** - * smp_call_function_many(): Run a function on a set of other CPUs. + * __smp_call_function_many(): Run a function on a set of CPUs. * @mask: The set of cpus to run on (only runs on online subset). * @func: The function to run. This must be fast and non-blocking. * @info: An arbitrary pointer to pass to the function. - * @wait: If true, wait (atomically) until function has completed - *on other CPUs. - * - * If @wait is true, then returns once @func has returned. + * @flags: Bitmask that controls the operation. If %SCF_WAIT is set, wait + *(atomically) until function has completed o
[PATCH v4 3/9] x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
Open-code on_each_cpu_cond_mask() in native_flush_tlb_others() to optimize the code. Open-coding eliminates the need for the indirect branch that is used to call is_lazy(), and on CPUs that are vulnerable to Spectre v2, it eliminates the retpoline. In addition, it allows the use of a preallocated cpumask for computing which CPUs should be flushed. This will later allow us to avoid adapting on_each_cpu_cond_mask() to support local and remote functions.

Note that calling tlb_is_not_lazy() for every CPU that needs to be flushed, as done in native_flush_tlb_multi(), might look ugly, but it is equivalent to what is currently done in on_each_cpu_cond_mask(). Actually, native_flush_tlb_multi() does it more efficiently, since it avoids using an indirect branch for the matter.

Reviewed-by: Dave Hansen
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Thomas Gleixner
Cc: Andy Lutomirski
Cc: Josh Poimboeuf
Signed-off-by: Nadav Amit
---
 arch/x86/mm/tlb.c | 40 +---
 1 file changed, 33 insertions(+), 7 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 2674f55ed9a1..c3ca3545d78a 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -653,11 +653,13 @@ static void flush_tlb_func(void *info)
 		nr_invalidate);
 }
-static bool tlb_is_not_lazy(int cpu, void *data)
+static bool tlb_is_not_lazy(int cpu)
 {
 	return !per_cpu(cpu_tlbstate.is_lazy, cpu);
 }
+static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask);
+
 void native_flush_tlb_others(const struct cpumask *cpumask, const struct flush_tlb_info *info)
 {
@@ -701,12 +703,36 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 	 * up on the new contents of what used to be page tables, while
 	 * doing a speculative memory access.
*/ - if (info->freed_tables) - smp_call_function_many(cpumask, flush_tlb_func, - (void *)info, 1); - else - on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func, - (void *)info, 1, GFP_ATOMIC, cpumask); + if (info->freed_tables) { + smp_call_function_many(cpumask, flush_tlb_func, (void *)info, 1); + } else { + /* +* Although we could have used on_each_cpu_cond_mask(), +* open-coding it has performance advantages, as it eliminates +* the need for indirect calls or retpolines. In addition, it +* allows to use a designated cpumask for evaluating the +* condition, instead of allocating one. +* +* This code works under the assumption that there are no nested +* TLB flushes, an assumption that is already made in +* flush_tlb_mm_range(). +* +* cond_cpumask is logically a stack-local variable, but it is +* more efficient to have it off the stack and not to allocate +* it on demand. Preemption is disabled and this code is +* non-reentrant. +*/ + struct cpumask *cond_cpumask = this_cpu_ptr(_tlb_mask); + int cpu; + + cpumask_clear(cond_cpumask); + + for_each_cpu(cpu, cpumask) { + if (tlb_is_not_lazy(cpu)) + __cpumask_set_cpu(cpu, cond_cpumask); + } + smp_call_function_many(cond_cpumask, flush_tlb_func, (void *)info, 1); + } } /* -- 2.17.1
[PATCH v4 0/9] x86/tlb: Concurrent TLB flushes
[ Similar cover letter to v3, with updated performance numbers on Skylake. Sorry for the time it took since the last version. ]

Currently, local and remote TLB flushes are not performed concurrently, which introduces unnecessary overhead - each PTE flush can take hundreds of cycles. This patch-set allows TLB flushes to be run concurrently: first request the remote CPUs to initiate the flush, then run it locally, and finally wait for the remote CPUs to finish their work.

In addition, there are various small optimizations to avoid, for example, unwarranted false-sharing.

The proposed changes should also improve the performance of other invocations of on_each_cpu(). Hopefully, no one has relied on the behavior of on_each_cpu() that invoked functions first remotely and only then locally [Peter says he remembers someone might do so, but without further information it is hard to know how to address it].

Running sysbench on dax/ext4 w/emulated-pmem, write-cache disabled, on a 2-socket, 56-logical-core (28+SMT) Skylake, 5 repetitions:

  sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
   --file-io-mode=mmap --threads=X --file-fsync-mode=fdatasync run

  Th.  tip-aug22 avg (stdev)   +patch-set avg (stdev)   change
  ---  ---------------------   ----------------------   ------
  1    1152920 (7453)          1169469 (9059)           +1.4%
  2    1545832 (12555)         1584172 (10484)          +2.4%
  4    2480703 (12039)         2518641 (12875)          +1.5%
  8    3684486 (26007)         3840343 (44144)          +4.2%
  16   4981567 (23565)         5125756 (15458)          +2.8%
  32   5679542 (10116)         5887826 (6121)           +3.6%
  56   5630944 (17937)         5812514 (26264)          +3.2%

(Note that on configurations with up to 28 threads numactl was used to set all threads on socket 1, which explains the drop in performance when going to 32 threads.)

Running the same benchmark with security mitigations disabled (PTI, Spectre, MDS):

  Th.  tip-aug22 avg (stdev)   +patch-set avg (stdev)   change
  ---  ---------------------   ----------------------   ------
  1    1444119 (8524)          1469606 (10527)          +1.7%
  2    1921540 (24169)         1961899 (14450)          +2.1%
  4    3073716 (21786)         3199880 (16774)          +4.1%
  8    4700698 (49534)         4802312 (11043)          +2.1%
  16   6005180 (6366)          6006656 (31624)           0%
  32   6826466 (10496)         6886622 (19110)          +0.8%
  56   6832344 (13468)         6885586 (20646)          +0.8%

The results differ somewhat from those previously obtained on Haswell-X, and the maximum performance improvement is smaller. However, the performance improvement is still significant.

v3 -> v4:
* Merge flush_tlb_func_local() and flush_tlb_func_remote() [Peter]
* Prevent preemption in on_each_cpu(). It is not needed, but it prevents concerns. [Peter/tglx]
* Adding acked-by, reviewed-by tags

v2 -> v3:
* Open-code the remote/local-flush decision code [Andy]
* Fix hyper-v, Xen implementations [Andrew]
* Fix redundant TLB flushes.

v1 -> v2:
* Removing the patches that Thomas took [tglx]
* Adding hyper-v, Xen compile-tested implementations [Dave]
* Removing UV [Andy]
* Adding lazy optimization, removing inline keyword [Dave]
* Restructuring patch-set

RFCv2 -> v1:
* Fix comment on flush_tlb_multi [Juergen]
* Removing async invalidation optimizations [Andy]
* Adding KVM support [Paolo]

Cc: Andy Lutomirski
Cc: Borislav Petkov
Cc: Boris Ostrovsky
Cc: Dave Hansen
Cc: Haiyang Zhang
Cc: Ingo Molnar
Cc: Josh Poimboeuf
Cc: Juergen Gross
Cc: "K. Y.
Srinivasan" Cc: Paolo Bonzini Cc: Peter Zijlstra Cc: Rik van Riel Cc: Sasha Levin Cc: Stephen Hemminger Cc: Thomas Gleixner Cc: k...@vger.kernel.org Cc: linux-hyp...@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: virtualizat...@lists.linux-foundation.org Cc: x...@kernel.org Cc: xen-de...@lists.xenproject.org Nadav Amit (9): smp: Run functions concurrently in smp_call_function_many() x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote() x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy() x86/mm/tlb: Flush remote and local TLBs concurrently x86/mm/tlb: Privatize cpu_tlbstate x86/mm/tlb: Do not make is_lazy dirty for no reason cpumask: Mark functions as pure x86/mm/tlb: Remove UV special case x86/mm/tlb: Remove unnecessary uses of the inline keyword arch/x86/hyperv/mmu.c | 10 +- arch/x86/include/asm/paravirt.h | 6 +- arch/x86/include/asm/paravirt_types.h | 4 +- arch/x86/include/asm/tlbflush.h | 52 +++ arch/x86/include/asm/trace/hyperv.h | 2 +- arch/x86/kernel/kvm.c | 11 +- arch/x86/kernel/paravirt.c| 2 +- arch/x86/mm/init.c| 2 +- arch/x86/mm/tlb.c | 195 ++ arch/x86/xen/mmu_pv.c | 11 +- i
[PATCH v4 6/9] x86/mm/tlb: Do not make is_lazy dirty for no reason
Blindly writing to is_lazy when the written value is identical to the old value needlessly dirties the cacheline. Avoid such writes to prevent unnecessary cache-coherency traffic.

Cc: Andy Lutomirski
Cc: Peter Zijlstra
Suggested-by: Dave Hansen
Reviewed-by: Dave Hansen
Signed-off-by: Nadav Amit
---
 arch/x86/mm/tlb.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 24c9839e3d9b..1393b3cd3697 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -322,7 +322,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		__flush_tlb_all();
 	}
 #endif
-	this_cpu_write(cpu_tlbstate_shared.is_lazy, false);
+	if (was_lazy)
+		this_cpu_write(cpu_tlbstate_shared.is_lazy, false);
 	/*
 	 * The membarrier system call requires a full memory barrier and
-- 2.17.1
[PATCH v4 7/9] cpumask: Mark functions as pure
cpumask_next_and() and cpumask_any_but() are pure, and marking them as such seems to generate different and presumably better code for native_flush_tlb_multi(). Reviewed-by: Dave Hansen Signed-off-by: Nadav Amit --- include/linux/cpumask.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h index b5a5a1ed9efd..2579700bf44a 100644 --- a/include/linux/cpumask.h +++ b/include/linux/cpumask.h @@ -225,7 +225,7 @@ static inline unsigned int cpumask_last(const struct cpumask *srcp) return find_last_bit(cpumask_bits(srcp), nr_cpumask_bits); } -unsigned int cpumask_next(int n, const struct cpumask *srcp); +unsigned int __pure cpumask_next(int n, const struct cpumask *srcp); /** * cpumask_next_zero - get the next unset cpu in a cpumask @@ -242,8 +242,8 @@ static inline unsigned int cpumask_next_zero(int n, const struct cpumask *srcp) return find_next_zero_bit(cpumask_bits(srcp), nr_cpumask_bits, n+1); } -int cpumask_next_and(int n, const struct cpumask *, const struct cpumask *); -int cpumask_any_but(const struct cpumask *mask, unsigned int cpu); +int __pure cpumask_next_and(int n, const struct cpumask *, const struct cpumask *); +int __pure cpumask_any_but(const struct cpumask *mask, unsigned int cpu); unsigned int cpumask_local_spread(unsigned int i, int node); /** -- 2.17.1
[PATCH v4 8/9] x86/mm/tlb: Remove UV special case
The SGI UV TLB flushing code is outdated and will be replaced with a compatible smp_call_many APIC function in the future. For now, simplify the code by removing the UV special case.

Cc: Peter Zijlstra
Suggested-by: Andy Lutomirski
Acked-by: Mike Travis
Reviewed-by: Dave Hansen
Signed-off-by: Nadav Amit
---
 arch/x86/mm/tlb.c | 25 -------------------------
 1 file changed, 25 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 1393b3cd3697..3dca146edcf1 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -679,31 +679,6 @@ void native_flush_tlb_multi(const struct cpumask *cpumask,
 	trace_tlb_flush(TLB_REMOTE_SEND_IPI,
 			(info->end - info->start) >> PAGE_SHIFT);
 
-	if (is_uv_system()) {
-		/*
-		 * This whole special case is confused. UV has a "Broadcast
-		 * Assist Unit", which seems to be a fancy way to send IPIs.
-		 * Back when x86 used an explicit TLB flush IPI, UV was
-		 * optimized to use its own mechanism. These days, x86 uses
-		 * smp_call_function_many(), but UV still uses a manual IPI,
-		 * and that IPI's action is out of date -- it does a manual
-		 * flush instead of calling flush_tlb_func(). This
-		 * means that the percpu tlb_gen variables won't be updated
-		 * and we'll do pointless flushes on future context switches.
-		 *
-		 * Rather than hooking native_flush_tlb_multi() here, I think
-		 * that UV should be updated so that smp_call_function_many(),
-		 * etc, are optimal on UV.
-		 */
-		flush_tlb_func((void *)info);
-
-		cpumask = uv_flush_tlb_others(cpumask, info);
-		if (cpumask)
-			smp_call_function_many(cpumask, flush_tlb_func,
-					       (void *)info, 1);
-		return;
-	}
-
 	/*
 	 * If no page tables were freed, we can skip sending IPIs to
 	 * CPUs in lazy TLB mode. They will flush the CPU themselves
-- 
2.17.1
Re: [PATCH] KVM: x86: Don't update RIP or do single-step on faulting emulation
> On Aug 23, 2019, at 1:55 PM, Sean Christopherson > wrote: > > Don't advance RIP or inject a single-step #DB if emulation signals a > fault. This logic applies to all state updates that are conditional on > clean retirement of the emulation instruction, e.g. updating RFLAGS was > previously handled by commit 38827dbd3fb85 ("KVM: x86: Do not update > EFLAGS on faulting emulation"). > > Not advancing RIP is likely a nop, i.e. ctxt->eip isn't updated with > ctxt->_eip until emulation "retires" anyways. Skipping #DB injection > fixes a bug reported by Andy Lutomirski where a #UD on SYSCALL due to > invalid state with RFLAGS.RF=1 would loop indefinitely due to emulation > overwriting the #UD with #DB and thus restarting the bad SYSCALL over > and over. > > Cc: Nadav Amit > Cc: sta...@vger.kernel.org > Reported-by: Andy Lutomirski > Fixes: 663f4c61b803 ("KVM: x86: handle singlestep during emulation") > Signed-off-by: Sean Christopherson Seems fine. I guess I should’ve found it before… Consider running the relevant self-tests (e.g., single_step_syscall) to avoid regressions.
Re: [PATCH] mm/balloon_compaction: suppress allocation warnings
> On Aug 21, 2019, at 12:13 PM, David Hildenbrand wrote: > > On 21.08.19 18:34, Nadav Amit wrote: >>> On Aug 21, 2019, at 9:29 AM, David Hildenbrand wrote: >>> >>> On 21.08.19 18:23, Nadav Amit wrote: >>>>> On Aug 21, 2019, at 9:05 AM, David Hildenbrand wrote: >>>>> >>>>> On 20.08.19 11:16, Nadav Amit wrote: >>>>>> There is no reason to print warnings when balloon page allocation fails, >>>>>> as they are expected and can be handled gracefully. Since VMware >>>>>> balloon now uses balloon-compaction infrastructure, and suppressed these >>>>>> warnings before, it is also beneficial to suppress these warnings to >>>>>> keep the same behavior that the balloon had before. >>>>> >>>>> I am not sure if that's a good idea. The allocation warnings are usually >>>>> the only trace of "the user/admin did something bad because he/she tried >>>>> to inflate the balloon to an unsafe value". Believe me, I processed a >>>>> couple of such bugreports related to virtio-balloon and the warning were >>>>> very helpful for that. >>>> >>>> Ok, so a message is needed, but does it have to be a generic frightening >>>> warning? >>>> >>>> How about using __GFP_NOWARN, and if allocation do something like: >>>> >>>> pr_warn(“Balloon memory allocation failed”); >>>> >>>> Or even something more informative? This would surely be less intimidating >>>> for common users. >>> >>> ratelimit would make sense :) >>> >>> And yes, this would certainly be nicer. >> >> Thanks. I will post v2 of the patch. > > As discussed in v2, we already print a warning in virtio-balloon, so I > am fine with this patch. > > Reviewed-by: David Hildenbrand Michael, If it is possible to get it to 5.3, to avoid behavioral change for VMware balloon users, it would be great. Thanks, Nadav
Re: [PATCH v2] mm/balloon_compaction: Informative allocation warnings
> On Aug 21, 2019, at 12:12 PM, David Hildenbrand wrote: > > On 21.08.19 21:10, Nadav Amit wrote: >>> On Aug 21, 2019, at 12:06 PM, David Hildenbrand wrote: >>> >>> On 21.08.19 20:59, Nadav Amit wrote: >>>>> On Aug 21, 2019, at 11:57 AM, David Hildenbrand wrote: >>>>> >>>>> On 21.08.19 11:41, Nadav Amit wrote: >>>>>> There is no reason to print generic warnings when balloon memory >>>>>> allocation fails, as failures are expected and can be handled >>>>>> gracefully. Since VMware balloon now uses balloon-compaction >>>>>> infrastructure, and suppressed these warnings before, it is also >>>>>> beneficial to suppress these warnings to keep the same behavior that the >>>>>> balloon had before. >>>>>> >>>>>> Since such warnings can still be useful to indicate that the balloon is >>>>>> over-inflated, print more informative and less frightening warning if >>>>>> allocation fails instead. >>>>>> >>>>>> Cc: David Hildenbrand >>>>>> Cc: Jason Wang >>>>>> Signed-off-by: Nadav Amit >>>>>> >>>>>> --- >>>>>> >>>>>> v1->v2: >>>>>> * Print informative warnings instead suppressing [David] >>>>>> --- >>>>>> mm/balloon_compaction.c | 7 ++- >>>>>> 1 file changed, 6 insertions(+), 1 deletion(-) >>>>>> >>>>>> diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c >>>>>> index 798275a51887..0c1d1f7689f0 100644 >>>>>> --- a/mm/balloon_compaction.c >>>>>> +++ b/mm/balloon_compaction.c >>>>>> @@ -124,7 +124,12 @@ EXPORT_SYMBOL_GPL(balloon_page_list_dequeue); >>>>>> struct page *balloon_page_alloc(void) >>>>>> { >>>>>> struct page *page = alloc_page(balloon_mapping_gfp_mask() | >>>>>> - __GFP_NOMEMALLOC | >>>>>> __GFP_NORETRY); >>>>>> + __GFP_NOMEMALLOC | __GFP_NORETRY >>>>>> | >>>>>> + __GFP_NOWARN); >>>>>> + >>>>>> +if (!page) >>>>>> +pr_warn_ratelimited("memory balloon: memory allocation >>>>>> failed"); >>>>>> + >>>>>> return page; >>>>>> } >>>>>> EXPORT_SYMBOL_GPL(balloon_page_alloc); >>>>> >>>>> Not sure if "memory balloon" is the right wording. hmmm. 
>>>>> >>>>> Acked-by: David Hildenbrand >>>> Do you have a better suggestion? >>> >>> Not really - that's why I ack'ed :) >>> >>> However, thinking about it - what about moving the check + print to the >>> caller and then using dev_warn... or sth. like simple "virtio_balloon: >>> ..." ? You can then drop the warning for vmware balloon if you feel like >>> not needing it. >> >> Actually, there is already a warning that is printed by the virtio_balloon >> in fill_balloon(): >> >> struct page *page = balloon_page_alloc(); >> >> if (!page) { >> dev_info_ratelimited(&vb->vdev->dev, >> "Out of puff! Can't get %u >> pages\n", >> VIRTIO_BALLOON_PAGES_PER_PAGE); >> /* Sleep for at least 1/5 of a second before retry. */ >> msleep(200); >> break; >> } >> >> So are you ok with going back to v1? > > Whoops, I missed that - sorry - usually the warnings scream louder at me :D > > Yes, v1 is fine with me! Thanks, I missed this one too. This change should prevent making users concerned for no good reason.
Re: [PATCH v2] mm/balloon_compaction: Informative allocation warnings
> On Aug 21, 2019, at 12:06 PM, David Hildenbrand wrote: > > On 21.08.19 20:59, Nadav Amit wrote: >>> On Aug 21, 2019, at 11:57 AM, David Hildenbrand wrote: >>> >>> On 21.08.19 11:41, Nadav Amit wrote: >>>> There is no reason to print generic warnings when balloon memory >>>> allocation fails, as failures are expected and can be handled >>>> gracefully. Since VMware balloon now uses balloon-compaction >>>> infrastructure, and suppressed these warnings before, it is also >>>> beneficial to suppress these warnings to keep the same behavior that the >>>> balloon had before. >>>> >>>> Since such warnings can still be useful to indicate that the balloon is >>>> over-inflated, print more informative and less frightening warning if >>>> allocation fails instead. >>>> >>>> Cc: David Hildenbrand >>>> Cc: Jason Wang >>>> Signed-off-by: Nadav Amit >>>> >>>> --- >>>> >>>> v1->v2: >>>> * Print informative warnings instead suppressing [David] >>>> --- >>>> mm/balloon_compaction.c | 7 ++- >>>> 1 file changed, 6 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c >>>> index 798275a51887..0c1d1f7689f0 100644 >>>> --- a/mm/balloon_compaction.c >>>> +++ b/mm/balloon_compaction.c >>>> @@ -124,7 +124,12 @@ EXPORT_SYMBOL_GPL(balloon_page_list_dequeue); >>>> struct page *balloon_page_alloc(void) >>>> { >>>>struct page *page = alloc_page(balloon_mapping_gfp_mask() | >>>> - __GFP_NOMEMALLOC | __GFP_NORETRY); >>>> + __GFP_NOMEMALLOC | __GFP_NORETRY | >>>> + __GFP_NOWARN); >>>> + >>>> + if (!page) >>>> + pr_warn_ratelimited("memory balloon: memory allocation failed"); >>>> + >>>>return page; >>>> } >>>> EXPORT_SYMBOL_GPL(balloon_page_alloc); >>> >>> Not sure if "memory balloon" is the right wording. hmmm. >>> >>> Acked-by: David Hildenbrand >> >> Do you have a better suggestion? > > Not really - that's why I ack'ed :) > > However, thinking about it - what about moving the check + print to the > caller and then using dev_warn... or sth. 
like simple "virtio_balloon: > ..." ? You can then drop the warning for vmware balloon if you feel like > not needing it.

Actually, there is already a warning that is printed by the virtio_balloon in fill_balloon():

	struct page *page = balloon_page_alloc();

	if (!page) {
		dev_info_ratelimited(&vb->vdev->dev,
				     "Out of puff! Can't get %u pages\n",
				     VIRTIO_BALLOON_PAGES_PER_PAGE);
		/* Sleep for at least 1/5 of a second before retry. */
		msleep(200);
		break;
	}

So are you ok with going back to v1?
Re: [PATCH v2] mm/balloon_compaction: Informative allocation warnings
> On Aug 21, 2019, at 11:57 AM, David Hildenbrand wrote: > > On 21.08.19 11:41, Nadav Amit wrote: >> There is no reason to print generic warnings when balloon memory >> allocation fails, as failures are expected and can be handled >> gracefully. Since VMware balloon now uses balloon-compaction >> infrastructure, and suppressed these warnings before, it is also >> beneficial to suppress these warnings to keep the same behavior that the >> balloon had before. >> >> Since such warnings can still be useful to indicate that the balloon is >> over-inflated, print more informative and less frightening warning if >> allocation fails instead. >> >> Cc: David Hildenbrand >> Cc: Jason Wang >> Signed-off-by: Nadav Amit >> >> --- >> >> v1->v2: >> * Print informative warnings instead suppressing [David] >> --- >> mm/balloon_compaction.c | 7 ++- >> 1 file changed, 6 insertions(+), 1 deletion(-) >> >> diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c >> index 798275a51887..0c1d1f7689f0 100644 >> --- a/mm/balloon_compaction.c >> +++ b/mm/balloon_compaction.c >> @@ -124,7 +124,12 @@ EXPORT_SYMBOL_GPL(balloon_page_list_dequeue); >> struct page *balloon_page_alloc(void) >> { >> struct page *page = alloc_page(balloon_mapping_gfp_mask() | >> - __GFP_NOMEMALLOC | __GFP_NORETRY); >> + __GFP_NOMEMALLOC | __GFP_NORETRY | >> + __GFP_NOWARN); >> + >> +if (!page) >> +pr_warn_ratelimited("memory balloon: memory allocation failed"); >> + >> return page; >> } >> EXPORT_SYMBOL_GPL(balloon_page_alloc); > > Not sure if "memory balloon" is the right wording. hmmm. > > Acked-by: David Hildenbrand Do you have a better suggestion?
[PATCH v2] mm/balloon_compaction: Informative allocation warnings
There is no reason to print generic warnings when balloon memory allocation fails, as failures are expected and can be handled gracefully. Since VMware balloon now uses the balloon-compaction infrastructure, and suppressed these warnings before, it is also beneficial to suppress these warnings to keep the same behavior that the balloon had before.

Since such warnings can still be useful to indicate that the balloon is over-inflated, print a more informative and less frightening warning if allocation fails instead.

Cc: David Hildenbrand
Cc: Jason Wang
Signed-off-by: Nadav Amit

---

v1->v2:
 * Print informative warnings instead of suppressing [David]
---
 mm/balloon_compaction.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index 798275a51887..0c1d1f7689f0 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -124,7 +124,12 @@ EXPORT_SYMBOL_GPL(balloon_page_list_dequeue);
 struct page *balloon_page_alloc(void)
 {
 	struct page *page = alloc_page(balloon_mapping_gfp_mask() |
-				       __GFP_NOMEMALLOC | __GFP_NORETRY);
+				       __GFP_NOMEMALLOC | __GFP_NORETRY |
+				       __GFP_NOWARN);
+
+	if (!page)
+		pr_warn_ratelimited("memory balloon: memory allocation failed");
+
 	return page;
 }
 EXPORT_SYMBOL_GPL(balloon_page_alloc);
-- 
2.17.1
Re: [PATCH] mm/balloon_compaction: suppress allocation warnings
> On Aug 21, 2019, at 9:29 AM, David Hildenbrand wrote: > > On 21.08.19 18:23, Nadav Amit wrote: >>> On Aug 21, 2019, at 9:05 AM, David Hildenbrand wrote: >>> >>> On 20.08.19 11:16, Nadav Amit wrote: >>>> There is no reason to print warnings when balloon page allocation fails, >>>> as they are expected and can be handled gracefully. Since VMware >>>> balloon now uses balloon-compaction infrastructure, and suppressed these >>>> warnings before, it is also beneficial to suppress these warnings to >>>> keep the same behavior that the balloon had before. >>> >>> I am not sure if that's a good idea. The allocation warnings are usually >>> the only trace of "the user/admin did something bad because he/she tried >>> to inflate the balloon to an unsafe value". Believe me, I processed a >>> couple of such bugreports related to virtio-balloon and the warning were >>> very helpful for that. >> >> Ok, so a message is needed, but does it have to be a generic frightening >> warning? >> >> How about using __GFP_NOWARN, and if allocation do something like: >> >> pr_warn(“Balloon memory allocation failed”); >> >> Or even something more informative? This would surely be less intimidating >> for common users. > > ratelimit would make sense :) > > And yes, this would certainly be nicer. Thanks. I will post v2 of the patch.
Re: [PATCH] mm/balloon_compaction: suppress allocation warnings
> On Aug 21, 2019, at 9:05 AM, David Hildenbrand wrote: > > On 20.08.19 11:16, Nadav Amit wrote: >> There is no reason to print warnings when balloon page allocation fails, >> as they are expected and can be handled gracefully. Since VMware >> balloon now uses balloon-compaction infrastructure, and suppressed these >> warnings before, it is also beneficial to suppress these warnings to >> keep the same behavior that the balloon had before. > > I am not sure if that's a good idea. The allocation warnings are usually > the only trace of "the user/admin did something bad because he/she tried > to inflate the balloon to an unsafe value". Believe me, I processed a > couple of such bugreports related to virtio-balloon and the warnings were > very helpful for that. Ok, so a message is needed, but does it have to be a generic frightening warning? How about using __GFP_NOWARN, and if allocation fails do something like: pr_warn(“Balloon memory allocation failed”); Or even something more informative? This would surely be less intimidating for common users.
[PATCH] VMCI: Release resource if the work is already queued
Francois reported that VMware balloon gets stuck after a balloon reset, when the VMCI doorbell is removed. A similar error can occur when the balloon driver is removed with the following splat: [ 1088.622000] INFO: task modprobe:3565 blocked for more than 120 seconds. [ 1088.622035] Tainted: GW 5.2.0 #4 [ 1088.622087] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 1088.622205] modprobeD0 3565 1450 0x [ 1088.622210] Call Trace: [ 1088.622246] __schedule+0x2a8/0x690 [ 1088.622248] schedule+0x2d/0x90 [ 1088.622250] schedule_timeout+0x1d3/0x2f0 [ 1088.622252] wait_for_completion+0xba/0x140 [ 1088.622320] ? wake_up_q+0x80/0x80 [ 1088.622370] vmci_resource_remove+0xb9/0xc0 [vmw_vmci] [ 1088.622373] vmci_doorbell_destroy+0x9e/0xd0 [vmw_vmci] [ 1088.622379] vmballoon_vmci_cleanup+0x6e/0xf0 [vmw_balloon] [ 1088.622381] vmballoon_exit+0x18/0xcc8 [vmw_balloon] [ 1088.622394] __x64_sys_delete_module+0x146/0x280 [ 1088.622408] do_syscall_64+0x5a/0x130 [ 1088.622410] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 1088.622415] RIP: 0033:0x7f54f62791b7 [ 1088.622421] Code: Bad RIP value. [ 1088.622421] RSP: 002b:7fff2a949008 EFLAGS: 0206 ORIG_RAX: 00b0 [ 1088.622426] RAX: ffda RBX: 55dff8b55d00 RCX: 7f54f62791b7 [ 1088.622426] RDX: RSI: 0800 RDI: 55dff8b55d68 [ 1088.622427] RBP: 55dff8b55d00 R08: 7fff2a947fb1 R09: [ 1088.622427] R10: 7f54f62f5cc0 R11: 0206 R12: 55dff8b55d68 [ 1088.622428] R13: 0001 R14: 55dff8b55d68 R15: 7fff2a94a3f0 The cause for the bug is that when the "delayed" doorbell is invoked, it takes a reference on the doorbell entry and schedules work that is supposed to run the appropriate code and drop the doorbell entry reference. The code ignores the fact that if the work is already queued, it will not be scheduled to run one more time. As a result one of the references would not be dropped. When the code waits for the reference to get to zero, during balloon reset or module removal, it gets stuck. Fix it. 
Drop the reference if schedule_work() indicates that the work is already queued.

Note that this bug got more apparent (or apparent at all) due to commit ce664331b248 ("vmw_balloon: VMCI_DOORBELL_SET does not check status").

Fixes: 83e2ec765be03 ("VMCI: doorbell implementation.")
Reported-by: Francois Rigault
Cc: Jorgen Hansen
Cc: Adit Ranadive
Cc: Alexios Zavras
Cc: Vishnu DASA
Cc: sta...@vger.kernel.org
Signed-off-by: Nadav Amit
---
 drivers/misc/vmw_vmci/vmci_doorbell.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/misc/vmw_vmci/vmci_doorbell.c b/drivers/misc/vmw_vmci/vmci_doorbell.c
index bad89b6e0802..345addd9306d 100644
--- a/drivers/misc/vmw_vmci/vmci_doorbell.c
+++ b/drivers/misc/vmw_vmci/vmci_doorbell.c
@@ -310,7 +310,8 @@ int vmci_dbell_host_context_notify(u32 src_cid, struct vmci_handle handle)
 	entry = container_of(resource, struct dbell_entry, resource);
 	if (entry->run_delayed) {
-		schedule_work(&entry->work);
+		if (!schedule_work(&entry->work))
+			vmci_resource_put(resource);
 	} else {
 		entry->notify_cb(entry->client_data);
 		vmci_resource_put(resource);
@@ -361,7 +362,8 @@ static void dbell_fire_entries(u32 notify_idx)
 		    atomic_read(&dbell->active) == 1) {
 			if (dbell->run_delayed) {
 				vmci_resource_get(&dbell->resource);
-				schedule_work(&dbell->work);
+				if (!schedule_work(&dbell->work))
+					vmci_resource_put(&dbell->resource);
 			} else {
 				dbell->notify_cb(dbell->client_data);
 			}
-- 
2.19.1
[PATCH] vmw_balloon: Fix offline page marking with compaction
The compaction code already marks pages as offline when it enqueues pages in the ballooned page list, and removes the mapping when the pages are removed from the list. VMware balloon also updates the flags, instead of letting the balloon-compaction logic handle it, which causes the assertion VM_BUG_ON_PAGE(!PageOffline(page)) to fire, when __ClearPageOffline is called the second time. This causes the following crash. [ 487.104520] kernel BUG at include/linux/page-flags.h:749! [ 487.106364] invalid opcode: [#1] SMP DEBUG_PAGEALLOC PTI [ 487.107681] CPU: 7 PID: 1106 Comm: kworker/7:3 Not tainted 5.3.0-rc5balloon #227 [ 487.109196] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018 [ 487.111452] Workqueue: events_freezable vmballoon_work [vmw_balloon] [ 487.112779] RIP: 0010:vmballoon_release_page_list+0xaa/0x100 [vmw_balloon] [ 487.114200] Code: fe 48 c1 e7 06 4c 01 c7 8b 47 30 41 89 c1 41 81 e1 00 01 00 f0 41 81 f9 00 00 00 f0 74 d3 48 c7 c6 08 a1 a1 c0 e8 06 0d e7 ea <0f> 0b 44 89 f6 4c 89 c7 e8 49 9c e9 ea 49 8d 75 08 49 8b 45 08 4d [ 487.118033] RSP: 0018:b82f012bbc98 EFLAGS: 00010246 [ 487.119135] RAX: 0037 RBX: 0001 RCX: 0006 [ 487.120601] RDX: RSI: RDI: 9a85b6bd7620 [ 487.122071] RBP: b82f012bbcc0 R08: 0001 R09: [ 487.123536] R10: R11: R12: b82f012bbd00 [ 487.125002] R13: e97f4598d9c0 R14: R15: b82f012bbd34 [ 487.126463] FS: () GS:9a85b6bc() knlGS: [ 487.128110] CS: 0010 DS: ES: CR0: 80050033 [ 487.129316] CR2: 7ffe6e413ea0 CR3: 000230b18001 CR4: 003606e0 [ 487.130812] DR0: DR1: DR2: [ 487.132283] DR3: DR6: fffe0ff0 DR7: 0400 [ 487.133749] Call Trace: [ 487.134333] vmballoon_deflate+0x22c/0x390 [vmw_balloon] [ 487.135468] vmballoon_work+0x6e7/0x913 [vmw_balloon] [ 487.136711] ? process_one_work+0x21a/0x5e0 [ 487.138581] process_one_work+0x298/0x5e0 [ 487.139926] ? vmballoon_migratepage+0x310/0x310 [vmw_balloon] [ 487.141610] ? 
process_one_work+0x298/0x5e0 [ 487.143053] worker_thread+0x41/0x400 [ 487.144389] kthread+0x12b/0x150 [ 487.145582] ? process_one_work+0x5e0/0x5e0 [ 487.146937] ? kthread_create_on_node+0x60/0x60 [ 487.148637] ret_from_fork+0x3a/0x50 Fix it by updating the PageOffline indication only when a 2MB page is enqueued and dequeued. The 4KB pages will be handled correctly by the balloon compaction logic. Fixes: 83a8afa72e9c ("vmw_balloon: Compaction support") Cc: David Hildenbrand Reported-by: Thomas Hellstrom Signed-off-by: Nadav Amit --- drivers/misc/vmw_balloon.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c index 8840299420e0..5e6be1527571 100644 --- a/drivers/misc/vmw_balloon.c +++ b/drivers/misc/vmw_balloon.c @@ -691,7 +691,6 @@ static int vmballoon_alloc_page_list(struct vmballoon *b, } if (page) { - vmballoon_mark_page_offline(page, ctl->page_size); /* Success. Add the page to the list and continue. */ list_add(>lru, >pages); continue; @@ -930,7 +929,6 @@ static void vmballoon_release_page_list(struct list_head *page_list, list_for_each_entry_safe(page, tmp, page_list, lru) { list_del(>lru); - vmballoon_mark_page_online(page, page_size); __free_pages(page, vmballoon_page_order(page_size)); } @@ -1005,6 +1003,7 @@ static void vmballoon_enqueue_page_list(struct vmballoon *b, enum vmballoon_page_size_type page_size) { unsigned long flags; + struct page *page; if (page_size == VMW_BALLOON_4K_PAGE) { balloon_page_list_enqueue(>b_dev_info, pages); @@ -1014,6 +1013,11 @@ static void vmballoon_enqueue_page_list(struct vmballoon *b, * for the balloon compaction mechanism. 
	 */
	spin_lock_irqsave(&b->b_dev_info.pages_lock, flags);
+
+	list_for_each_entry(page, pages, lru) {
+		vmballoon_mark_page_offline(page, VMW_BALLOON_2M_PAGE);
+	}
+
 	list_splice_init(pages, &b->huge_pages);
 	__count_vm_events(BALLOON_INFLATE, *n_pages *
 			  vmballoon_page_in_frames(VMW_BALLOON_2M_PAGE));
@@ -1056,6 +1060,8 @@ static void vmballoon_dequeue_page_list(struct vmballoon *b,
 	/* 2MB pages */
 	spin_lock_irqsave(&b->b_dev_info.pages_lock, flags);
 	list_for_eac
[PATCH] mm/balloon_compaction: suppress allocation warnings
There is no reason to print warnings when balloon page allocation fails, as they are expected and can be handled gracefully. Since VMware balloon now uses the balloon-compaction infrastructure, and suppressed these warnings before, it is also beneficial to suppress these warnings to keep the same behavior that the balloon had before.

Cc: Jason Wang
Signed-off-by: Nadav Amit
---
 mm/balloon_compaction.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index 798275a51887..26de020aae7b 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -124,7 +124,8 @@ EXPORT_SYMBOL_GPL(balloon_page_list_dequeue);
 struct page *balloon_page_alloc(void)
 {
 	struct page *page = alloc_page(balloon_mapping_gfp_mask() |
-				       __GFP_NOMEMALLOC | __GFP_NORETRY);
+				       __GFP_NOMEMALLOC | __GFP_NORETRY |
+				       __GFP_NOWARN);
 	return page;
 }
 EXPORT_SYMBOL_GPL(balloon_page_alloc);
-- 
2.19.1
[PATCH] iommu/vt-d: Fix wrong analysis whether devices share the same bus
set_msi_sid_cb() is used to determine whether device aliases share the same bus, but it can provide false indications that aliases use the same bus when in fact they do not. The reason is that set_msi_sid_cb() assumes that pdev is fixed, while actually pci_for_each_dma_alias() can call fn() when pdev is set to a subordinate device.

As a result, running a VM on ESX with VT-d emulation enabled can result in log warnings such as:

 DMAR: [INTR-REMAP] Request device [00:11.0] fault index 3b [fault reason 38] Blocked an interrupt request due to source-id verification failure

This seems to cause additional ata errors such as:

 ata3.00: qc timeout (cmd 0xa1)
 ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)

These timeouts also cause boot to be much longer and other errors.

Fix it by comparing each alias with the previous one instead.

Fixes: 3f0c625c6ae71 ("iommu/vt-d: Allow interrupts from the entire bus for aliased devices")
Cc: sta...@vger.kernel.org
Cc: Logan Gunthorpe
Cc: David Woodhouse
Cc: Joerg Roedel
Cc: Jacob Pan
Signed-off-by: Nadav Amit
---
 drivers/iommu/intel_irq_remapping.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index 4786ca061e31..81e43c1df7ec 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -376,13 +376,13 @@ static int set_msi_sid_cb(struct pci_dev *pdev, u16 alias, void *opaque)
 {
 	struct set_msi_sid_data *data = opaque;
 
+	if (data->count == 0 || PCI_BUS_NUM(alias) == PCI_BUS_NUM(data->alias))
+		data->busmatch_count++;
+
 	data->pdev = pdev;
 	data->alias = alias;
 	data->count++;
 
-	if (PCI_BUS_NUM(alias) == pdev->bus->number)
-		data->busmatch_count++;
-
 	return 0;
}
-- 
2.17.1
Re: [PATCH v3 8/9] x86/mm/tlb: Remove UV special case
> On Jul 18, 2019, at 7:25 PM, Mike Travis wrote: > > It is a fact that the UV is still the UV and SGI is now part of HPE. The > current external product is known as SuperDome Flex. It is both up to date > as well as very well maintained. The ACK I provided was an okay to change > the code, but please make the description accurate. Looking at the code again, if I remove the call to uv_flush_tlb_others(), is there any use for tlb_uv.c?
Re: [PATCH v3 1/9] smp: Run functions concurrently in smp_call_function_many()
> On Jul 25, 2019, at 5:36 AM, Thomas Gleixner wrote: > > On Mon, 22 Jul 2019, Nadav Amit wrote: >>> On Jul 22, 2019, at 11:51 AM, Thomas Gleixner wrote: >>> void on_each_cpu(void (*func) (void *info), void *info, int wait) >>> { >>> unsigned long flags; >>> >>> preempt_disable(); >>> smp_call_function(func, info, wait); >>> >>> smp_call_function() has another preempt_disable as it can be called >>> separately and it does: >>> >>> preempt_disable(); >>> smp_call_function_many(cpu_online_mask, func, info, wait); >>> >>> Your new on_each_cpu() implementation does not. So there is a >>> difference. Whether it matters or not is a different question, but that >>> needs to be explained and documented. >> >> Thanks for explaining - so your concern is for CPUs being offlined. >> >> But unless I am missing something: on_each_cpu() calls __on_each_cpu_mask(), >> which disables preemption and calls __smp_call_function_many(). >> >> Then __smp_call_function_many() runs: >> >> cpumask_and(cfd->cpumask, mask, cpu_online_mask); >> >> … before choosing which remote CPUs should run the function. So the only >> case that I was missing is if the current CPU goes away and the function is >> called locally. >> >> Can it happen? I can add documentation and a debug assertion for this case. > > I don't think it can happen: > > on_each_cpu() >on_each_cpu_mask() > preempt_disable() >__smp_call_function_many() > > So if a CPU goes offline between on_each_cpu() and preempt_disable() then > there is no damage. After the preempt_disable() it can't go away anymore > and the task executing this cannot be migrated either. > > So yes, it's safe, but please add a big fat comment so future readers won't > be puzzled. I will do. I will need some more time to respin the next version. I see that what I build on top of it might require some changes, and I want to minimize them.
Re: [RFC 7/7] x86/current: Aggressive caching of current
> On Jul 22, 2019, at 2:07 PM, Peter Zijlstra wrote: > > On Thu, Jul 18, 2019 at 10:41:10AM -0700, Nadav Amit wrote: >> The current_task is supposed to be constant in each thread and therefore >> does not need to be reread. There is already an attempt to cache it >> using inline assembly, using this_cpu_read_stable(), which hides the >> dependency on the read memory address. > > Is that what it does?!, I never quite could understand > percpu_stable_op(). That’s my understanding. I am not too pleased that I could not come up with a general alternative to this_cpu_read_stable(), mainly because gcc does not provide a way to get the type without the segment qualifier. Anyhow, “current” seems to be the main pain-point. I think a similar const-alias approach can also be used for stuff like boot_cpu_has(). I have some patches for that somewhere, but the impact is smaller. I do see some small, but measurable performance improvements with this series. I’ll try to incorporate them in v1 once I have time.
Re: [RFC 3/7] x86/percpu: Use C for percpu accesses when possible
> On Jul 22, 2019, at 1:52 PM, Peter Zijlstra wrote: > > On Thu, Jul 18, 2019 at 10:41:06AM -0700, Nadav Amit wrote: > >> diff --git a/arch/x86/include/asm/preempt.h b/arch/x86/include/asm/preempt.h >> index 99a7fa9ab0a3..60f97b288004 100644 >> --- a/arch/x86/include/asm/preempt.h >> +++ b/arch/x86/include/asm/preempt.h >> @@ -91,7 +91,8 @@ static __always_inline void __preempt_count_sub(int val) >> */ >> static __always_inline bool __preempt_count_dec_and_test(void) >> { >> -return GEN_UNARY_RMWcc("decl", __preempt_count, e, __percpu_arg([var])); >> +return GEN_UNARY_RMWcc("decl", __my_cpu_var(__preempt_count), e, >> + __percpu_arg([var])); >> } > > Should this be in the previous patch? Yes, it should.
Re: [PATCH v3 3/9] x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
> On Jul 22, 2019, at 12:47 PM, Rasmus Villemoes > wrote: > > On 19/07/2019 02.58, Nadav Amit wrote: > >> /* >> @@ -865,7 +893,7 @@ void arch_tlbbatch_flush(struct >> arch_tlbflush_unmap_batch *batch) >> if (cpumask_test_cpu(cpu, >cpumask)) { >> lockdep_assert_irqs_enabled(); >> local_irq_disable(); >> -flush_tlb_func_local(_flush_tlb_info); >> +flush_tlb_func_local((void *)_flush_tlb_info); >> local_irq_enable(); >> } > > I think the confusion could be cleared up if you moved this hunk to > patch 2 where it belongs - i.e. where you change the prototype of > flush_tlb_func_local() and hence introduce the warning. Yes, there is a small mess here - the constification should actually go to a different patch… I’ll fix it.
Re: [PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently
> On Jul 22, 2019, at 12:14 PM, Peter Zijlstra wrote: > > On Thu, Jul 18, 2019 at 05:58:32PM -0700, Nadav Amit wrote: >> @@ -709,8 +716,9 @@ void native_flush_tlb_others(const struct cpumask >> *cpumask, >> * doing a speculative memory access. >> */ >> if (info->freed_tables) { >> -smp_call_function_many(cpumask, flush_tlb_func_remote, >> - (void *)info, 1); >> +__smp_call_function_many(cpumask, flush_tlb_func_remote, >> + flush_tlb_func_local, >> + (void *)info, 1); >> } else { >> /* >> * Although we could have used on_each_cpu_cond_mask(), >> @@ -737,7 +745,8 @@ void native_flush_tlb_others(const struct cpumask >> *cpumask, >> if (tlb_is_not_lazy(cpu)) >> __cpumask_set_cpu(cpu, cond_cpumask); >> } >> -smp_call_function_many(cond_cpumask, flush_tlb_func_remote, >> +__smp_call_function_many(cond_cpumask, flush_tlb_func_remote, >> + flush_tlb_func_local, >> (void *)info, 1); >> } >> } > > Do we really need that _local/_remote distinction? ISTR you had a patch > that frobbed flush_tlb_info into the csd and that gave space > constraints, but I'm not seeing that here (probably a wise, get stuff > merged etc..). > > struct __call_single_data { >struct llist_node llist;/* 0 8 */ >smp_call_func_tfunc; /* 8 8 */ >void * info; /*16 8 */ >unsigned int flags;/*24 4 */ > >/* size: 32, cachelines: 1, members: 4 */ >/* padding: 4 */ >/* last cacheline: 32 bytes */ > }; > > struct flush_tlb_info { >struct mm_struct * mm; /* 0 8 */ >long unsigned int start;/* 8 8 */ >long unsigned int end; /*16 8 */ >u64new_tlb_gen; /*24 8 */ >unsigned int stride_shift; /*32 4 */ >bool freed_tables; /*36 1 */ > >/* size: 40, cachelines: 1, members: 6 */ >/* padding: 3 */ >/* last cacheline: 40 bytes */ > }; > > IIRC what you did was make void *__call_single_data::info the last > member and a union until the full cacheline size (64). Given the above > that would get us 24 bytes for csd, leaving us 40 for that > flush_tlb_info. 
> > But then we can still do something like the below, which doesn't change > things and still gets rid of that dual function crud, simplifying > smp_call_function_many again. > > Index: linux-2.6/arch/x86/include/asm/tlbflush.h > === > --- linux-2.6.orig/arch/x86/include/asm/tlbflush.h > +++ linux-2.6/arch/x86/include/asm/tlbflush.h > @@ -546,8 +546,9 @@ struct flush_tlb_info { > unsigned long start; > unsigned long end; > u64 new_tlb_gen; > - unsigned intstride_shift; > - boolfreed_tables; > + unsigned intcpu; > + unsigned short stride_shift; > + unsigned char freed_tables; > }; > > #define local_flush_tlb() __flush_tlb() > Index: linux-2.6/arch/x86/mm/tlb.c > === > --- linux-2.6.orig/arch/x86/mm/tlb.c > +++ linux-2.6/arch/x86/mm/tlb.c > @@ -659,6 +659,27 @@ static void flush_tlb_func_remote(void * > flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN); > } > > +static void flush_tlb_func(void *info) > +{ > + const struct flush_tlb_info *f = info; > + enum tlb_flush_reason reason = TLB_REMOTE_SHOOTDOWN; > + bool local = false; > + > + if (f->cpu == smp_processor_id()) { > + local = true; > + reason = (f->mm == NULL) ? TLB_LOCAL_SHOOTDOWN : > TLB_LOCAL_MM_SHOOTDOWN; > + } else { > + inc_irq_stat(irq_tlb_count); > + > + if (f->mm && f->mm != this_cpu_read(cpu_tlbstate.loaded_mm)) > + return; > + > + count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED); > + } > + > + flush_tlb_func_common(f, local,
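Peter's sketch above replaces the separate `flush_tlb_func_local`/`flush_tlb_func_remote` pair with one callback that decides the local/remote case by comparing the initiator's CPU id (recorded in the info structure) with the executing CPU. A minimal userspace model of that dispatch pattern — names are illustrative stand-ins, not the kernel's API:

```c
#include <assert.h>
#include <stdbool.h>

enum reason { LOCAL_SHOOTDOWN, REMOTE_SHOOTDOWN };

/* Toy analog of flush_tlb_info: carries the initiating CPU's id. */
struct flush_info {
	int initiating_cpu;
};

/*
 * One callback serves both the initiating CPU and the IPI receivers;
 * it classifies itself by comparing ids, as flush_tlb_func() does above.
 */
static enum reason classify(const struct flush_info *f, int this_cpu,
			    bool *local)
{
	*local = (f->initiating_cpu == this_cpu);
	return *local ? LOCAL_SHOOTDOWN : REMOTE_SHOOTDOWN;
}
```

This is the property that makes the `_local`/`_remote` split unnecessary: the same function body runs everywhere and only the bookkeeping (reason code, irq stats) differs by branch.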
Re: [PATCH v3 1/9] smp: Run functions concurrently in smp_call_function_many()
> On Jul 22, 2019, at 11:51 AM, Thomas Gleixner wrote: > > On Mon, 22 Jul 2019, Nadav Amit wrote: >>> On Jul 22, 2019, at 11:37 AM, Thomas Gleixner wrote: >>> >>> On Mon, 22 Jul 2019, Peter Zijlstra wrote: >>> >>>> On Thu, Jul 18, 2019 at 05:58:29PM -0700, Nadav Amit wrote: >>>>> +/* >>>>> + * Call a function on all processors. May be used during early boot >>>>> while >>>>> + * early_boot_irqs_disabled is set. >>>>> + */ >>>>> +static inline void on_each_cpu(smp_call_func_t func, void *info, int >>>>> wait) >>>>> +{ >>>>> + on_each_cpu_mask(cpu_online_mask, func, info, wait); >>>>> +} >>>> >>>> I'm thinking that one if buggy, nothing protects online mask here. >>> >>> The current implementation has preemption disabled before touching >>> cpu_online_mask which at least protects against a CPU going away as that >>> prevents the stomp machine thread from getting on the CPU. But it's not >>> protected against a CPU coming online concurrently. >> >> I still don’t understand. If you called cpu_online_mask() and did not >> disable preemption before calling it, you are already (today) not protected >> against another CPU coming online. Disabling preemption in on_each_cpu() >> will not solve it. > > Disabling preemption _cannot_ protect against a CPU coming online. It only > can protect against a CPU being offlined. > > The current implementation of on_each_cpu() disables preemption _before_ > touching cpu_online_mask. > > void on_each_cpu(void (*func) (void *info), void *info, int wait) > { >unsigned long flags; > >preempt_disable(); > smp_call_function(func, info, wait); > > smp_call_function() has another preempt_disable as it can be called > separately and it does: > >preempt_disable(); >smp_call_function_many(cpu_online_mask, func, info, wait); > > Your new on_each_cpu() implementation does not. So there is a > difference. Whether it matters or not is a different question, but that > needs to be explained and documented. 
Thanks for explaining - so your concern is about CPUs being offlined. But unless I am missing something, on_each_cpu() calls __on_each_cpu_mask(), which disables preemption and calls __smp_call_function_many(). Then __smp_call_function_many() runs:

	cpumask_and(cfd->cpumask, mask, cpu_online_mask);

… before choosing which remote CPUs should run the function. So the only case I was missing is the one in which the current CPU goes away and the function is called locally. Can that happen? I can add documentation and a debug assertion for this case.
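The protection being discussed — snapshotting the target mask against the online mask only after preemption is disabled — can be modeled in plain C. Masks are modeled as 64-bit words and the names are stand-ins for the kernel's, not the real API:

```c
#include <assert.h>
#include <stdint.h>

/* CPUs 0-3 online; a real kernel reads this under preempt_disable(). */
static uint64_t cpu_online_mask = 0x0f;

/*
 * Toy analog of the cpumask_and() in __smp_call_function_many(): the
 * requested mask is ANDed with the online mask, so a CPU that went
 * offline before the snapshot is never sent an IPI.
 */
static uint64_t ipi_targets(uint64_t requested_mask, int this_cpu)
{
	uint64_t targets = requested_mask & cpu_online_mask;

	/* The local CPU runs the function directly, not via IPI. */
	targets &= ~(UINT64_C(1) << this_cpu);
	return targets;
}
```

The remaining hole the email identifies — the current CPU itself going away before the local call — is exactly the case this snapshot cannot cover, which is why a debug assertion is proposed.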
Re: [PATCH v3 1/9] smp: Run functions concurrently in smp_call_function_many()
> On Jul 22, 2019, at 11:16 AM, Peter Zijlstra wrote: > > On Fri, Jul 19, 2019 at 11:23:06AM -0700, Dave Hansen wrote: >> On 7/18/19 5:58 PM, Nadav Amit wrote: >>> @@ -624,16 +622,11 @@ EXPORT_SYMBOL(on_each_cpu); >>> void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func, >>> void *info, bool wait) >>> { >>> - int cpu = get_cpu(); >>> + preempt_disable(); >>> >>> - smp_call_function_many(mask, func, info, wait); >>> - if (cpumask_test_cpu(cpu, mask)) { >>> - unsigned long flags; >>> - local_irq_save(flags); >>> - func(info); >>> - local_irq_restore(flags); >>> - } >>> - put_cpu(); >>> + __smp_call_function_many(mask, func, func, info, wait); >>> + >>> + preempt_enable(); >>> } >> >> The get_cpu() was missing it too, but it would be nice to add some >> comments about why preempt needs to be off. I was also thinking it >> might make sense to do: >> >> cfd = get_cpu_var(cfd_data); >> __smp_call_function_many(cfd, ...); >> put_cpu_var(cfd_data); >> >> instead of the explicit preempt_enable/disable(), but I don't feel too >> strongly about it. > > It is also required for cpu hotplug. But then smpcfd_dead_cpu() will not respect the “cpu” argument. Do you still prefer it this way (instead of the current preempt_enable() / preempt_disable())?
Re: [PATCH v3 1/9] smp: Run functions concurrently in smp_call_function_many()
> On Jul 22, 2019, at 11:37 AM, Thomas Gleixner wrote: > > On Mon, 22 Jul 2019, Peter Zijlstra wrote: > >> On Thu, Jul 18, 2019 at 05:58:29PM -0700, Nadav Amit wrote: >>> +/* >>> + * Call a function on all processors. May be used during early boot while >>> + * early_boot_irqs_disabled is set. >>> + */ >>> +static inline void on_each_cpu(smp_call_func_t func, void *info, int wait) >>> +{ >>> + on_each_cpu_mask(cpu_online_mask, func, info, wait); >>> +} >> >> I'm thinking that one if buggy, nothing protects online mask here. > > The current implementation has preemption disabled before touching > cpu_online_mask which at least protects against a CPU going away as that > prevents the stomp machine thread from getting on the CPU. But it's not > protected against a CPU coming online concurrently. I still don’t understand. If you called cpu_online_mask() and did not disable preemption before calling it, you are already (today) not protected against another CPU coming online. Disabling preemption in on_each_cpu() will not solve it.
Re: [PATCH v3 1/9] smp: Run functions concurrently in smp_call_function_many()
> On Jul 22, 2019, at 11:21 AM, Peter Zijlstra wrote: > > On Thu, Jul 18, 2019 at 05:58:29PM -0700, Nadav Amit wrote: >> +/* >> + * Call a function on all processors. May be used during early boot while >> + * early_boot_irqs_disabled is set. >> + */ >> +static inline void on_each_cpu(smp_call_func_t func, void *info, int wait) >> +{ >> +on_each_cpu_mask(cpu_online_mask, func, info, wait); >> +} > > I'm thinking that one if buggy, nothing protects online mask here. on_each_cpu_mask() calls __on_each_cpu_mask() which would disable preemption. The mask might change, but anyhow __smp_call_function_many() would “and” it, after disabling preemption, with (the potentially updated) cpu_online_mask. What is your concern?
Re: [PATCH v3 5/9] x86/mm/tlb: Privatize cpu_tlbstate
> On Jul 19, 2019, at 11:38 AM, Dave Hansen wrote: > > On 7/18/19 5:58 PM, Nadav Amit wrote: >> +struct tlb_state_shared { >> +/* >> + * We can be in one of several states: >> + * >> + * - Actively using an mm. Our CPU's bit will be set in >> + *mm_cpumask(loaded_mm) and is_lazy == false; >> + * >> + * - Not using a real mm. loaded_mm == _mm. Our CPU's bit >> + *will not be set in mm_cpumask(_mm) and is_lazy == false. >> + * >> + * - Lazily using a real mm. loaded_mm != _mm, our bit >> + *is set in mm_cpumask(loaded_mm), but is_lazy == true. >> + *We're heuristically guessing that the CR3 load we >> + *skipped more than makes up for the overhead added by >> + *lazy mode. >> + */ >> +bool is_lazy; >> +}; >> +DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state_shared, >> cpu_tlbstate_shared); > > Could we get a comment about what "shared" means and why we need shared > state? > > Should we change 'tlb_state' to 'tlb_state_private’? I don’t feel strongly about either one. I preferred the option that is likely to cause fewer changes and potential conflicts. Anyhow, I will add a better comment, as you asked. So it is really up to you.
Re: [PATCH v3 3/9] x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
> On Jul 19, 2019, at 3:44 PM, Joe Perches wrote: > > On Fri, 2019-07-19 at 18:41 +0000, Nadav Amit wrote: >>> On Jul 19, 2019, at 11:36 AM, Dave Hansen wrote: >>> >>> On 7/18/19 5:58 PM, Nadav Amit wrote: >>>> @@ -865,7 +893,7 @@ void arch_tlbbatch_flush(struct >>>> arch_tlbflush_unmap_batch *batch) >>>>if (cpumask_test_cpu(cpu, >cpumask)) { >>>>lockdep_assert_irqs_enabled(); >>>>local_irq_disable(); >>>> - flush_tlb_func_local(_flush_tlb_info); >>>> + flush_tlb_func_local((void *)_flush_tlb_info); >>>>local_irq_enable(); >>>>} >>> >>> This looks like superfluous churn. Is it? >> >> Unfortunately not, since full_flush_tlb_info is defined as const. Without it >> you would get: >> >> warning: passing argument 1 of ‘flush_tlb_func_local’ discards ‘const’ >> qualifier from pointer target type [-Wdiscarded-qualifiers] >> >> And flush_tlb_func_local() should get (void *) argument since it is also >> used by the SMP infrastructure. > > huh? > > commit 3db6d5a5ecaf0a778d721ccf9809248350d4bfaf > Author: Nadav Amit > Date: Thu Apr 25 16:01:43 2019 -0700 > > [] > > -static void flush_tlb_func_local(void *info, enum tlb_flush_reason reason) > +static void flush_tlb_func_local(const void *info, enum tlb_flush_reason > reason) Hopefully I decipher the “huh?” correctly... When we moved the flush_tlb_info off the stack, in the patch that you reference, we created a global full_flush_tlb_info variable, which was set as const and delivered to flush_tlb_func_local. I changed the signature of the function to avoid casting. IIRC, setting the variable as constant slightly improved the generated assembly, and anyhow made sense. In this patch-set I need to change the first argument of flush_tlb_func_local to be non-const to match the SMP function signature which takes a single non-const void * argument. Yes, there is what seems to be non-safe casting, but flush_tlb_func_common() casts it back into a const variable. 
I know that I also added the infrastructure to run a function locally to the SMP code, but it made sense to have the same signature for local functions and remote ones. Does it make more sense now?
Re: [PATCH v3 5/9] x86/mm/tlb: Privatize cpu_tlbstate
> On Jul 19, 2019, at 11:48 AM, Dave Hansen wrote: > > On 7/19/19 11:43 AM, Nadav Amit wrote: >> Andy said that for the lazy tlb optimizations there might soon be more >> shared state. If you prefer, I can move is_lazy outside of tlb_state, and >> not set it in any alternative struct. > > I just wanted to make sure that we capture these rules: > > 1. If the data is only ever accessed on the "owning" CPU via > this_cpu_*(), put it in 'tlb_state'. > 2. If the data is read by other CPUs, put it in 'tlb_state_shared'. > > I actually like the idea of having two structs. Yes, that’s exactly the idea. In the (1) case, we may even be able to mark the struct with the __thread qualifier, which IIRC would prevent memory barriers from causing these values to be reread.
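Dave's two rules above can be sketched as a pair of struct definitions: owner-only fields stay in the private struct, while fields read by other CPUs move to a cacheline-aligned shared struct so remote readers do not bounce the private state's cacheline. Field names follow the patch; the alignment value models x86's 64-byte cachelines:

```c
#include <assert.h>
#include <stddef.h>

/* Rule 1: only ever accessed on the owning CPU via this_cpu_*(). */
struct tlb_state {
	void *loaded_mm;
	unsigned short loaded_mm_asid;
};

/*
 * Rule 2: read by other CPUs (e.g. is_lazy, checked by the shootdown
 * initiator). Cacheline alignment keeps remote reads off the private
 * struct's line.
 */
struct tlb_state_shared {
	_Alignas(64) _Bool is_lazy;
};
```

The alignment forces the shared struct onto its own cacheline, which is what DECLARE_PER_CPU_SHARED_ALIGNED achieves in the patch.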
Re: [PATCH v3 5/9] x86/mm/tlb: Privatize cpu_tlbstate
> On Jul 19, 2019, at 11:38 AM, Dave Hansen wrote: > > On 7/18/19 5:58 PM, Nadav Amit wrote: >> +struct tlb_state_shared { >> +/* >> + * We can be in one of several states: >> + * >> + * - Actively using an mm. Our CPU's bit will be set in >> + *mm_cpumask(loaded_mm) and is_lazy == false; >> + * >> + * - Not using a real mm. loaded_mm == _mm. Our CPU's bit >> + *will not be set in mm_cpumask(_mm) and is_lazy == false. >> + * >> + * - Lazily using a real mm. loaded_mm != _mm, our bit >> + *is set in mm_cpumask(loaded_mm), but is_lazy == true. >> + *We're heuristically guessing that the CR3 load we >> + *skipped more than makes up for the overhead added by >> + *lazy mode. >> + */ >> +bool is_lazy; >> +}; >> +DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state_shared, >> cpu_tlbstate_shared); > > Could we get a comment about what "shared" means and why we need shared > state? > > Should we change 'tlb_state' to 'tlb_state_private’? Andy said that for the lazy tlb optimizations there might soon be more shared state. If you prefer, I can move is_lazy outside of tlb_state, and not set it in any alternative struct.
Re: [PATCH v3 3/9] x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
> On Jul 19, 2019, at 11:36 AM, Dave Hansen wrote: > > On 7/18/19 5:58 PM, Nadav Amit wrote: >> @@ -865,7 +893,7 @@ void arch_tlbbatch_flush(struct >> arch_tlbflush_unmap_batch *batch) >> if (cpumask_test_cpu(cpu, >cpumask)) { >> lockdep_assert_irqs_enabled(); >> local_irq_disable(); >> -flush_tlb_func_local(_flush_tlb_info); >> +flush_tlb_func_local((void *)_flush_tlb_info); >> local_irq_enable(); >> } > > This looks like superfluous churn. Is it? Unfortunately not, since full_flush_tlb_info is defined as const. Without it you would get: warning: passing argument 1 of ‘flush_tlb_func_local’ discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers] And flush_tlb_func_local() should get (void *) argument since it is also used by the SMP infrastructure.
Re: [PATCH v3 8/9] x86/mm/tlb: Remove UV special case
> On Jul 18, 2019, at 7:25 PM, Mike Travis wrote: > > It is a fact that the UV is still the UV and SGI is now part of HPE. The > current external product is known as SuperDome Flex. It is both up to date > as well as very well maintained. The ACK I provided was an okay to change > the code, but please make the description accurate. Indeed. Sorry for that - I will update it in v4. I guess you will be ok with me copy-pasting parts of your response into the commit log.
[RFC 6/7] x86/percpu: Optimized arch_raw_cpu_ptr()
Implementing arch_raw_cpu_ptr() in C allows the compiler to perform better
optimizations, such as using an appropriate base register to compute the
address instead of emitting an add instruction.

The benefit of this computation is only relevant when compiler segment
qualifiers are used. The method is inapplicable when the address size is
greater than the maximum operand size, as is the case when building vdso32.

Distinguish between the two cases in which preemption is disabled (as
happens when this_cpu_ptr() is used) and enabled (when raw_cpu_ptr() is
used). This allows further optimizations; for instance, in
rcu_dynticks_eqs_exit() the following code:

  mov    $0x2bbc0,%rax
  add    %gs:0x7ef07570(%rip),%rax        # 0x10358
  lock xadd %edx,0xd8(%rax)

turns with this patch into:

  mov    %gs:0x7ef08aa5(%rip),%rax        # 0x10358
  lock xadd %edx,0x2bc58(%rax)

Signed-off-by: Nadav Amit

---
 arch/x86/include/asm/percpu.h | 25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 13987f9bc82f..8bac7db397cc 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -73,20 +73,41 @@
 #endif /* USE_X86_SEG_SUPPORT */
 
 #define __my_cpu_offset		this_cpu_read(this_cpu_off)
+#define __raw_my_cpu_offset	__this_cpu_read(this_cpu_off)
+#define __my_cpu_ptr(ptr)	(__my_cpu_type(*ptr) *)(uintptr_t)(ptr)
 
+#if USE_X86_SEG_SUPPORT && (!defined(BUILD_VDSO32) || defined(CONFIG_X86_64))
+/*
+ * Efficient implementation for cases in which the compiler supports C
+ * segments. Allows the compiler to perform additional optimizations that
+ * can save more instructions.
+ *
+ * This optimized version can only be used if the pointer size equals the
+ * native operand size, which does not happen when vdso32 is used.
+ */
+#define __arch_raw_cpu_ptr_qual(qual, ptr)				\
+({									\
+	(qual typeof(*(ptr)) __kernel __force *)((uintptr_t)(ptr) +	\
+						 __my_cpu_offset);	\
+})
+#else /* USE_X86_SEG_SUPPORT && (!defined(BUILD_VDSO32) || defined(CONFIG_X86_64)) */
 /*
  * Compared to the generic __my_cpu_offset version, the following
  * saves one instruction and avoids clobbering a temp register.
  */
-#define arch_raw_cpu_ptr(ptr)				\
+#define __arch_raw_cpu_ptr_qual(qual, ptr)		\
 ({							\
 	unsigned long tcp_ptr__;			\
-	asm volatile("add " __percpu_arg(1) ", %0"	\
+	asm qual ("add " __percpu_arg(1) ", %0"		\
		     : "=r" (tcp_ptr__)			\
		     : "m" (__my_cpu_var(this_cpu_off)),\
		       "0" (ptr));			\
 	(typeof(*(ptr)) __kernel __force *)tcp_ptr__;	\
 })
+#endif /* USE_X86_SEG_SUPPORT && (!defined(BUILD_VDSO32) || defined(CONFIG_X86_64)) */
+
+#define arch_raw_cpu_ptr(ptr)				__arch_raw_cpu_ptr_qual(volatile, ptr)
+#define arch_raw_cpu_ptr_preemption_disabled(ptr)	__arch_raw_cpu_ptr_qual( , ptr)
 #else /* CONFIG_SMP */
 #define __percpu_seg_override
 #define __percpu_prefix		""
-- 
2.17.1
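The address arithmetic the patch moves into C — per-CPU pointer = link-time address plus this CPU's offset — can be modeled in userspace. The arrays, offsets, and helper name here are illustrative stand-ins for the kernel's layout, not the real implementation:

```c
#include <assert.h>
#include <stdint.h>

#define NCPUS 4

static long percpu_area[NCPUS][8];	/* one per-CPU block each */
static uintptr_t this_cpu_off[NCPUS];	/* per-CPU base offsets */

static void setup_offsets(void)
{
	for (int cpu = 0; cpu < NCPUS; cpu++)
		this_cpu_off[cpu] = (uintptr_t)percpu_area[cpu] -
				    (uintptr_t)percpu_area[0];
}

/*
 * Toy analog of arch_raw_cpu_ptr(): add this CPU's offset to the
 * variable's base address. Expressed in C (rather than an asm "add"),
 * the compiler is free to fold the offset into the addressing mode,
 * which is the optimization the patch describes.
 */
static long *cpu_ptr(long *ptr, int cpu)
{
	return (long *)((uintptr_t)ptr + this_cpu_off[cpu]);
}
```

In the kernel the executing CPU's offset lives in the %gs-based this_cpu_off variable rather than an array, but the computation is the same.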
[RFC 4/7] x86: Fix possible caching of current_task
this_cpu_read_stable() is allowed and supposed to cache and return the same value, specifically for current_task. In practice it does not cache current_task very well, which hides a potential invalid caching of current when the task is switched in __switch_to().

Avoid the potential invalid caching by not using current within __switch_to()'s dynamic extent.

Signed-off-by: Nadav Amit --- arch/x86/include/asm/fpu/internal.h| 7 --- arch/x86/include/asm/resctrl_sched.h | 14 +++--- arch/x86/kernel/cpu/resctrl/rdtgroup.c | 4 ++-- arch/x86/kernel/process_32.c | 4 ++-- arch/x86/kernel/process_64.c | 4 ++-- 5 files changed, 17 insertions(+), 16 deletions(-) diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h index 4c95c365058a..b537788600fe 100644 --- a/arch/x86/include/asm/fpu/internal.h +++ b/arch/x86/include/asm/fpu/internal.h @@ -588,9 +588,10 @@ static inline void switch_fpu_prepare(struct fpu *old_fpu, int cpu) /* * Load PKRU from the FPU context if available. Delay loading of the - * complete FPU state until the return to userland. + * complete FPU state until the return to userland. Avoid reading current during + * switch. */ -static inline void switch_fpu_finish(struct fpu *new_fpu) +static inline void switch_fpu_finish(struct task_struct *task, struct fpu *new_fpu) { u32 pkru_val = init_pkru_value; struct pkru_state *pk; @@ -598,7 +599,7 @@ static inline void switch_fpu_finish(struct fpu *new_fpu) if (!static_cpu_has(X86_FEATURE_FPU)) return; - set_thread_flag(TIF_NEED_FPU_LOAD); + set_ti_thread_flag(task_thread_info(task), TIF_NEED_FPU_LOAD); if (!cpu_feature_enabled(X86_FEATURE_OSPKE)) return; diff --git a/arch/x86/include/asm/resctrl_sched.h b/arch/x86/include/asm/resctrl_sched.h index f6b7fe2833cc..9a00d9df9d02 100644 --- a/arch/x86/include/asm/resctrl_sched.h +++ b/arch/x86/include/asm/resctrl_sched.h @@ -51,7 +51,7 @@ DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key); * simple as possible. * Must be called with preemption disabled. 
*/ -static void __resctrl_sched_in(void) +static void __resctrl_sched_in(struct task_struct *task) { struct resctrl_pqr_state *state = this_cpu_ptr(_state); u32 closid = state->default_closid; @@ -62,13 +62,13 @@ static void __resctrl_sched_in(void) * Else use the closid/rmid assigned to this cpu. */ if (static_branch_likely(_alloc_enable_key)) { - if (current->closid) + if (task->closid) closid = current->closid; } if (static_branch_likely(_mon_enable_key)) { - if (current->rmid) - rmid = current->rmid; + if (task->rmid) + rmid = task->rmid; } if (closid != state->cur_closid || rmid != state->cur_rmid) { @@ -78,15 +78,15 @@ static void __resctrl_sched_in(void) } } -static inline void resctrl_sched_in(void) +static inline void resctrl_sched_in(struct task_struct *task) { if (static_branch_likely(_enable_key)) - __resctrl_sched_in(); + __resctrl_sched_in(task); } #else -static inline void resctrl_sched_in(void) {} +static inline void resctrl_sched_in(struct task_struct *task) {} #endif /* CONFIG_X86_CPU_RESCTRL */ diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c b/arch/x86/kernel/cpu/resctrl/rdtgroup.c index bf3034994754..71bd82a6e3c6 100644 --- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c +++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c @@ -311,7 +311,7 @@ static void update_cpu_closid_rmid(void *info) * executing task might have its own closid selected. Just reuse * the context switch code. 
*/ - resctrl_sched_in(); + resctrl_sched_in(current); } /* @@ -536,7 +536,7 @@ static void move_myself(struct callback_head *head) preempt_disable(); /* update PQR_ASSOC MSR to make resource group go into effect */ - resctrl_sched_in(); + resctrl_sched_in(current); preempt_enable(); kfree(callback); diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c index b8ceec4974fe..699a4c95ab13 100644 --- a/arch/x86/kernel/process_32.c +++ b/arch/x86/kernel/process_32.c @@ -292,10 +292,10 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p) this_cpu_write(current_task, next_p); - switch_fpu_finish(next_fpu); + switch_fpu_finish(next_p, next_fpu); /* Load the Intel cache allocation PQR MSR. */ - resctrl_sched_in(); + resctrl_sched_in(next_p); return prev_p; } diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c index 250e4c4ac6d9..e945bc744804 100644 --- a/arch/x86/kernel/process_64.c +++ b/arch/x86/kernel/process_64.c @@ -575,7 +575,7 @@ __switch
[RFC 3/7] x86/percpu: Use C for percpu accesses when possible
The percpu code mostly uses inline assembly. Using segment qualifiers allows to use C code instead, which enables the compiler to perform various optimizations (e.g., CSE). For example, in __schedule() the following two instructions: mov%gs:0x7e5f1eff(%rip),%edx# 0x10350 movslq %edx,%rdx Turn with this patch into: movslq %gs:0x7e5f2e6e(%rip),%rax# 0x10350 In addition, operations that have no guarantee against concurrent interrupts or preemption, such as __this_cpu_cmpxchg() can be further optimized by the compiler when they are implemented in C, for example in call_timer_fn(). Signed-off-by: Nadav Amit --- arch/x86/include/asm/percpu.h | 115 ++--- arch/x86/include/asm/preempt.h | 3 +- 2 files changed, 107 insertions(+), 11 deletions(-) diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h index 1fe348884477..13987f9bc82f 100644 --- a/arch/x86/include/asm/percpu.h +++ b/arch/x86/include/asm/percpu.h @@ -439,13 +439,88 @@ do { \ */ #define this_cpu_read_stable(var) percpu_stable_op("mov", var) +#if USE_X86_SEG_SUPPORT + +#define __raw_cpu_read(qual, pcp) \ +({ \ + *(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp)); \ +}) + +#define __raw_cpu_write(qual, pcp, val) \ + do {\ + *(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp)) = (val); \ + } while (0) + +/* + * Performance-wise, C operations are only more efficient than their inline + * assembly counterparts for non-volatile variables (__this_*) and for volatile + * loads and stores. 
+ * + * Since we do not use assembly, we are free to define 64-bit operations + * on 32-bit architecture + */ +#define __raw_cpu_add(pcp, val)do { __my_cpu_var(pcp) += (val); } while (0) +#define __raw_cpu_and(pcp, val)do { __my_cpu_var(pcp) &= (val); } while (0) +#define __raw_cpu_or(pcp, val) do { __my_cpu_var(pcp) |= (val); } while (0) +#define __raw_cpu_add_return(pcp, val) ({ __my_cpu_var(pcp) += (val); }) + +#define __raw_cpu_xchg(pcp, val) \ +({ \ + typeof(pcp) pxo_ret__ = __my_cpu_var(pcp); \ + \ + __my_cpu_var(pcp) = (val); \ + pxo_ret__; \ +}) + +#define __raw_cpu_cmpxchg(pcp, oval, nval) \ +({ \ + __my_cpu_type(pcp) *__p = __my_cpu_ptr(&(pcp)); \ + \ + typeof(pcp) __ret = *__p; \ + \ + if (__ret == (oval))\ + *__p = nval;\ + __ret; \ +}) + +#define raw_cpu_read_1(pcp)__raw_cpu_read(, pcp) +#define raw_cpu_read_2(pcp)__raw_cpu_read(, pcp) +#define raw_cpu_read_4(pcp)__raw_cpu_read(, pcp) +#define raw_cpu_write_1(pcp, val) __raw_cpu_write(, pcp, val) +#define raw_cpu_write_2(pcp, val) __raw_cpu_write(, pcp, val) +#define raw_cpu_write_4(pcp, val) __raw_cpu_write(, pcp, val) +#define raw_cpu_add_1(pcp, val)__raw_cpu_add(pcp, val) +#define raw_cpu_add_2(pcp, val)__raw_cpu_add(pcp, val) +#define raw_cpu_add_4(pcp, val)__raw_cpu_add(pcp, val) +#define raw_cpu_and_1(pcp, val)__raw_cpu_and(pcp, val) +#define raw_cpu_and_2(pcp, val)__raw_cpu_and(pcp, val) +#define raw_cpu_and_4(pcp, val)__raw_cpu_and(pcp, val) +#define raw_cpu_or_1(pcp, val) __raw_cpu_or(pcp, val) +#define raw_cpu_or_2(pcp, val) __raw_cpu_or(pcp, val) +#define raw_cpu_or_4(pcp, val) __raw_cpu_or(pcp, val) +#define raw_cpu_xchg_1(pcp, val) __raw_cpu_xchg(pcp, val) +#define raw_cpu_xchg_2(pcp, val) __raw_cpu_xchg(pcp, val) +#define raw_cpu_xchg_4(pcp, val) __raw_cpu_xchg(pcp, val) +#define raw_cpu_add_return_1(pcp, val) __raw_cpu_add_return(pcp, val) +#define raw_cpu_add_return_2(pcp, val) __raw_cpu_add_return(pcp, val) +#define raw_cpu_add_return_4(pcp, val) __raw_cpu_add_return(pcp, val) +#define 
raw_cpu_add_return_8(pcp, val) __raw_cpu_add_return(pcp, val) +#define raw_cpu
[RFC 1/7] compiler: Report x86 segment support
GCC v6+ supports x86 segment qualifiers (__seg_gs and __seg_fs). Define
COMPILER_HAS_X86_SEG_SUPPORT when they are supported.

Signed-off-by: Nadav Amit

---
 include/linux/compiler-gcc.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
index e8579412ad21..787b5971e41f 100644
--- a/include/linux/compiler-gcc.h
+++ b/include/linux/compiler-gcc.h
@@ -149,6 +149,10 @@
 #define COMPILER_HAS_GENERIC_BUILTIN_OVERFLOW 1
 #endif
 
+#if GCC_VERSION >= 60000
+#define COMPILER_HAS_X86_SEG_SUPPORT 1
+#endif
+
 /*
  * Turn individual warnings and errors on and off locally, depending
  * on version.
-- 
2.17.1
[RFC 0/7] x86/percpu: Use segment qualifiers
GCC 6+ supports segment qualifiers. Using them makes it possible to implement several optimizations:

1. Avoid unnecessary instructions when an operation is carried out on a read/written per-cpu value, and instead allow the compiler to emit instructions that access the per-cpu value directly.

2. Make this_cpu_ptr() more efficient and allow its value to be cached, since preemption must be disabled when this_cpu_ptr() is used.

3. Provide a better alternative for this_cpu_read_stable() that caches values more efficiently, using the alias attribute to a const variable.

4. Allow the compiler to perform other optimizations (e.g. CSE).

5. Use rip-relative addressing in per_cpu_read_stable(), which makes it PIE-ready.

"size" and Peter's compare script do not seem to show the code-size impact correctly. Summing the code size according to nm on defconfig shows a minor reduction from 11349763 to 11339840 (0.09%).

Nadav Amit (7):
  compiler: Report x86 segment support
  x86/percpu: Use compiler segment prefix qualifier
  x86/percpu: Use C for percpu accesses when possible
  x86: Fix possible caching of current_task
  percpu: Assume preemption is disabled on per_cpu_ptr()
  x86/percpu: Optimized arch_raw_cpu_ptr()
  x86/current: Aggressive caching of current

 arch/x86/include/asm/current.h         |  30 +++
 arch/x86/include/asm/fpu/internal.h    |   7 +-
 arch/x86/include/asm/percpu.h          | 293 +++++++++++++++++++------
 arch/x86/include/asm/preempt.h         |   3 +-
 arch/x86/include/asm/resctrl_sched.h   |  14 +-
 arch/x86/kernel/cpu/Makefile           |   1 +
 arch/x86/kernel/cpu/common.c           |   7 +-
 arch/x86/kernel/cpu/current.c          |  16 ++
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |   4 +-
 arch/x86/kernel/process_32.c           |   4 +-
 arch/x86/kernel/process_64.c           |   4 +-
 include/asm-generic/percpu.h           |  12 +
 include/linux/compiler-gcc.h           |   4 +
 include/linux/compiler.h               |   2 +-
 include/linux/percpu-defs.h            |  33 ++-
 15 files changed, 346 insertions(+), 88 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/current.c

-- 
2.17.1
[RFC 2/7] x86/percpu: Use compiler segment prefix qualifier
Using a segment prefix qualifier is cleaner than using a segment prefix in the inline assembly, and provides the compiler with more information, telling it that __seg_gs:[addr] is different than [addr] when it analyzes data dependencies. It also enables various optimizations that will be implemented in the next patches. Use segment prefix qualifiers when they are supported. Unfortunately, gcc does not provide a way to remove segment qualifiers, which is needed to use typeof() to create local instances of the per-cpu variable. For this reason, do not use the segment qualifier for per-cpu variables, and do casting using the segment qualifier instead. Signed-off-by: Nadav Amit --- arch/x86/include/asm/percpu.h | 153 ++ 1 file changed, 102 insertions(+), 51 deletions(-) diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h index 2278797c769d..1fe348884477 100644 --- a/arch/x86/include/asm/percpu.h +++ b/arch/x86/include/asm/percpu.h @@ -45,8 +45,33 @@ #include #include +#if defined(COMPILER_HAS_X86_SEG_SUPPORT) && IS_ENABLED(CONFIG_SMP) +#define USE_X86_SEG_SUPPORT1 +#else +#define USE_X86_SEG_SUPPORT0 +#endif + #ifdef CONFIG_SMP + +#if USE_X86_SEG_SUPPORT + +#ifdef CONFIG_X86_64 +#define __percpu_seg_override __seg_gs +#else +#define __percpu_seg_override __seg_fs +#endif + +#define __percpu_prefix"" +#define __force_percpu_prefix "%%"__stringify(__percpu_seg)":" + +#else /* USE_X86_SEG_SUPPORT */ + +#define __percpu_seg_override #define __percpu_prefix"%%"__stringify(__percpu_seg)":" +#define __force_percpu_prefix __percpu_prefix + +#endif /* USE_X86_SEG_SUPPORT */ + #define __my_cpu_offsetthis_cpu_read(this_cpu_off) /* @@ -58,14 +83,21 @@ unsigned long tcp_ptr__;\ asm volatile("add " __percpu_arg(1) ", %0" \ : "=r" (tcp_ptr__) \ -: "m" (this_cpu_off), "0" (ptr)); \ +: "m" (__my_cpu_var(this_cpu_off)),\ + "0" (ptr)); \ (typeof(*(ptr)) __kernel __force *)tcp_ptr__; \ }) -#else +#else /* CONFIG_SMP */ +#define __percpu_seg_override #define 
__percpu_prefix"" -#endif +#define __force_percpu_prefix "" +#endif /* CONFIG_SMP */ +#define __my_cpu_type(var) typeof(var) __percpu_seg_override +#define __my_cpu_ptr(ptr) (__my_cpu_type(*ptr) *)(uintptr_t)(ptr) +#define __my_cpu_var(var) (*__my_cpu_ptr()) #define __percpu_arg(x)__percpu_prefix "%" #x +#define __force_percpu_arg(x) __force_percpu_prefix "%" #x /* * Initialized pointers to per-cpu variables needed for the boot @@ -98,22 +130,22 @@ do { \ switch (sizeof(var)) { \ case 1: \ asm qual (op "b %1,"__percpu_arg(0) \ - : "+m" (var)\ + : "+m" (__my_cpu_var(var)) \ : "qi" ((pto_T__)(val))); \ break; \ case 2: \ asm qual (op "w %1,"__percpu_arg(0) \ - : "+m" (var)\ + : "+m" (__my_cpu_var(var)) \ : "ri" ((pto_T__)(val))); \ break; \ case 4: \ asm qual (op "l %1,"__percpu_arg(0) \ - : "+m" (var)\ + : "+m" (__my_cpu_var(var)) \ : "ri" ((pto_T__)(val))); \ break; \ case 8: \ asm qual (op "q %1,"__percpu_arg(0) \ - : "+m" (var)\ + : "+m" (__my_cpu_var(var)) \ : "re" ((pto_T__)(val))); \ break; \ default: __bad_percpu_size(); \ @@ -138,71 +170,79 @@ do { \ switch (sizeof(var)) { \ case 1: \
[RFC 5/7] percpu: Assume preemption is disabled on per_cpu_ptr()
When per_cpu_ptr() is used, the caller should have preemption disabled, as otherwise the pointer is meaningless. If the user wants an "unstable" pointer he should call raw_cpu_ptr(). Add an assertion to check that indeed preemption is disabled, and distinguish between the two cases to allow further, per-arch optimizations. Signed-off-by: Nadav Amit --- include/asm-generic/percpu.h | 12 include/linux/percpu-defs.h | 33 - 2 files changed, 44 insertions(+), 1 deletion(-) diff --git a/include/asm-generic/percpu.h b/include/asm-generic/percpu.h index c2de013b2cf4..7853605f4210 100644 --- a/include/asm-generic/percpu.h +++ b/include/asm-generic/percpu.h @@ -36,6 +36,14 @@ extern unsigned long __per_cpu_offset[NR_CPUS]; #define my_cpu_offset __my_cpu_offset #endif +/* + * Determine the offset of the current active processor when preemption is + * disabled. Can be overriden by arch code. + */ +#ifndef __raw_my_cpu_offset +#define __raw_my_cpu_offset __my_cpu_offset +#endif + /* * Arch may define arch_raw_cpu_ptr() to provide more efficient address * translations for raw_cpu_ptr(). 
@@ -44,6 +52,10 @@ extern unsigned long __per_cpu_offset[NR_CPUS]; #define arch_raw_cpu_ptr(ptr) SHIFT_PERCPU_PTR(ptr, __my_cpu_offset) #endif +#ifndef arch_raw_cpu_ptr_preemptable +#define arch_raw_cpu_ptr_preemptable(ptr) SHIFT_PERCPU_PTR(ptr, __raw_my_cpu_offset) +#endif + #ifdef CONFIG_HAVE_SETUP_PER_CPU_AREA extern void setup_per_cpu_areas(void); #endif diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h index a6fabd865211..13afca8a37e7 100644 --- a/include/linux/percpu-defs.h +++ b/include/linux/percpu-defs.h @@ -237,20 +237,51 @@ do { \ SHIFT_PERCPU_PTR((ptr), per_cpu_offset((cpu))); \ }) +#ifndef arch_raw_cpu_ptr_preemption_disabled +#define arch_raw_cpu_ptr_preemption_disabled(ptr) \ + arch_raw_cpu_ptr(ptr) +#endif + +#define raw_cpu_ptr_preemption_disabled(ptr) \ +({ \ + __verify_pcpu_ptr(ptr); \ + arch_raw_cpu_ptr_preemption_disabled(ptr); \ +}) + +/* + * If preemption is enabled, we need to read the pointer atomically on + * raw_cpu_ptr(). However if it is disabled, we can use the + * raw_cpu_ptr_nopreempt(), which is potentially more efficient. Similarly, we + * can use the preemption-disabled version if the kernel is non-preemptable or + * if voluntary preemption is used. + */ +#ifdef CONFIG_PREEMPT + #define raw_cpu_ptr(ptr) \ ({ \ __verify_pcpu_ptr(ptr); \ arch_raw_cpu_ptr(ptr); \ }) +#else + +#define raw_cpu_ptr(ptr) raw_cpu_ptr_preemption_disabled(ptr) + +#endif + #ifdef CONFIG_DEBUG_PREEMPT +/* + * Unlike other this_cpu_* operations, this_cpu_ptr() requires that preemption + * will be disabled. In contrast, raw_cpu_ptr() does not require that. + */ #define this_cpu_ptr(ptr) \ ({ \ + __this_cpu_preempt_check("ptr");\ __verify_pcpu_ptr(ptr); \ SHIFT_PERCPU_PTR(ptr, my_cpu_offset); \ }) #else -#define this_cpu_ptr(ptr) raw_cpu_ptr(ptr) +#define this_cpu_ptr(ptr) raw_cpu_ptr_preemption_disabled(ptr) #endif #else /* CONFIG_SMP */ -- 2.17.1
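The debug assertion the patch adds to this_cpu_ptr() — the pointer is only meaningful with preemption disabled, so a debug build checks a preempt count before handing it out — can be modeled with a small userspace toy. preempt_count and the helper names are simplified stand-ins for the kernel's:

```c
#include <assert.h>

static int preempt_count;
static int assertion_failures;	/* stands in for __this_cpu_preempt_check() firing */

static void preempt_disable(void) { preempt_count++; }
static void preempt_enable(void)  { preempt_count--; }

/*
 * Toy this_cpu_ptr(): with preemption enabled the returned pointer could
 * go stale the moment the caller migrates, so the debug check records a
 * violation. The per-CPU offset arithmetic itself is elided here.
 */
static int *this_cpu_ptr(int *ptr)
{
	if (preempt_count == 0)
		assertion_failures++;
	return ptr;
}
```

Callers that merely want an "unstable" pointer would use raw_cpu_ptr() instead, which carries no such check.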
[RFC 7/7] x86/current: Aggressive caching of current
The current_task is supposed to be constant in each thread and therefore does not need to be reread. There is already an attempt to cache it using inline assembly, via this_cpu_read_stable(), which hides the dependency on the read memory address. However, this caching is not working very well. For example, sync_mm_rss() still reads current_task twice for no reason. Allow more aggressive caching by aliasing current_task into a constant, const_current_task, and reading from the constant copy. Doing so requires the compiler to support x86 segment qualifiers. Hide const_current_task in a different compilation unit to prevent the compiler from assuming that the value is constant during compilation. Signed-off-by: Nadav Amit --- arch/x86/include/asm/current.h | 30 ++ arch/x86/kernel/cpu/Makefile | 1 + arch/x86/kernel/cpu/common.c | 7 +-- arch/x86/kernel/cpu/current.c | 16 include/linux/compiler.h | 2 +- 5 files changed, 49 insertions(+), 7 deletions(-) create mode 100644 arch/x86/kernel/cpu/current.c diff --git a/arch/x86/include/asm/current.h b/arch/x86/include/asm/current.h index 3e204e6140b5..7f093e81a647 100644 --- a/arch/x86/include/asm/current.h +++ b/arch/x86/include/asm/current.h @@ -10,11 +10,41 @@ struct task_struct; DECLARE_PER_CPU(struct task_struct *, current_task); +#if USE_X86_SEG_SUPPORT + +/* + * Hold a constant alias for current_task, which makes it possible to avoid + * re-reading the current task. + * + * We must mark const_current_task with the segment qualifiers, as otherwise gcc + * would do redundant reads of const_current_task. + */ +DECLARE_PER_CPU(struct task_struct * const __percpu_seg_override, const_current_task); + +static __always_inline struct task_struct *get_current(void) +{ + + /* +* GCC is missing functionality of removing segment qualifiers, which +* messes with per-cpu infrastructure that holds local copies. Use +* __raw_cpu_read to avoid holding any copy.
+*/ + return __raw_cpu_read(, const_current_task); +} + +#else /* USE_X86_SEG_SUPPORT */ + +/* + * Without segment qualifier support, the per-cpu infrastructure is not + * suitable for reading constants, so use this_cpu_read_stable() in this case. + */ static __always_inline struct task_struct *get_current(void) { return this_cpu_read_stable(current_task); } +#endif /* USE_X86_SEG_SUPPORT */ + #define current get_current() #endif /* __ASSEMBLY__ */ diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile index d7a1e5a9331c..d816f03a37d7 100644 --- a/arch/x86/kernel/cpu/Makefile +++ b/arch/x86/kernel/cpu/Makefile @@ -19,6 +19,7 @@ CFLAGS_common.o := $(nostackp) obj-y := cacheinfo.o scattered.o topology.o obj-y += common.o +obj-y += current.o obj-y += rdrand.o obj-y += match.o obj-y += bugs.o diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index e0489d2860d3..33a6b51e8059 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -1607,13 +1607,8 @@ DEFINE_PER_CPU_FIRST(struct fixed_percpu_data, EXPORT_PER_CPU_SYMBOL_GPL(fixed_percpu_data); /* - * The following percpu variables are hot. Align current_task to - * cacheline size such that they fall in the same cacheline. + * The following percpu variables are hot. */ -DEFINE_PER_CPU(struct task_struct *, current_task) cacheline_aligned = - &init_task; -EXPORT_PER_CPU_SYMBOL(current_task); - DEFINE_PER_CPU(struct irq_stack *, hardirq_stack_ptr); DEFINE_PER_CPU(unsigned int, irq_count) __visible = -1; diff --git a/arch/x86/kernel/cpu/current.c b/arch/x86/kernel/cpu/current.c new file mode 100644 index 000000000000..3238c6e34984 --- /dev/null +++ b/arch/x86/kernel/cpu/current.c @@ -0,0 +1,16 @@ +#include +#include + +/* + * Align current_task to cacheline size such that they fall in the same + * cacheline.
+ */ +DEFINE_PER_CPU(struct task_struct *, current_task) cacheline_aligned = + &init_task; +EXPORT_PER_CPU_SYMBOL(current_task); + +#if USE_X86_SEG_SUPPORT +DECLARE_PER_CPU(struct task_struct * const __percpu_seg_override, const_current_task) + __attribute__((alias("current_task"))); +EXPORT_PER_CPU_SYMBOL(const_current_task); +#endif diff --git a/include/linux/compiler.h b/include/linux/compiler.h index f0fd5636fddb..1b6ee9ab6373 100644 --- a/include/linux/compiler.h +++ b/include/linux/compiler.h @@ -299,7 +299,7 @@ unsigned long read_word_at_a_time(const void *addr) */ #define __ADDRESSABLE(sym) \ static void * __section(".discard.addressable") __used \ - __PASTE(__addressable_##sym, __LINE__) = (void *)&sym; + __PASTE(__addressable_##sym, __LINE__) = (void *)(uintptr_t)&sym; /** * offset_to_ptr - convert a relative memory offset to an absolute pointer -- 2.17.1
[PATCH v3 2/9] x86/mm/tlb: Remove reason as argument for flush_tlb_func_local()
To use flush_tlb_func_local() as an argument to __smp_call_function_many() we need it to have a single (void *) parameter. Eliminate the second parameter and deduce the reason for the flush. Cc: Peter Zijlstra Cc: Dave Hansen Cc: Rik van Riel Cc: Thomas Gleixner Cc: Andy Lutomirski Cc: Josh Poimboeuf Signed-off-by: Nadav Amit --- arch/x86/mm/tlb.c | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 4de9704c4aaf..233f3d8308db 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -635,9 +635,12 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f, this_cpu_write(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen, mm_tlb_gen); } -static void flush_tlb_func_local(const void *info, enum tlb_flush_reason reason) +static void flush_tlb_func_local(void *info) { const struct flush_tlb_info *f = info; + enum tlb_flush_reason reason; + + reason = (f->mm == NULL) ? TLB_LOCAL_SHOOTDOWN : TLB_LOCAL_MM_SHOOTDOWN; flush_tlb_func_common(f, true, reason); } @@ -790,7 +793,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) { lockdep_assert_irqs_enabled(); local_irq_disable(); - flush_tlb_func_local(info, TLB_LOCAL_MM_SHOOTDOWN); + flush_tlb_func_local(info); local_irq_enable(); } @@ -862,7 +865,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch) if (cpumask_test_cpu(cpu, &batch->cpumask)) { lockdep_assert_irqs_enabled(); local_irq_disable(); - flush_tlb_func_local(&full_flush_tlb_info, TLB_LOCAL_SHOOTDOWN); + flush_tlb_func_local(&full_flush_tlb_info); local_irq_enable(); } -- 2.20.1
[PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently
To improve TLB shootdown performance, flush the remote and local TLBs concurrently. Introduce flush_tlb_multi() that does so. Introduce paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen and hyper-v are only compile-tested). While the updated smp infrastructure is capable of running a function on a single local core, it is not optimized for this case. The multiple function calls and the indirect branch introduce some overhead, and might make local TLB flushes slower than they were before the recent changes. Before calling the SMP infrastructure, check whether only a local TLB flush is needed, to restore the lost performance in this common case. This requires checking mm_cpumask() one more time, but unless this mask is updated very frequently, this should not impact performance negatively. Cc: "K. Y. Srinivasan" Cc: Haiyang Zhang Cc: Stephen Hemminger Cc: Sasha Levin Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: x...@kernel.org Cc: Juergen Gross Cc: Paolo Bonzini Cc: Dave Hansen Cc: Andy Lutomirski Cc: Peter Zijlstra Cc: Boris Ostrovsky Cc: linux-hyp...@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: virtualizat...@lists.linux-foundation.org Cc: k...@vger.kernel.org Cc: xen-de...@lists.xenproject.org Signed-off-by: Nadav Amit --- arch/x86/hyperv/mmu.c | 10 +++--- arch/x86/include/asm/paravirt.h | 6 ++-- arch/x86/include/asm/paravirt_types.h | 4 +-- arch/x86/include/asm/tlbflush.h | 8 ++--- arch/x86/include/asm/trace/hyperv.h | 2 +- arch/x86/kernel/kvm.c | 11 +-- arch/x86/kernel/paravirt.c| 2 +- arch/x86/mm/tlb.c | 47 ++- arch/x86/xen/mmu_pv.c | 11 +++ include/trace/events/xen.h| 2 +- 10 files changed, 62 insertions(+), 41 deletions(-) diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c index e65d7fe6489f..8740d8b21db3 100644 --- a/arch/x86/hyperv/mmu.c +++ b/arch/x86/hyperv/mmu.c @@ -50,8 +50,8 @@ static inline int fill_gva_list(u64 gva_list[], int offset, return gva_n - offset; } -static void hyperv_flush_tlb_others(const struct
cpumask *cpus, - const struct flush_tlb_info *info) +static void hyperv_flush_tlb_multi(const struct cpumask *cpus, + const struct flush_tlb_info *info) { int cpu, vcpu, gva_n, max_gvas; struct hv_tlb_flush **flush_pcpu; @@ -59,7 +59,7 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus, u64 status = U64_MAX; unsigned long flags; - trace_hyperv_mmu_flush_tlb_others(cpus, info); + trace_hyperv_mmu_flush_tlb_multi(cpus, info); if (!hv_hypercall_pg) goto do_native; @@ -156,7 +156,7 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus, if (!(status & HV_HYPERCALL_RESULT_MASK)) return; do_native: - native_flush_tlb_others(cpus, info); + native_flush_tlb_multi(cpus, info); } static u64 hyperv_flush_tlb_others_ex(const struct cpumask *cpus, @@ -231,6 +231,6 @@ void hyperv_setup_mmu_ops(void) return; pr_info("Using hypercall for remote TLB flush\n"); - pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others; + pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi; pv_ops.mmu.tlb_remove_table = tlb_remove_table; } diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index dce26f1d13e1..8c6c2394393b 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -62,10 +62,10 @@ static inline void __flush_tlb_one_user(unsigned long addr) PVOP_VCALL1(mmu.flush_tlb_one_user, addr); } -static inline void flush_tlb_others(const struct cpumask *cpumask, - const struct flush_tlb_info *info) +static inline void flush_tlb_multi(const struct cpumask *cpumask, + const struct flush_tlb_info *info) { - PVOP_VCALL2(mmu.flush_tlb_others, cpumask, info); + PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info); } static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table) diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index 639b2df445ee..c82969f38845 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -211,8 +211,8 
@@ struct pv_mmu_ops { void (*flush_tlb_user)(void); void (*flush_tlb_kernel)(void); void (*flush_tlb_one_user)(unsigned long addr); - void (*flush_tlb_others)(const struct cpumask *cpus, -const struct flush_tlb_info *info); + void (*flush_tlb_multi)(const struct cpumask *cpus, + const struct flush_tlb_info *info); void (*tlb_remove_table)(struct mmu_
[PATCH v3 9/9] x86/mm/tlb: Remove unnecessary uses of the inline keyword
The compiler is smart enough without these hints. Cc: Andy Lutomirski Cc: Peter Zijlstra Suggested-by: Dave Hansen Signed-off-by: Nadav Amit --- arch/x86/mm/tlb.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 40daad52ec7d..2ddc32e787f3 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -189,7 +189,7 @@ static void sync_current_stack_to_mm(struct mm_struct *mm) } } -static inline unsigned long mm_mangle_tif_spec_ib(struct task_struct *next) +static unsigned long mm_mangle_tif_spec_ib(struct task_struct *next) { unsigned long next_tif = task_thread_info(next)->flags; unsigned long ibpb = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_IBPB; @@ -748,7 +748,7 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct flush_tlb_info, flush_tlb_info); static DEFINE_PER_CPU(unsigned int, flush_tlb_info_idx); #endif -static inline struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm, +static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm, unsigned long start, unsigned long end, unsigned int stride_shift, bool freed_tables, u64 new_tlb_gen) @@ -774,7 +774,7 @@ static inline struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm, return info; } -static inline void put_flush_tlb_info(void) +static void put_flush_tlb_info(void) { #ifdef CONFIG_DEBUG_VM /* Complete reentrency prevention checks */ -- 2.20.1
[PATCH v3 5/9] x86/mm/tlb: Privatize cpu_tlbstate
cpu_tlbstate is mostly private and only the variable is_lazy is shared. This causes some false-sharing when TLB flushes are performed. Break cpu_tlbstate into cpu_tlbstate and cpu_tlbstate_shared, and mark each one accordingly. Cc: Dave Hansen Cc: Andy Lutomirski Cc: Peter Zijlstra Signed-off-by: Nadav Amit --- arch/x86/include/asm/tlbflush.h | 39 ++--- arch/x86/mm/init.c | 2 +- arch/x86/mm/tlb.c | 15 - 3 files changed, 31 insertions(+), 25 deletions(-) diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 610e47dc66ef..8490591d3cdf 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -178,23 +178,6 @@ struct tlb_state { u16 loaded_mm_asid; u16 next_asid; - /* -* We can be in one of several states: -* -* - Actively using an mm. Our CPU's bit will be set in -*mm_cpumask(loaded_mm) and is_lazy == false; -* -* - Not using a real mm. loaded_mm == &init_mm. Our CPU's bit -*will not be set in mm_cpumask(&init_mm) and is_lazy == false. -* -* - Lazily using a real mm. loaded_mm != &init_mm, our bit -*is set in mm_cpumask(loaded_mm), but is_lazy == true. -*We're heuristically guessing that the CR3 load we -*skipped more than makes up for the overhead added by -*lazy mode. -*/ - bool is_lazy; - /* * If set we changed the page tables in such a way that we * needed an invalidation of all contexts (aka. PCIDs / ASIDs). @@ -240,7 +223,27 @@ struct tlb_state { */ struct tlb_context ctxs[TLB_NR_DYN_ASIDS]; }; -DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate); +DECLARE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate); + +struct tlb_state_shared { + /* +* We can be in one of several states: +* +* - Actively using an mm. Our CPU's bit will be set in +*mm_cpumask(loaded_mm) and is_lazy == false; +* +* - Not using a real mm. loaded_mm == &init_mm. Our CPU's bit +*will not be set in mm_cpumask(&init_mm) and is_lazy == false. +* +* - Lazily using a real mm.
loaded_mm != &init_mm, our bit +*is set in mm_cpumask(loaded_mm), but is_lazy == true. +*We're heuristically guessing that the CR3 load we +*skipped more than makes up for the overhead added by +*lazy mode. +*/ + bool is_lazy; +}; +DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared); /* * Blindly accessing user memory from NMI context can be dangerous diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c index fd10d91a6115..34027f36a944 100644 --- a/arch/x86/mm/init.c +++ b/arch/x86/mm/init.c @@ -951,7 +951,7 @@ void __init zone_sizes_init(void) free_area_init_nodes(max_zone_pfns); } -__visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = { +__visible DEFINE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate) = { .loaded_mm = &init_mm, .next_asid = 1, .cr4 = ~0UL, /* fail hard if we screw up cr4 shadow initialization */ diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 63c00908bdd9..af80c274c88d 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -145,7 +145,7 @@ void leave_mm(int cpu) return; /* Warn if we're not lazy.
*/ - WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy)); + WARN_ON(!this_cpu_read(cpu_tlbstate_shared.is_lazy)); switch_mm(NULL, &init_mm, NULL); } @@ -277,7 +277,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, { struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm); u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid); - bool was_lazy = this_cpu_read(cpu_tlbstate.is_lazy); + bool was_lazy = this_cpu_read(cpu_tlbstate_shared.is_lazy); unsigned cpu = smp_processor_id(); u64 next_tlb_gen; bool need_flush; @@ -322,7 +322,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, __flush_tlb_all(); } #endif - this_cpu_write(cpu_tlbstate.is_lazy, false); + this_cpu_write(cpu_tlbstate_shared.is_lazy, false); /* * The membarrier system call requires a full memory barrier and @@ -463,7 +463,7 @@ void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm) return; - this_cpu_write(cpu_tlbstate.is_lazy, true); + this_cpu_write(cpu_tlbstate_shared.is_lazy, true); } /* @@ -544,7 +544,7 @@ static void flush_tlb_func_common(const struct flush_tlb_info *f, VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) != loaded_mm->context.ctx
[PATCH v3 3/9] x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
Open-code on_each_cpu_cond_mask() in native_flush_tlb_others() to optimize the code. Open-coding eliminates the need for the indirect branch that is used to call tlb_is_not_lazy(), and on CPUs that are vulnerable to Spectre v2, it eliminates the retpoline. In addition, it allows using a preallocated cpumask to compute the CPUs that should be flushed. This would later allow us not to adapt on_each_cpu_cond_mask() to support local and remote functions. Note that calling tlb_is_not_lazy() for every CPU that needs to be flushed, as done in native_flush_tlb_multi(), might look ugly, but it is equivalent to what is currently done in on_each_cpu_cond_mask(). Actually, native_flush_tlb_multi() does it more efficiently since it avoids the indirect branch. Cc: Peter Zijlstra Cc: Dave Hansen Cc: Rik van Riel Cc: Thomas Gleixner Cc: Andy Lutomirski Cc: Josh Poimboeuf Signed-off-by: Nadav Amit --- arch/x86/mm/tlb.c | 40 ++-- 1 file changed, 34 insertions(+), 6 deletions(-) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 233f3d8308db..abbf55fa8b81 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -658,11 +658,13 @@ static void flush_tlb_func_remote(void *info) flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN); } -static bool tlb_is_not_lazy(int cpu, void *data) +static bool tlb_is_not_lazy(int cpu) { return !per_cpu(cpu_tlbstate.is_lazy, cpu); } +static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask); + void native_flush_tlb_others(const struct cpumask *cpumask, const struct flush_tlb_info *info) { @@ -706,12 +708,38 @@ void native_flush_tlb_others(const struct cpumask *cpumask, * up on the new contents of what used to be page tables, while * doing a speculative memory access.
*/ - if (info->freed_tables) + if (info->freed_tables) { smp_call_function_many(cpumask, flush_tlb_func_remote, (void *)info, 1); - else - on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func_remote, - (void *)info, 1, GFP_ATOMIC, cpumask); + } else { + /* +* Although we could have used on_each_cpu_cond_mask(), +* open-coding it has performance advantages, as it eliminates +* the need for indirect calls or retpolines. In addition, it +* allows to use a designated cpumask for evaluating the +* condition, instead of allocating one. +* +* This code works under the assumption that there are no nested +* TLB flushes, an assumption that is already made in +* flush_tlb_mm_range(). +* +* cond_cpumask is logically a stack-local variable, but it is +* more efficient to have it off the stack and not to allocate +* it on demand. Preemption is disabled and this code is +* non-reentrant. +*/ + struct cpumask *cond_cpumask = this_cpu_ptr(&flush_tlb_mask); + int cpu; + + cpumask_clear(cond_cpumask); + + for_each_cpu(cpu, cpumask) { + if (tlb_is_not_lazy(cpu)) + __cpumask_set_cpu(cpu, cond_cpumask); + } + smp_call_function_many(cond_cpumask, flush_tlb_func_remote, +(void *)info, 1); + } } /* @@ -865,7 +893,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch) if (cpumask_test_cpu(cpu, &batch->cpumask)) { lockdep_assert_irqs_enabled(); local_irq_disable(); - flush_tlb_func_local(&full_flush_tlb_info); + flush_tlb_func_local((void *)&full_flush_tlb_info); local_irq_enable(); } -- 2.20.1
[PATCH v3 8/9] x86/mm/tlb: Remove UV special case
SGI UV support is outdated and unmaintained, and it is not clear how it performs relative to non-UV systems. Remove this special case to simplify the code. Cc: Peter Zijlstra Cc: Dave Hansen Acked-by: Mike Travis Suggested-by: Andy Lutomirski Signed-off-by: Nadav Amit --- arch/x86/mm/tlb.c | 25 - 1 file changed, 25 deletions(-) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 89f83ad19507..40daad52ec7d 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -684,31 +684,6 @@ void native_flush_tlb_multi(const struct cpumask *cpumask, trace_tlb_flush(TLB_REMOTE_SEND_IPI, (info->end - info->start) >> PAGE_SHIFT); - if (is_uv_system()) { - /* -* This whole special case is confused. UV has a "Broadcast -* Assist Unit", which seems to be a fancy way to send IPIs. -* Back when x86 used an explicit TLB flush IPI, UV was -* optimized to use its own mechanism. These days, x86 uses -* smp_call_function_many(), but UV still uses a manual IPI, -* and that IPI's action is out of date -- it does a manual -* flush instead of calling flush_tlb_func_remote(). This -* means that the percpu tlb_gen variables won't be updated -* and we'll do pointless flushes on future context switches. -* -* Rather than hooking native_flush_tlb_multi() here, I think -* that UV should be updated so that smp_call_function_many(), -* etc, are optimal on UV. -*/ - flush_tlb_func_local((void *)info); - - cpumask = uv_flush_tlb_others(cpumask, info); - if (cpumask) - smp_call_function_many(cpumask, flush_tlb_func_remote, - (void *)info, 1); - return; - } - /* * If no page tables were freed, we can skip sending IPIs to * CPUs in lazy TLB mode. They will flush the CPU themselves -- 2.20.1
[PATCH v3 6/9] x86/mm/tlb: Do not make is_lazy dirty for no reason
Writing to is_lazy when the written value is identical to the old value needlessly makes the cacheline dirty. Avoid such writes to prevent unnecessary cache coherency traffic. Cc: Andy Lutomirski Cc: Peter Zijlstra Suggested-by: Dave Hansen Signed-off-by: Nadav Amit --- arch/x86/mm/tlb.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index af80c274c88d..89f83ad19507 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -322,7 +322,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, __flush_tlb_all(); } #endif - this_cpu_write(cpu_tlbstate_shared.is_lazy, false); + if (was_lazy) + this_cpu_write(cpu_tlbstate_shared.is_lazy, false); /* * The membarrier system call requires a full memory barrier and -- 2.20.1
[PATCH v3 7/9] cpumask: Mark functions as pure
cpumask_next_and() and cpumask_any_but() are pure, and marking them as such seems to generate different and presumably better code for native_flush_tlb_multi(). Signed-off-by: Nadav Amit --- include/linux/cpumask.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h index 0c7db5efe66c..6d57e6372b9d 100644 --- a/include/linux/cpumask.h +++ b/include/linux/cpumask.h @@ -211,7 +211,7 @@ static inline unsigned int cpumask_last(const struct cpumask *srcp) return find_last_bit(cpumask_bits(srcp), nr_cpumask_bits); } -unsigned int cpumask_next(int n, const struct cpumask *srcp); +unsigned int __pure cpumask_next(int n, const struct cpumask *srcp); /** * cpumask_next_zero - get the next unset cpu in a cpumask @@ -228,8 +228,8 @@ static inline unsigned int cpumask_next_zero(int n, const struct cpumask *srcp) return find_next_zero_bit(cpumask_bits(srcp), nr_cpumask_bits, n+1); } -int cpumask_next_and(int n, const struct cpumask *, const struct cpumask *); -int cpumask_any_but(const struct cpumask *mask, unsigned int cpu); +int __pure cpumask_next_and(int n, const struct cpumask *, const struct cpumask *); +int __pure cpumask_any_but(const struct cpumask *mask, unsigned int cpu); unsigned int cpumask_local_spread(unsigned int i, int node); /** -- 2.20.1
[PATCH v3 0/9] x86: Concurrent TLB flushes
[ Cover-letter is identical to v2, including benchmark results, excluding the change log. ]

Currently, local and remote TLB flushes are not performed concurrently, which introduces unnecessary overhead - each INVLPG can take 100s of cycles. This patch-set allows TLB flushes to be run concurrently: first request the remote CPUs to initiate the flush, then run it locally, and finally wait for the remote CPUs to finish their work.

In addition, there are various small optimizations to avoid unwarranted false-sharing and atomic operations.

The proposed changes should also improve the performance of other invocations of on_each_cpu(). Hopefully, no one has relied on the behavior of on_each_cpu() that invoked functions first remotely and only then locally [Peter says he remembers someone might do so, but without further information it is hard to know how to address it].

Running sysbench on dax/ext4 w/emulated-pmem, write-cache disabled, on a 2-socket, 48-logical-cores (24+SMT) Haswell-X, 5 repetitions:

sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
 --file-io-mode=mmap --threads=X --file-fsync-mode=fdatasync run

Th.  tip-jun28 avg (stdev)   +patch-set avg (stdev)   change
---  ---------------------   ----------------------   ------
 1   1267765 (14146)         1299253 (5715)           +2.4%
 2   1734644 (11936)         1799225 (19577)          +3.7%
 4   2821268 (41184)         2919132 (40149)          +3.4%
 8   4171652 (31243)         4376925 (65416)          +4.9%
16   5590729 (24160)         5829866 (8127)           +4.2%
24   6250212 (24481)         6522303 (28044)          +4.3%
32   3994314 (26606)         4077543 (10685)          +2.0%
48   4345177 (28091)         4417821 (41337)          +1.6%

(Note that on configurations with up to 24 threads numactl was used to pin all threads to socket 1, which explains the drop in performance when going to 32 threads.)

Running the same benchmark with security mitigations disabled (PTI, Spectre, MDS):

Th.  tip-jun28 avg (stdev)   +patch-set avg (stdev)   change
---  ---------------------   ----------------------   ------
 1   1598896 (5174)          1607903 (4091)           +0.5%
 2   2109472 (17827)         2224726 (4372)           +5.4%
 4   3448587 (11952)         3668551 (30219)          +6.3%
 8   5425778 (29641)         5606266 (33519)          +3.3%
16   6931232 (34677)         7054052 (27873)          +1.7%
24   7612473 (23482)         7783138 (13871)          +2.2%
32   4296274 (18029)         4283279 (32323)          -0.3%
48   4770029 (35541)         4764760 (13575)          -0.1%

Presumably, PTI requires two invalidations of each mapping, which allows concurrency to provide a higher benefit when PTI is on. At the same time, when mitigations are on, other overheads reduce the potential speedup. I tried to reduce the size of the code of the main patch, which required restructuring the series.

v2 -> v3:
* Open-code the remote/local-flush decision code [Andy]
* Fix hyper-v, Xen implementations [Andrew]
* Fix redundant TLB flushes.

v1 -> v2:
* Removing the patches that Thomas took [tglx]
* Adding hyper-v, Xen compile-tested implementations [Dave]
* Removing UV [Andy]
* Adding lazy optimization, removing inline keyword [Dave]
* Restructuring patch-set

RFCv2 -> v1:
* Fix comment on flush_tlb_multi [Juergen]
* Removing async invalidation optimizations [Andy]
* Adding KVM support [Paolo]

Cc: Andy Lutomirski Cc: Borislav Petkov Cc: Boris Ostrovsky Cc: Dave Hansen Cc: Haiyang Zhang Cc: Ingo Molnar Cc: Josh Poimboeuf Cc: Juergen Gross Cc: "K. Y.
Srinivasan" Cc: Paolo Bonzini Cc: Peter Zijlstra Cc: Rik van Riel Cc: Sasha Levin Cc: Stephen Hemminger Cc: Thomas Gleixner Cc: k...@vger.kernel.org Cc: linux-hyp...@vger.kernel.org Cc: linux-kernel@vger.kernel.org Cc: virtualizat...@lists.linux-foundation.org Cc: x...@kernel.org Cc: xen-de...@lists.xenproject.org Nadav Amit (9): smp: Run functions concurrently in smp_call_function_many() x86/mm/tlb: Remove reason as argument for flush_tlb_func_local() x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy() x86/mm/tlb: Flush remote and local TLBs concurrently x86/mm/tlb: Privatize cpu_tlbstate x86/mm/tlb: Do not make is_lazy dirty for no reason cpumask: Mark functions as pure x86/mm/tlb: Remove UV special case x86/mm/tlb: Remove unnecessary uses of the inline keyword arch/x86/hyperv/mmu.c | 10 +- arch/x86/include/asm/paravirt.h | 6 +- arch/x86/include/asm/paravirt_types.h | 4 +- arch/x86/include/asm/tlbflush.h | 47 - arch/x86/include/asm/trace/hyperv.h | 2 +- arch/x86/kernel/kvm.c | 11 ++- arch/x86/kernel/paravirt.c| 2 +- arch/x86/mm/init.c| 2 +- arch/x86/mm/tlb.c | 133 -- arch/x86/xen/mmu_pv.c | 11 +-- incl
[PATCH v3 1/9] smp: Run functions concurrently in smp_call_function_many()
Currently, on_each_cpu() and similar functions do not exploit the potential of concurrency: the function is first executed remotely, and only then executed locally. Functions such as TLB flush can take considerable time, so this provides an opportunity for performance optimization. To do so, introduce __smp_call_function_many(), which allows the callers to provide local and remote functions that should be executed, and runs them concurrently. Keep the semantics of smp_call_function_many() as they are today for backward compatibility: the called function is not executed locally in this case. __smp_call_function_many() does not use the optimized version for a single remote target that smp_call_function_single() implements. For synchronous function calls, smp_call_function_single() keeps a call_single_data (which is used for synchronization) on the stack. Interestingly, it seems that not using this optimization provides greater performance improvements (greater speedup with a single remote target than with multiple ones). Presumably, holding data structures that are intended for synchronization on the stack can introduce overheads due to TLB misses and false-sharing when the stack is used for other purposes. Adding support to run the functions concurrently required removing a micro-optimization in on_each_cpu() that disabled/enabled IRQs instead of saving/restoring them. The benefit of running the local and remote code concurrently is expected to be greater.
Cc: Peter Zijlstra Cc: Dave Hansen Cc: Rik van Riel Cc: Thomas Gleixner Cc: Andy Lutomirski Cc: Josh Poimboeuf Signed-off-by: Nadav Amit --- include/linux/smp.h | 27 ++--- kernel/smp.c| 133 +--- 2 files changed, 83 insertions(+), 77 deletions(-) diff --git a/include/linux/smp.h b/include/linux/smp.h index 6fc856c9eda5..c31c7cde0f77 100644 --- a/include/linux/smp.h +++ b/include/linux/smp.h @@ -32,11 +32,6 @@ extern unsigned int total_cpus; int smp_call_function_single(int cpuid, smp_call_func_t func, void *info, int wait); -/* - * Call a function on all processors - */ -void on_each_cpu(smp_call_func_t func, void *info, int wait); - /* * Call a function on processors specified by mask, which might include * the local one. @@ -44,6 +39,15 @@ void on_each_cpu(smp_call_func_t func, void *info, int wait); void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func, void *info, bool wait); +/* + * Call a function on all processors. May be used during early boot while + * early_boot_irqs_disabled is set. + */ +static inline void on_each_cpu(smp_call_func_t func, void *info, int wait) +{ + on_each_cpu_mask(cpu_online_mask, func, info, wait); +} + /* * Call a function on each processor for which the supplied function * cond_func returns a positive value. 
This may include the local @@ -102,8 +106,17 @@ extern void smp_cpus_done(unsigned int max_cpus); * Call a function on all other processors */ void smp_call_function(smp_call_func_t func, void *info, int wait); -void smp_call_function_many(const struct cpumask *mask, - smp_call_func_t func, void *info, bool wait); + +void __smp_call_function_many(const struct cpumask *mask, + smp_call_func_t remote_func, + smp_call_func_t local_func, + void *info, bool wait); + +static inline void smp_call_function_many(const struct cpumask *mask, + smp_call_func_t func, void *info, bool wait) +{ + __smp_call_function_many(mask, func, NULL, info, wait); +} int smp_call_function_any(const struct cpumask *mask, smp_call_func_t func, void *info, int wait); diff --git a/kernel/smp.c b/kernel/smp.c index 616d4d114847..d5d8dee8e3f3 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -388,9 +388,13 @@ int smp_call_function_any(const struct cpumask *mask, EXPORT_SYMBOL_GPL(smp_call_function_any); /** - * smp_call_function_many(): Run a function on a set of other CPUs. + * __smp_call_function_many(): Run a function on a set of CPUs. * @mask: The set of cpus to run on (only runs on online subset). - * @func: The function to run. This must be fast and non-blocking. + * @remote_func: The function to run on remote cores. This must be fast and + * non-blocking. + * @local_func: The function that should be run on this CPU. This must be + * fast and non-blocking. If NULL is provided, no function will + * be executed on this CPU. * @info: An arbitrary pointer to pass to the function. * @wait: If true, wait (atomically) until function has completed *on other CPUs. @@ -401,11 +405,16 @@ EXPORT_SYMBOL_GPL(smp_call_function_any); * hardware interrupt handler or from a bottom half handler. Preemption * must be disabled when calling this function. */ -void smp_call_function_many
Re: [x86/modules] f2c65fb322: will-it-scale.per_process_ops -2.9% regression
> On Jul 18, 2019, at 2:50 AM, kernel test robot wrote: > > Greeting, > > FYI, we noticed a -2.9% regression of will-it-scale.per_process_ops due to > commit: > > > commit: f2c65fb3221adc6b73b0549fc7ba892022db9797 ("x86/modules: Avoid > breaking W^X while loading modules") > https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux.git > master > > in testcase: will-it-scale > on test machine: 192 threads Intel(R) Xeon(R) CPU @ 2.20GHz with 192G memory > with following parameters: > > nr_task: 100% > mode: process > test: poll1 > cpufreq_governor: performance > > test-description: Will It Scale takes a testcase and runs it from 1 through > to n parallel copies to see if the testcase will scale. It builds both a > process and threads based test in order to see any differences between the > two. > test-url: > https://github.com/antonblanchard/will-it-scale I don’t understand how this patch has any impact on this workload. I ran it and set a function tracer for any function that is impacted by this patch: # cd /sys/kernel/debug/tracing # echo text_poke_early > set_ftrace_filter # echo module_alloc >> set_ftrace_filter # echo bpf_int_jit_compile >> set_ftrace_filter # tail -f trace Nothing came up. Can you please check if you see any of them invoked on your setup? Perhaps you have some bpf filters being installed, although even then this is a one-time (small) overhead for each process invocation.
Re: [PATCH v2] mm/balloon_compaction: avoid duplicate page removal
> On Jul 18, 2019, at 5:26 AM, Michael S. Tsirkin wrote: > > On Thu, Jul 18, 2019 at 05:27:20PM +0800, Wei Wang wrote: >> Fixes: 418a3ab1e778 (mm/balloon_compaction: List interfaces) >> >> A #GP is reported in the guest when requesting balloon inflation via >> virtio-balloon. The reason is that the virtio-balloon driver has >> removed the page from its internal page list (via balloon_page_pop), >> but balloon_page_enqueue_one also calls "list_del" to do the removal. >> This is necessary when it's used from balloon_page_enqueue_list, but >> not from balloon_page_enqueue_one. >> >> So remove the list_del balloon_page_enqueue_one, and update some >> comments as a reminder. >> >> Signed-off-by: Wei Wang > > > ok I posted v3 with typo fixes. 1/2 is this patch with comment changes. Pls > take a look. Thanks (Wei, Michael) for taking care of it. Please cc me on future iterations of the patch. Acked-by: Nadav Amit
Re: [PATCH 0/3] resource: find_next_iomem_res() improvements
> On Jul 16, 2019, at 3:20 PM, Dan Williams wrote: > > On Tue, Jul 16, 2019 at 3:13 PM Nadav Amit wrote: >>> On Jul 16, 2019, at 3:07 PM, Dan Williams wrote: >>> >>> On Tue, Jul 16, 2019 at 3:01 PM Andrew Morton >>> wrote: >>>> On Tue, 18 Jun 2019 21:56:43 + Nadav Amit wrote: >>>> >>>>>> ...and is constant for the life of the device and all subsequent >>>>>> mappings. >>>>>> >>>>>>> Perhaps you want to cache the cachability-mode in vma->vm_page_prot >>>>>>> (which I >>>>>>> see being done in quite a few cases), but I don’t know the code well >>>>>>> enough >>>>>>> to be certain that every vma should have a single protection and that it >>>>>>> should not change afterwards. >>>>>> >>>>>> No, I'm thinking this would naturally fit as a property hanging off a >>>>>> 'struct dax_device', and then create a version of vmf_insert_mixed() >>>>>> and vmf_insert_pfn_pmd() that bypass track_pfn_insert() to insert that >>>>>> saved value. >>>>> >>>>> Thanks for the detailed explanation. I’ll give it a try (the moment I find >>>>> some free time). I still think that patch 2/3 is beneficial, but based on >>>>> your feedback, patch 3/3 should be dropped. >>>> >>>> It has been a while. What should we do with >>>> >>>> resource-fix-locking-in-find_next_iomem_res.patch >>> >>> This one looks obviously correct to me, you can add: >>> >>> Reviewed-by: Dan Williams >>> >>>> resource-avoid-unnecessary-lookups-in-find_next_iomem_res.patch >>> >>> This one is a good bug report that we need to go fix pgprot lookups >>> for dax, but I don't think we need to increase the trickiness of the >>> core resource lookup code in the meantime. >> >> I think that traversing big parts of the tree that are known to be >> irrelevant is wasteful no matter what, and this code is used in other cases. >> >> I don’t think the new code is so tricky - can you point to the part of the >> code that you find tricky? 
> > Given dax can be updated to avoid this abuse of find_next_iomem_res(), > it was a general observation that the patch adds more lines than it > removes and is not strictly necessary. I'm ambivalent as to whether it > is worth pushing upstream. If anything the changelog is going to be > invalidated by a change to dax to avoid find_next_iomem_res(). Can you > update the changelog to be relevant outside of the dax case? Well, 8 lines are comments, 4 are empty lines, so it adds 3 lines of code according to my calculations. :) Having said that, if you think I might have made a mistake, or you are concerned with some bug I might have caused, please let me know. I understand that this logic might have been lying around for some time. I can update the commit log, emphasizing the redundant search operations as the motivation and then mentioning dax as an instance that incurs overheads due to this behavior, and say it should be handled regardless of this patch-set. Once I find time, I am going to deal with DAX, unless you beat me to it. Thanks, Nadav
Re: [PATCH 0/3] resource: find_next_iomem_res() improvements
> On Jul 16, 2019, at 3:07 PM, Dan Williams wrote: > > On Tue, Jul 16, 2019 at 3:01 PM Andrew Morton > wrote: >> On Tue, 18 Jun 2019 21:56:43 + Nadav Amit wrote: >> >>>> ...and is constant for the life of the device and all subsequent mappings. >>>> >>>>> Perhaps you want to cache the cachability-mode in vma->vm_page_prot >>>>> (which I >>>>> see being done in quite a few cases), but I don’t know the code well >>>>> enough >>>>> to be certain that every vma should have a single protection and that it >>>>> should not change afterwards. >>>> >>>> No, I'm thinking this would naturally fit as a property hanging off a >>>> 'struct dax_device', and then create a version of vmf_insert_mixed() >>>> and vmf_insert_pfn_pmd() that bypass track_pfn_insert() to insert that >>>> saved value. >>> >>> Thanks for the detailed explanation. I’ll give it a try (the moment I find >>> some free time). I still think that patch 2/3 is beneficial, but based on >>> your feedback, patch 3/3 should be dropped. >> >> It has been a while. What should we do with >> >> resource-fix-locking-in-find_next_iomem_res.patch > > This one looks obviously correct to me, you can add: > > Reviewed-by: Dan Williams > >> resource-avoid-unnecessary-lookups-in-find_next_iomem_res.patch > > This one is a good bug report that we need to go fix pgprot lookups > for dax, but I don't think we need to increase the trickiness of the > core resource lookup code in the meantime. I think that traversing big parts of the tree that are known to be irrelevant is wasteful no matter what, and this code is used in other cases. I don’t think the new code is so tricky - can you point to the part of the code that you find tricky?
Re: [PATCH 0/3] resource: find_next_iomem_res() improvements
> On Jul 16, 2019, at 3:00 PM, Andrew Morton wrote: > > On Tue, 18 Jun 2019 21:56:43 +0000 Nadav Amit wrote: > >>> ...and is constant for the life of the device and all subsequent mappings. >>> >>>> Perhaps you want to cache the cachability-mode in vma->vm_page_prot (which >>>> I >>>> see being done in quite a few cases), but I don’t know the code well enough >>>> to be certain that every vma should have a single protection and that it >>>> should not change afterwards. >>> >>> No, I'm thinking this would naturally fit as a property hanging off a >>> 'struct dax_device', and then create a version of vmf_insert_mixed() >>> and vmf_insert_pfn_pmd() that bypass track_pfn_insert() to insert that >>> saved value. >> >> Thanks for the detailed explanation. I’ll give it a try (the moment I find >> some free time). I still think that patch 2/3 is beneficial, but based on >> your feedback, patch 3/3 should be dropped. > > It has been a while. What should we do with > > resource-fix-locking-in-find_next_iomem_res.patch > resource-avoid-unnecessary-lookups-in-find_next_iomem_res.patch > > ? I didn’t get to follow Dan Williams’ advice. But both of the patches are fine in my opinion and should go upstream. The first one fixes a bug and the second one improves performance considerably (and removes most of the overhead). Future improvements can go on top of these patches and are not expected to conflict. So I think they should go upstream - the first one immediately, the second one when possible.
Re: [PATCH v2] x86/paravirt: Drop {read,write}_cr8() hooks
> On Jul 15, 2019, at 4:30 PM, Andrew Cooper wrote: > > On 15/07/2019 19:17, Nadav Amit wrote: >>> On Jul 15, 2019, at 8:16 AM, Andrew Cooper >>> wrote: >>> >>> There is a lot of infrastructure for functionality which is used >>> exclusively in __{save,restore}_processor_state() on the suspend/resume >>> path. >>> >>> cr8 is an alias of APIC_TASKPRI, and APIC_TASKPRI is saved/restored by >>> lapic_{suspend,resume}(). Saving and restoring cr8 independently of the >>> rest of the Local APIC state isn't a clever thing to be doing. >>> >>> Delete the suspend/resume cr8 handling, which shrinks the size of struct >>> saved_context, and allows for the removal of both PVOPS. >> I think removing the interface for CR8 writes is also good to avoid >> potential correctness issues, as the SDM says (10.8.6.1 "Interaction of Task >> Priorities between CR8 and APIC”): >> >> "Operating software should implement either direct APIC TPR updates or CR8 >> style TPR updates but not mix them. Software can use a serializing >> instruction (for example, CPUID) to serialize updates between MOV CR8 and >> stores to the APIC.” >> >> And native_write_cr8() did not even issue a serializing instruction. > > Given its location, the one write_cr8() is bounded by two serialising > operations, so is safe in practice. That’s what the “potential” in "potential correctness issues” means :)
Re: [PATCH] vmw_balloon: Remove Julien from the maintainers list
> On Jul 14, 2019, at 6:41 AM, Julien Freche wrote: > > > > On Jul 14, 2019, at 12:18 AM, Nadav Amit wrote: > >>> On Jul 13, 2019, at 7:49 AM, Greg Kroah-Hartman >>> wrote: >>> >>>> On Tue, Jul 02, 2019 at 03:05:19AM -0700, Nadav Amit wrote: >>>> Julien will not be a maintainer anymore. >>>> >>>> Signed-off-by: Nadav Amit >>>> --- >>>> MAINTAINERS | 1 - >>>> 1 file changed, 1 deletion(-) >>>> >>>> diff --git a/MAINTAINERS b/MAINTAINERS >>>> index 01a52fc964da..f85874b1e653 100644 >>>> --- a/MAINTAINERS >>>> +++ b/MAINTAINERS >>>> @@ -16886,7 +16886,6 @@ F:drivers/vme/ >>>> F:include/linux/vme* >>>> >>>> VMWARE BALLOON DRIVER >>>> -M:Julien Freche >>> >>> I need an ack/something from Julien please. >> >> Julien, can I please have your Acked-by? > > The change looks good to me, Nadav. Thanks for updating the maintainer list. Thanks, Julien. Greg, do you want a v2 with an Acked-by? [ Note that Julien's email address will be different from the one in the MAINTAINERS list ]
Re: [PATCH v2] x86/paravirt: Drop {read,write}_cr8() hooks
> On Jul 15, 2019, at 8:16 AM, Andrew Cooper wrote: > > There is a lot of infrastructure for functionality which is used > exclusively in __{save,restore}_processor_state() on the suspend/resume > path. > > cr8 is an alias of APIC_TASKPRI, and APIC_TASKPRI is saved/restored by > lapic_{suspend,resume}(). Saving and restoring cr8 independently of the > rest of the Local APIC state isn't a clever thing to be doing. > > Delete the suspend/resume cr8 handling, which shrinks the size of struct > saved_context, and allows for the removal of both PVOPS. I think removing the interface for CR8 writes is also good to avoid potential correctness issues, as the SDM says (10.8.6.1 "Interaction of Task Priorities between CR8 and APIC”): "Operating software should implement either direct APIC TPR updates or CR8 style TPR updates but not mix them. Software can use a serializing instruction (for example, CPUID) to serialize updates between MOV CR8 and stores to the APIC.” And native_write_cr8() did not even issue a serializing instruction.
Re: [PATCH] x86/apic: Initialize TPR to block interrupts 16-31
> On Jul 14, 2019, at 8:23 AM, Andy Lutomirski wrote: > > The APIC, per spec, is fundamentally confused and thinks that > interrupt vectors 16-31 are valid. This makes no sense -- the CPU > reserves vectors 0-31 for exceptions (faults, traps, etc). > Obviously, no device should actually produce an interrupt with > vector 16-31, but we can improve robustness by setting the APIC TPR > class to 1, which will prevent delivery of an interrupt with a > vector below 32. > > Note: this is *not* intended as a security measure against attackers > who control malicious hardware. Any PCI or similar hardware that > can be controlled by an attacker MUST be behind a functional IOMMU > that remaps interrupts. The purpose of this patch is to reduce the > chance that a certain class of device malfunctions crashes the > kernel in hard-to-debug ways. > > Cc: Nadav Amit > Cc: Stephane Eranian > Cc: Feng Tang > Suggested-by: Andrew Cooper > Signed-off-by: Andy Lutomirski > --- > arch/x86/kernel/apic/apic.c | 7 +-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c > index 177aa8ef2afa..ff31322f8839 100644 > --- a/arch/x86/kernel/apic/apic.c > +++ b/arch/x86/kernel/apic/apic.c > @@ -1531,11 +1531,14 @@ static void setup_local_APIC(void) > #endif > > /* > - * Set Task Priority to 'accept all'. We never change this > - * later on. > + * Set Task Priority to 'accept all except vectors 0-31'. An APIC > + * vector in the 16-31 range could be delivered if TPR == 0, but we > + * would think it's an exception and terrible things will happen. We > + * never change this later on. >*/ > value = apic_read(APIC_TASKPRI); > value &= ~APIC_TPRI_MASK; > + value |= 0x10; > apic_write(APIC_TASKPRI, value); > > apic_pending_intr_clear(); It looks fine, and indeed it seems that writes to APIC_TASKPRI and CR8 are not overwriting this value. 
Yet, the fact that no warning is produced if someone overwrites the TPR with zero (or fails to restore it) is not that great. Not that I know what the right course of action is (adding checks in write_cr8()? But then what happens if APIC_TASKPRI is not restored after some APIC reset?)
Re: [PATCH] vmw_balloon: Remove Julien from the maintainers list
> On Jul 13, 2019, at 7:49 AM, Greg Kroah-Hartman > wrote: > > On Tue, Jul 02, 2019 at 03:05:19AM -0700, Nadav Amit wrote: >> Julien will not be a maintainer anymore. >> >> Signed-off-by: Nadav Amit >> --- >> MAINTAINERS | 1 - >> 1 file changed, 1 deletion(-) >> >> diff --git a/MAINTAINERS b/MAINTAINERS >> index 01a52fc964da..f85874b1e653 100644 >> --- a/MAINTAINERS >> +++ b/MAINTAINERS >> @@ -16886,7 +16886,6 @@ F: drivers/vme/ >> F: include/linux/vme* >> >> VMWARE BALLOON DRIVER >> -M: Julien Freche > > I need an ack/something from Julien please. Julien, can I please have your Acked-by?
Re: [GIT PULL] x86/topology changes for v5.3
> On Jul 11, 2019, at 8:08 AM, Kees Cook wrote: > > On Thu, Jul 11, 2019 at 10:01:34AM +0200, Peter Zijlstra wrote: >> On Thu, Jul 11, 2019 at 07:11:19AM +0000, Nadav Amit wrote: >>>> On Jul 10, 2019, at 7:22 AM, Jiri Kosina wrote: >>>> >>>> On Wed, 10 Jul 2019, Peter Zijlstra wrote: >>>> >>>>> If we mark the key as RO after init, and then try and modify the key to >>>>> link module usage sites, things might go bang as described. >>>>> >>>>> Thanks! >>>>> >>>>> >>>>> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c >>>>> index 27d7864e7252..5bf7a8354da2 100644 >>>>> --- a/arch/x86/kernel/cpu/common.c >>>>> +++ b/arch/x86/kernel/cpu/common.c >>>>> @@ -366,7 +366,7 @@ static __always_inline void setup_umip(struct >>>>> cpuinfo_x86 *c) >>>>> cr4_clear_bits(X86_CR4_UMIP); >>>>> } >>>>> >>>>> -DEFINE_STATIC_KEY_FALSE_RO(cr_pinning); >>>>> +DEFINE_STATIC_KEY_FALSE(cr_pinning); >>>> >>>> Good catch, I guess that is going to fix it. >>>> >>>> At the same time though, it sort of destroys the original intent of Kees' >>>> patch, right? The exploits will just have to call static_key_disable() >>>> prior to calling native_write_cr4() again, and the protection is gone. >>> >>> Even with DEFINE_STATIC_KEY_FALSE_RO(), I presume you can just call >>> set_memory_rw(), make the page that holds the key writable, and then call >>> static_key_disable(), followed by a call to native_write_cr4(). >> >> Or call text_poke_bp() with the right set of arguments. > > Right -- the point is to make it defended against an arbitrary write, > not arbitrary execution. Nothing is safe from arbitrary exec, but we can > do our due diligence on making things read-only. I don’t understand. In the PoC that motivated this patch [1], the attacker gained the ability to call a function, control its first argument and used it to disable SMEP/SMAP by calling native_write_cr4(), which he did before doing an arbitrary write (another ability he gained).
After this patch, the attacker can instead call three functions and, by controlling their first arguments (*), disable SMEP. I didn’t see anything in the motivating PoC that prevented calling three functions one at a time. So it seems to me that it raised the bar for an attack by very little. -- (*) set_memory_rw() has a second argument - the number of pages - but many non-zero values may be fine (if not, a warning from __cpa_process_fault() might appear). [1] https://googleprojectzero.blogspot.com/2017/05/exploiting-linux-kernel-via-packet.html
Re: [GIT PULL] x86/topology changes for v5.3
> On Jul 10, 2019, at 7:22 AM, Jiri Kosina wrote: > > On Wed, 10 Jul 2019, Peter Zijlstra wrote: > >> If we mark the key as RO after init, and then try and modify the key to >> link module usage sites, things might go bang as described. >> >> Thanks! >> >> >> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c >> index 27d7864e7252..5bf7a8354da2 100644 >> --- a/arch/x86/kernel/cpu/common.c >> +++ b/arch/x86/kernel/cpu/common.c >> @@ -366,7 +366,7 @@ static __always_inline void setup_umip(struct >> cpuinfo_x86 *c) >> cr4_clear_bits(X86_CR4_UMIP); >> } >> >> -DEFINE_STATIC_KEY_FALSE_RO(cr_pinning); >> +DEFINE_STATIC_KEY_FALSE(cr_pinning); > > Good catch, I guess that is going to fix it. > > At the same time though, it sort of destroys the original intent of Kees' > patch, right? The exploits will just have to call static_key_disable() > prior to calling native_write_cr4() again, and the protection is gone. Even with DEFINE_STATIC_KEY_FALSE_RO(), I presume you can just call set_memory_rw(), make the page that holds the key writable, and then call static_key_disable(), followed by a call to native_write_cr4().
Re: [PATCH v2 8/9] x86/mm/tlb: Remove UV special case
> On Jul 9, 2019, at 1:29 PM, Mike Travis wrote: > > > > On 7/9/2019 1:09 PM, Russ Anderson wrote: >> On Tue, Jul 09, 2019 at 09:50:27PM +0200, Thomas Gleixner wrote: >>> On Tue, 2 Jul 2019, Nadav Amit wrote: >>> >>>> SGI UV support is outdated and not maintained, and it is not clear how >>>> it performs relatively to non-UV. Remove the code to simplify the code. >>> >>> You should at least Cc the SGI/HP folks on that. They are still >>> around. Done so. >> Thanks Thomas. The SGI UV is now HPE Superdome Flex and is >> very much still supported. >> Thanks. >>>> Cc: Peter Zijlstra >>>> Cc: Dave Hansen >>>> Suggested-by: Andy Lutomirski >>>> Signed-off-by: Nadav Amit >>>> --- >>>> arch/x86/mm/tlb.c | 25 - >>>> 1 file changed, 25 deletions(-) >>>> >>>> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c >>>> index b47a71820f35..64afe1215495 100644 >>>> --- a/arch/x86/mm/tlb.c >>>> +++ b/arch/x86/mm/tlb.c >>>> @@ -689,31 +689,6 @@ void native_flush_tlb_multi(const struct cpumask >>>> *cpumask, >>>>trace_tlb_flush(TLB_REMOTE_SEND_IPI, >>>>(info->end - info->start) >> PAGE_SHIFT); >>>> - if (is_uv_system()) { >>>> - /* >>>> - * This whole special case is confused. UV has a "Broadcast >>>> - * Assist Unit", which seems to be a fancy way to send IPIs. >>>> - * Back when x86 used an explicit TLB flush IPI, UV was >>>> - * optimized to use its own mechanism. These days, x86 uses >>>> - * smp_call_function_many(), but UV still uses a manual IPI, >>>> - * and that IPI's action is out of date -- it does a manual >>>> - * flush instead of calling flush_tlb_func_remote(). This >>>> - * means that the percpu tlb_gen variables won't be updated >>>> - * and we'll do pointless flushes on future context switches. >>>> - * >>>> - * Rather than hooking native_flush_tlb_multi() here, I think >>>> - * that UV should be updated so that smp_call_function_many(), >>>> - * etc, are optimal on UV. >>>> - */ > > I thought this change was already proposed a bit ago and we acked it > awhile back. 
Also the replacement functionality is being worked on but it > is more complex. The smp call many has to support all the reasons why it’s > called and not just the tlb shoot downs as is the current BAU case. Sorry for not cc’ing you before. In the meantime, can you give an explicit Acked-by? (I couldn’t find the previous patch you referred to.) Thanks, Nadav
Re: linux-next: manual merge of the char-misc tree with the driver-core tree
> On Jul 8, 2019, at 4:20 PM, Stephen Rothwell wrote: > > Hi all, > > On Thu, 13 Jun 2019 15:53:44 +1000 Stephen Rothwell > wrote: >> Today's linux-next merge of the char-misc tree got a conflict in: >> >> drivers/misc/vmw_balloon.c >> >> between commit: >> >> 225afca60b8a ("vmw_balloon: no need to check return value of debugfs_create >> functions") >> >> from the driver-core tree and commits: >> >> 83a8afa72e9c ("vmw_balloon: Compaction support") >> 5d1a86ecf328 ("vmw_balloon: Add memory shrinker") >> >> from the char-misc tree. >> >> I fixed it up (see below) and can carry the fix as necessary. This >> is now fixed as far as linux-next is concerned, but any non trivial >> conflicts should be mentioned to your upstream maintainer when your tree >> is submitted for merging. You may also want to consider cooperating >> with the maintainer of the conflicting tree to minimise any particularly >> complex conflicts. >> >> -- >> Cheers, >> Stephen Rothwell >> >> diff --cc drivers/misc/vmw_balloon.c >> index fdf5ad757226,043eed845246.. >> --- a/drivers/misc/vmw_balloon.c >> +++ b/drivers/misc/vmw_balloon.c >> @@@ -1553,15 -1942,26 +1932,24 @@@ static int __init vmballoon_init(void >> if (x86_hyper_type != X86_HYPER_VMWARE) >> return -ENODEV; >> >> -for (page_size = VMW_BALLOON_4K_PAGE; >> - page_size <= VMW_BALLOON_LAST_SIZE; page_size++) >> -INIT_LIST_HEAD(_sizes[page_size].pages); >> - >> - >> INIT_DELAYED_WORK(, vmballoon_work); >> >> +error = vmballoon_register_shrinker(); >> +if (error) >> +goto fail; >> + >> -error = vmballoon_debugfs_init(); >> -if (error) >> -goto fail; >> +vmballoon_debugfs_init(); >> >> +/* >> + * Initialization of compaction must be done after the call to >> + * balloon_devinfo_init() . 
>> + */ >> +balloon_devinfo_init(_dev_info); >> +error = vmballoon_compaction_init(); >> +if (error) >> +goto fail; >> + >> +INIT_LIST_HEAD(_pages); >> spin_lock_init(_lock); >> init_rwsem(_sem); >> balloon.vmci_doorbell = VMCI_INVALID_HANDLE; > > I am still getting this conflict (the commit ids may have changed). > Just a reminder in case you think Linus may need to know. Greg, Can you please take care of it, as you are the maintainer of char-misc and the committer of the patch?
Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust
> On Jul 5, 2019, at 8:47 AM, Andrew Cooper wrote: > > On 04/07/2019 16:51, Thomas Gleixner wrote: >> 2) The loop termination logic is interesting at best. >> >> If the machine has no TSC or cpu_khz is not known yet it tries 1 >> million times to ack stale IRR/ISR bits. What? >> >> With TSC it uses the TSC to calculate the loop termination. It takes a >> timestamp at entry and terminates the loop when: >> >>(rdtsc() - start_timestamp) >= (cpu_hkz << 10) >> >> That's roughly one second. >> >> Both methods are problematic. The APIC has 256 vectors, which means >> that in theory max. 256 IRR/ISR bits can be set. In practice this is >> impossible as the first 32 vectors are reserved and not affected and >> the chance that more than a few bits are set is close to zero. > > [Disclaimer. I talked to Thomas in private first, and he asked me to > post this publicly as the CVE is almost a decade old already.] > > I'm afraid that this isn't quite true. > > In terms of IDT vectors, the first 32 are reserved for exceptions, but > only the first 16 are reserved in the LAPIC. Vectors 16-31 are fair > game for incoming IPIs (SDM Vol3, 10.5.2 Valid Interrupt Vectors). > > In practice, this makes Linux vulnerable to CVE-2011-1898 / XSA-3, which > I'm disappointed to see wasn't shared with other software vendors at the > time. IIRC (and from skimming the CVE again) the basic problem in Xen was that MSIs can be used when devices are assigned to generate IRQs with arbitrary vectors. The mitigation was to require interrupt remapping to be enabled in the IOMMU when IOMMU is used for DMA remapping (i.e., device assignment). Are you concerned about this case, additional concrete ones, or is it about security hardening? (or am I missing something?)
Re: [PATCH] KVM: LAPIC: ARBPRI is a reserved register for x2APIC
> On Jul 5, 2019, at 6:43 AM, Paolo Bonzini wrote: > > On 05/07/19 15:37, Nadav Amit wrote: >>> On Jul 5, 2019, at 5:14 AM, Paolo Bonzini wrote: >>> >>> kvm-unit-tests were adjusted to match bare metal behavior, but KVM >>> itself was not doing what bare metal does; fix that. >>> >>> Signed-off-by: Paolo Bonzini >> >> Reported-by ? > > I found it myself while running the tests, was there a report too? Perhaps it is in my mind. I thought that fixing a test is equivalent to a bug report.
Re: [PATCH] KVM: LAPIC: ARBPRI is a reserved register for x2APIC
> On Jul 5, 2019, at 5:14 AM, Paolo Bonzini wrote: > > kvm-unit-tests were adjusted to match bare metal behavior, but KVM > itself was not doing what bare metal does; fix that. > > Signed-off-by: Paolo Bonzini Reported-by ?
Re: [patch V2 21/25] x86/smp: Enhance native_send_call_func_ipi()
> On Jul 4, 2019, at 8:52 AM, Thomas Gleixner wrote: > > Nadav noticed that the cpumask allocations in native_send_call_func_ipi() > are noticeable in microbenchmarks. > > Use the new cpumask_or_equal() function to simplify the decision whether > the supplied target CPU mask is either equal to cpu_online_mask or equal to > cpu_online_mask except for the CPU on which the function is invoked. > > cpumask_or_equal() or's the target mask and the cpumask of the current CPU > together and compares it to cpu_online_mask. > > If the result is false, use the mask based IPI function, otherwise check > whether the current CPU is set in the target mask and invoke either the > send_IPI_all() or the send_IPI_allbutselt() APIC callback. > > Make the shorthand decision also depend on the static key which enables > shorthand mode. That allows to remove the extra cpumask comparison with > cpu_callout_mask. > > Reported-by: Nadav Amit > Signed-off-by: Thomas Gleixner > --- > V2: New patch > --- > arch/x86/kernel/apic/ipi.c | 24 +++- > 1 file changed, 11 insertions(+), 13 deletions(-) > > --- a/arch/x86/kernel/apic/ipi.c > +++ b/arch/x86/kernel/apic/ipi.c > @@ -83,23 +83,21 @@ void native_send_call_func_single_ipi(in > > void native_send_call_func_ipi(const struct cpumask *mask) > { > - cpumask_var_t allbutself; > + if (static_branch_likely(_use_ipi_shorthand)) { > + unsigned int cpu = smp_processor_id(); > > - if (!alloc_cpumask_var(, GFP_ATOMIC)) { > - apic->send_IPI_mask(mask, CALL_FUNCTION_VECTOR); > + if (!cpumask_or_equal(mask, cpumask_of(cpu), cpu_online_mask)) > + goto sendmask; > + > + if (cpumask_test_cpu(cpu, mask)) > + apic->send_IPI_all(CALL_FUNCTION_VECTOR); > + else if (num_online_cpus() > 1) > + apic->send_IPI_allbutself(CALL_FUNCTION_VECTOR); > return; > } > > - cpumask_copy(allbutself, cpu_online_mask); > - cpumask_clear_cpu(smp_processor_id(), allbutself); > - > - if (cpumask_equal(mask, allbutself) && > - cpumask_equal(cpu_online_mask, cpu_callout_mask)) > - 
apic->send_IPI_allbutself(CALL_FUNCTION_VECTOR); > - else > - apic->send_IPI_mask(mask, CALL_FUNCTION_VECTOR); > - > - free_cpumask_var(allbutself); > +sendmask: > + apic->send_IPI_mask(mask, CALL_FUNCTION_VECTOR); > } > > #endif /* CONFIG_SMP */ It does look better and simpler than my solution.