Re: [RFC PATCH v2 3/3] x86/mm/tlb: Avoid deferring PTI flushes on shootdown

2019-08-27 Thread Nadav Amit
> On Aug 27, 2019, at 4:07 PM, Andy Lutomirski  wrote:
> 
> On Fri, Aug 23, 2019 at 11:13 PM Nadav Amit  wrote:
>> When a shootdown is initiated, the initiating CPU has cycles to burn as
>> it waits for the responding CPUs to receive the IPI and acknowledge it.
>> In these cycles it is better to flush the user page-tables using
>> INVPCID, instead of deferring the TLB flush.
>> 
>> The best way to figure out whether there are cycles to burn is arguably
>> to expose from the SMP layer when an acknowledgment is received.
>> However, this would break some abstractions.
>> 
>> Instead, use a simpler solution: the initiating CPU of a TLB shootdown
>> would not defer PTI flushes. It is not always a win, relative to
>> deferring user page-table flushes, but it prevents a performance
>> regression.
>> 
>> Signed-off-by: Nadav Amit 
>> ---
>> arch/x86/include/asm/tlbflush.h |  1 +
>> arch/x86/mm/tlb.c   | 10 +-
>> 2 files changed, 10 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/x86/include/asm/tlbflush.h 
>> b/arch/x86/include/asm/tlbflush.h
>> index da56aa3ccd07..066b3804f876 100644
>> --- a/arch/x86/include/asm/tlbflush.h
>> +++ b/arch/x86/include/asm/tlbflush.h
>> @@ -573,6 +573,7 @@ struct flush_tlb_info {
>>unsigned intinitiating_cpu;
>>u8  stride_shift;
>>u8  freed_tables;
>> +   u8  shootdown;
> 
> I find the name "shootdown" to be confusing.  How about "more_than_one_cpu"?

I think the current semantic is more of “includes remote cpus”. How about
calling it “local_only”, and negating its value?

Re: [RFC PATCH v2 2/3] x86/mm/tlb: Defer PTI flushes

2019-08-27 Thread Nadav Amit
> On Aug 27, 2019, at 4:13 PM, Andy Lutomirski  wrote:
> 
> On Fri, Aug 23, 2019 at 11:13 PM Nadav Amit  wrote:
>> INVPCID is considerably slower than INVLPG of a single PTE. Using it to
>> flush the user page-tables when PTI is enabled therefore introduces
>> significant overhead.
>> 
>> Instead, unless page-tables are released, it is possible to defer the
>> flushing of the user page-tables until the time the code returns to
>> userspace. These page-tables are not in use, so deferring their flushes is
>> not a security hazard.
> 
> I agree and, in fact, I argued against ever using INVPCID in the
> original PTI code.
> 
> However, I don't see what freeing page tables has to do with this.  If
> the CPU can actually do speculative page walks based on the contents
> of non-current-PCID TLB entries, then we have major problems, since we
> don't actively flush the TLB for non-running mms at all.

That was not my concern.

> 
> I suppose that, if we free a page table, then we can't activate the
> PCID by writing to CR3 before flushing things.  But we can still defer
> the flush and just set the flush bit when we write to CR3.

This was my concern. I can change the behavior so the code would flush the
whole TLB instead. I just tried not to change the existing behavior too
much.



Re: [RFC PATCH 0/3] x86/mm/tlb: Defer TLB flushes with PTI

2019-08-27 Thread Nadav Amit
> On Aug 27, 2019, at 4:18 PM, Andy Lutomirski  wrote:
> 
> On Fri, Aug 23, 2019 at 11:07 PM Nadav Amit  wrote:
>> INVPCID is considerably slower than INVLPG of a single PTE, but it is
>> currently used to flush PTEs in the user page-table when PTI is used.
>> 
>> Instead, it is possible to defer TLB flushes until after the user
>> page-tables are loaded. Preventing speculation over the TLB flushes
>> should keep the whole thing safe. In some cases, deferring TLB flushes
>> in such a way can result in more full TLB flushes, but arguably this
>> behavior is oftentimes beneficial.
> 
> I have a somewhat horrible suggestion.
> 
> Would it make sense to refactor this so that it works for user *and*
> kernel tables?  In particular, if we flush a *kernel* mapping (vfree,
> vunmap, set_memory_ro, etc), we shouldn't need to send an IPI to a
> task that is running user code to flush most kernel mappings or even
> to free kernel pagetables.  The same trick could be done if we treat
> idle like user mode for this purpose.
> 
> In code, this could mostly consist of changing all the "user" data
> structures involved to something like struct deferred_flush_info and
> having one for user and one for kernel.
> 
> I think this is horrible because it will enable certain workloads to
> work considerably faster with PTI on than with PTI off, and that would
> be a barely excusable moral failing. :-p
> 
> For what it's worth, other than register clobber issues, the whole
> "switch CR3 for PTI" logic ought to be doable in C.  I don't know a
> priori whether that would end up being an improvement.

I implemented (and have not yet sent) another TLB deferring mechanism. It is
intended for user mappings and not kernel ones, but I think your suggestion
shares a similar underlying rationale, and therefore similar challenges and
solutions. Let me rephrase what you say to ensure we are on the same page.

The basic idea is to use context tracking to check whether each CPU is in
kernel or user mode. Accordingly, TLB flushes can be deferred, but I don’t see
that this solution is limited to PTI. There are two possible reasons, according
to my understanding, why you limit the discussion to PTI:

1. PTI provides clear boundaries when user and kernel mappings are used. I
am not sure that privilege-levels (and SMAP) do not do the same.

2. CR3 switching already imposes a memory barrier, which eliminates most of
the cost of implementing such a scheme, which requires something similar to:

	write new context (kernel/user)
	mb();
	if (need_flush)
		flush;

I do agree that PTI addresses (2), but there is another problem. A
reasonable implementation would store in per-cpu state whether each CPU is
in user/kernel mode, and the TLB shootdown initiator CPU would check that
state to decide whether an IPI is needed. This means that pretty much every
TLB shootdown would incur a cache miss per target CPU. This might cause
performance regressions, at least in some cases.
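
For concreteness, the initiator-side check I have in mind would look roughly
like the sketch below (all names here are made up; in_kernel_mode would have
to be maintained on every kernel entry and exit):

	/* Hypothetical per-CPU flag, updated on kernel entry/exit. */
	DEFINE_PER_CPU(bool, in_kernel_mode);

	/*
	 * Initiator side of a shootdown: read each target's published state
	 * and skip the IPI when the target is in the kernel, so its flush
	 * can be deferred to the next return to userspace.  The READ_ONCE()
	 * is the per-target cache miss mentioned above.
	 */
	static void shootdown_select_targets(struct cpumask *ipi_mask,
					     const struct cpumask *targets)
	{
		int cpu;

		cpumask_copy(ipi_mask, targets);

		for_each_cpu(cpu, targets) {
			if (READ_ONCE(per_cpu(in_kernel_mode, cpu)))
				cpumask_clear_cpu(cpu, ipi_mask);
		}
	}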

Admittedly, I did implement something similar (not sent) for user mappings:
defer all TLB flushes and shootdowns if the CPUs are known to be in kernel
mode. But I limited myself to certain cases, specifically “long” syscalls
that are already likely to cause a TLB flush (e.g., msync()). I am not sure
that tracking each CPU entry/exit would be a good idea.

I will give some more thought to kernel mapping invalidations, which I
did not think about enough. I tried to send what I considered “saner” and
cleaner patches first. I still have the patches I mentioned here, the
async-flushes, and another patch that avoids the local TLB flush on CoW and
instead accesses the data.

Re: [RFC PATCH v2 2/3] x86/mm/tlb: Defer PTI flushes

2019-08-27 Thread Nadav Amit
> On Aug 27, 2019, at 11:28 AM, Dave Hansen  wrote:
> 
> On 8/23/19 3:52 PM, Nadav Amit wrote:
>> INVPCID is considerably slower than INVLPG of a single PTE. Using it to
>> flush the user page-tables when PTI is enabled therefore introduces
>> significant overhead.
> 
> I'm not sure this is worth all the churn, especially in the entry code.
> For large flushes (> tlb_single_page_flush_ceiling), we don't do
> INVPCIDs in the first place.

It is possible to jump from flush_tlb_func() into the trampoline page,
instead of flushing the TLB in the entry code. However, it induces higher
overhead (switching CR3s), so it will only be useful if multiple TLB entries
are flushed at once. It also prevents exploiting opportunities to promote
individual entry flushes into a full TLB flush when multiple flushes are
issued or when a context switch takes place before returning to userspace.

There are cases/workloads that flush multiple (but not too many) TLB entries
on every syscall, for instance those issuing msync() or running the Apache webserver.
So I am not sure that tlb_single_page_flush_ceiling saves the day. Besides,
you may want to recalibrate (lower) tlb_single_page_flush_ceiling when PTI
is used.

> I'd really want to understand what the heck is going on that makes
> INVPCID so slow, first.

INVPCID-single is slow (even more than the 133 cycles slower than INVLPG that
you mentioned; I don’t have the numbers in front of me). I thought this was a
known fact, although, obviously, it does not make much sense.



Re: [PATCH] vmw_balloon: Fix offline page marking with compaction

2019-08-26 Thread Nadav Amit
> On Aug 21, 2019, at 1:30 AM, David Hildenbrand  wrote:
> 
> On 20.08.19 18:01, Nadav Amit wrote:
>> The compaction code already marks pages as offline when it enqueues
>> pages in the ballooned page list, and removes the mapping when the pages
>> are removed from the list. VMware balloon also updates the flags,
>> instead of letting the balloon-compaction logic handle it, which causes
>> the assertion VM_BUG_ON_PAGE(!PageOffline(page)) to fire, when
>> __ClearPageOffline is called the second time. This causes the following
>> crash.
>> 
>> [  487.104520] kernel BUG at include/linux/page-flags.h:749!
>> [  487.106364] invalid opcode:  [#1] SMP DEBUG_PAGEALLOC PTI
>> [  487.107681] CPU: 7 PID: 1106 Comm: kworker/7:3 Not tainted 
>> 5.3.0-rc5balloon #227
>> [  487.109196] Hardware name: VMware, Inc. VMware Virtual Platform/440BX 
>> Desktop Reference Platform, BIOS 6.00 12/12/2018
>> [  487.111452] Workqueue: events_freezable vmballoon_work [vmw_balloon]
>> [  487.112779] RIP: 0010:vmballoon_release_page_list+0xaa/0x100 [vmw_balloon]
>> [  487.114200] Code: fe 48 c1 e7 06 4c 01 c7 8b 47 30 41 89 c1 41 81 e1 00 
>> 01 00 f0 41 81 f9 00 00 00 f0 74 d3 48 c7 c6 08 a1 a1 c0 e8 06 0d e7 ea <0f> 
>> 0b 44 89 f6 4c 89 c7 e8 49 9c e9 ea 49 8d 75 08 49 8b 45 08 4d
>> [  487.118033] RSP: 0018:b82f012bbc98 EFLAGS: 00010246
>> [  487.119135] RAX: 0037 RBX: 0001 RCX: 
>> 0006
>> [  487.120601] RDX:  RSI:  RDI: 
>> 9a85b6bd7620
>> [  487.122071] RBP: b82f012bbcc0 R08: 0001 R09: 
>> 
>> [  487.123536] R10:  R11:  R12: 
>> b82f012bbd00
>> [  487.125002] R13: e97f4598d9c0 R14:  R15: 
>> b82f012bbd34
>> [  487.126463] FS:  () GS:9a85b6bc() 
>> knlGS:
>> [  487.128110] CS:  0010 DS:  ES:  CR0: 80050033
>> [  487.129316] CR2: 7ffe6e413ea0 CR3: 000230b18001 CR4: 
>> 003606e0
>> [  487.130812] DR0:  DR1:  DR2: 
>> 
>> [  487.132283] DR3:  DR6: fffe0ff0 DR7: 
>> 0400
>> [  487.133749] Call Trace:
>> [  487.134333]  vmballoon_deflate+0x22c/0x390 [vmw_balloon]
>> [  487.135468]  vmballoon_work+0x6e7/0x913 [vmw_balloon]
>> [  487.136711]  ? process_one_work+0x21a/0x5e0
>> [  487.138581]  process_one_work+0x298/0x5e0
>> [  487.139926]  ? vmballoon_migratepage+0x310/0x310 [vmw_balloon]
>> [  487.141610]  ? process_one_work+0x298/0x5e0
>> [  487.143053]  worker_thread+0x41/0x400
>> [  487.144389]  kthread+0x12b/0x150
>> [  487.145582]  ? process_one_work+0x5e0/0x5e0
>> [  487.146937]  ? kthread_create_on_node+0x60/0x60
>> [  487.148637]  ret_from_fork+0x3a/0x50
>> 
>> Fix it by updating the PageOffline indication only when a 2MB page is
>> enqueued and dequeued. The 4KB pages will be handled correctly by the
>> balloon compaction logic.
>> 
>> Fixes: 83a8afa72e9c ("vmw_balloon: Compaction support")
>> Cc: David Hildenbrand 
>> Reported-by: Thomas Hellstrom 
>> Signed-off-by: Nadav Amit 
>> ---
>> drivers/misc/vmw_balloon.c | 10 --
>> 1 file changed, 8 insertions(+), 2 deletions(-)
>> 
>> diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c
>> index 8840299420e0..5e6be1527571 100644
>> --- a/drivers/misc/vmw_balloon.c
>> +++ b/drivers/misc/vmw_balloon.c
>> @@ -691,7 +691,6 @@ static int vmballoon_alloc_page_list(struct vmballoon *b,
>>  }
>> 
>>  if (page) {
>> -vmballoon_mark_page_offline(page, ctl->page_size);
>>  /* Success. Add the page to the list and continue. */
>>  list_add(&page->lru, &ctl->pages);
>>  continue;
>> @@ -930,7 +929,6 @@ static void vmballoon_release_page_list(struct list_head 
>> *page_list,
>> 
>>  list_for_each_entry_safe(page, tmp, page_list, lru) {
>>  list_del(&page->lru);
>> -vmballoon_mark_page_online(page, page_size);
>>  __free_pages(page, vmballoon_page_order(page_size));
>>  }
>> 
>> @@ -1005,6 +1003,7 @@ static void vmballoon_enqueue_page_list(struct 
>> vmballoon *b,
>>  enum vmballoon_page_size_type page_size)
>> {
>>  unsigned long flags;
>> +struct page *page;
>> 
>>  if (page_
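
(The hunk above is cut off in the archive. For reference only, the
enqueue-side change described in the commit message amounts to roughly the
following sketch, built from the driver's existing helpers; it is not the
posted hunk:)

	if (page_size == VMW_BALLOON_4K_PAGE) {
		/* 4KB pages: balloon_page_list_enqueue() marks them offline. */
		balloon_page_list_enqueue(&b->b_dev_info, pages);
	} else {
		/*
		 * 2MB pages are not handled by the balloon-compaction code,
		 * so mark them offline here, when they are added to the
		 * huge-page list.
		 */
		spin_lock_irqsave(&b->b_dev_info.pages_lock, flags);

		list_for_each_entry(page, pages, lru)
			vmballoon_mark_page_offline(page, VMW_BALLOON_2M_PAGE);

		list_splice_init(pages, &b->huge_pages);

		spin_unlock_irqrestore(&b->b_dev_info.pages_lock, flags);
	}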

Re: [RFC PATCH v2 1/3] x86/mm/tlb: Change __flush_tlb_one_user interface

2019-08-26 Thread Nadav Amit
> On Aug 26, 2019, at 12:51 AM, Juergen Gross  wrote:
> 
> On 24.08.19 00:52, Nadav Amit wrote:
>> __flush_tlb_one_user() currently flushes a single entry, and flushes it
>> both in the kernel and user page-tables, when PTI is enabled.
>> Change __flush_tlb_one_user() and related interfaces into
>> __flush_tlb_range() that flushes a range and does not flush the user
>> page-table.
>> This refactoring is needed for the next patch, but regardless makes
>> sense and has several advantages. First, only Xen-PV, which does not
>> use PTI, implements the paravirtual interface of flush_tlb_one_user() so
>> nothing is broken by separating the user and kernel page-table flushes,
>> and the interface is more intuitive.
>> Second, INVLPG can flush unrelated mappings, and it is also a
>> serializing instruction. It is better to have a tight loop that flushes
>> the entries.
>> Third, currently __flush_tlb_one_kernel() also flushes the user
>> page-tables, which is not needed. This change allows that redundant
>> flush to be avoided.
>> Cc: Boris Ostrovsky 
>> Cc: Juergen Gross 
>> Cc: Stefano Stabellini 
>> Cc: xen-de...@lists.xenproject.org
>> Signed-off-by: Nadav Amit 
>> ---
>>  arch/x86/include/asm/paravirt.h   |  5 ++--
>>  arch/x86/include/asm/paravirt_types.h |  3 ++-
>>  arch/x86/include/asm/tlbflush.h   | 24 +
>>  arch/x86/kernel/paravirt.c|  7 ++---
>>  arch/x86/mm/tlb.c | 39 ++-
>>  arch/x86/xen/mmu_pv.c | 21 +--
>>  6 files changed, 62 insertions(+), 37 deletions(-)
> 
> ...
> 
>> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
>> index 48f7c7eb4dbc..ed68657f5e77 100644
>> --- a/arch/x86/xen/mmu_pv.c
>> +++ b/arch/x86/xen/mmu_pv.c
>> @@ -1325,22 +1325,27 @@ static noinline void xen_flush_tlb(void)
>>  preempt_enable();
>>  }
>>  -static void xen_flush_tlb_one_user(unsigned long addr)
>> +static void xen_flush_tlb_range(unsigned long start, unsigned long end,
>> +u8 stride_shift)
>>  {
>>  struct mmuext_op *op;
>>  struct multicall_space mcs;
>> -
>> -trace_xen_mmu_flush_tlb_one_user(addr);
>> +unsigned long addr;
>>  preempt_disable();
>>  mcs = xen_mc_entry(sizeof(*op));
>>  op = mcs.args;
>> -op->cmd = MMUEXT_INVLPG_LOCAL;
>> -op->arg1.linear_addr = addr & PAGE_MASK;
>> -MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
>>  -   xen_mc_issue(PARAVIRT_LAZY_MMU);
>> +for (addr = start; addr < end; addr += 1ul << stride_shift) {
>> +trace_xen_mmu_flush_tlb_one_user(addr);
>> +
>> +op->cmd = MMUEXT_INVLPG_LOCAL;
>> +op->arg1.linear_addr = addr & PAGE_MASK;
>> +MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
>> +
>> +xen_mc_issue(PARAVIRT_LAZY_MMU);
>> +}
> 
> For this kind of usage (a loop) you should:
> 
> - replace the call of xen_mc_entry() with xen_mc_batch()
> - use xen_extend_mmuext_op() for each loop iteration
> - call xen_mc_issue() after the loop
> 
> Additionally I'd like you to replace trace_xen_mmu_flush_tlb_one_user()
> with trace_xen_mmu_flush_tlb_range() taking all three parameters and
> keep it where it was (out of the loop).
> 
> The paravirt parts seem to be okay.

Thanks for the quick response. I don’t think the preempt_disable/enable() is
needed in this case, but I kept them. Does the following match what you had
in mind?

---
 arch/x86/xen/mmu_pv.c  | 25 ++---
 include/trace/events/xen.h | 18 --
 2 files changed, 26 insertions(+), 17 deletions(-)

diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 48f7c7eb4dbc..faa01591df36 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -1325,20 +1325,23 @@ static noinline void xen_flush_tlb(void)
preempt_enable();
 }
 
-static void xen_flush_tlb_one_user(unsigned long addr)
+static void xen_flush_tlb_range(unsigned long start, unsigned long end,
+   u8 stride_shift)
 {
-   struct mmuext_op *op;
-   struct multicall_space mcs;
-
-   trace_xen_mmu_flush_tlb_one_user(addr);
+   struct mmuext_op op;
+   unsigned long addr;
 
preempt_disable();
 
-   mcs = xen_mc_entry(sizeof(*op));
-   op = mcs.args;
-   op->cmd = MMUEXT_INVLPG_LOCAL;
-   op->arg1.linear_addr = addr & PAGE_MASK;
-   MULTI_mmuext_op(mcs.mc, op, 1, NULL, DOMID_SELF);
+   xen_mc_batch();
+   op.cmd = MMUEXT_INVLPG_LOC
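
(The rest of the hunk is cut off in the archive. Following Juergen's
suggestion, the remainder of the function presumably continues roughly along
these lines; this is a reconstruction for readability, not the posted text:)

	for (addr = start; addr < end; addr += 1ul << stride_shift) {
		op.arg1.linear_addr = addr & PAGE_MASK;
		xen_extend_mmuext_op(&op);
	}

	/* Issue the whole batch once, after the loop. */
	xen_mc_issue(PARAVIRT_LAZY_MMU);

	preempt_enable();
}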

Re: [PATCH v4 1/9] smp: Run functions concurrently in smp_call_function_many()

2019-08-26 Thread Nadav Amit
> On Aug 23, 2019, at 3:41 PM, Nadav Amit  wrote:
> 
> Currently, on_each_cpu() and similar functions do not exploit the
> potential of concurrency: the function is first executed remotely and
> only then it is executed locally. Functions such as TLB flush can take
> considerable time, so this provides an opportunity for performance
> optimization.
> 
> To do so, introduce __smp_call_function_many(), which allows the callers
> to provide local and remote functions that should be executed, and run
> them concurrently. Keep smp_call_function_many() semantic as it is today
> for backward compatibility: the called function is not executed in this
> case locally.
> 
> __smp_call_function_many() does not use the optimized version for a
> single remote target that smp_call_function_single() implements. For
> synchronous function call, smp_call_function_single() keeps a
> call_single_data (which is used for synchronization) on the stack.
> Interestingly, it seems that not using this optimization provides
> greater performance improvements (greater speedup with a single remote
> target than with multiple ones). Presumably, holding data structures
> that are intended for synchronization on the stack can introduce
> overheads due to TLB misses and false-sharing when the stack is used for
> other purposes.
> 
> Adding support to run the functions concurrently required removing a
> micro-optimization in on_each_cpu() that disabled/enabled IRQs instead
> of saving/restoring them. The benefit of running the local and remote
> code concurrently is expected to be greater.
> 
> Reviewed-by: Dave Hansen 
> Cc: Peter Zijlstra 
> Cc: Rik van Riel 
> Cc: Thomas Gleixner 
> Cc: Andy Lutomirski 
> Cc: Josh Poimboeuf 
> Signed-off-by: Nadav Amit 
> ---
> include/linux/smp.h |  34 ---
> kernel/smp.c| 138 +---
> 2 files changed, 92 insertions(+), 80 deletions(-)
> 
> diff --git a/include/linux/smp.h b/include/linux/smp.h
> index 6fc856c9eda5..d18d54199635 100644
> --- a/include/linux/smp.h
> +++ b/include/linux/smp.h
> @@ -32,11 +32,6 @@ extern unsigned int total_cpus;
> int smp_call_function_single(int cpuid, smp_call_func_t func, void *info,
>int wait);
> 
> -/*
> - * Call a function on all processors
> - */
> -void on_each_cpu(smp_call_func_t func, void *info, int wait);
> -
> /*
>  * Call a function on processors specified by mask, which might include
>  * the local one.
> @@ -44,6 +39,17 @@ void on_each_cpu(smp_call_func_t func, void *info, int 
> wait);
> void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
>   void *info, bool wait);
> 
> +/*
> + * Call a function on all processors.  May be used during early boot while
> + * early_boot_irqs_disabled is set.
> + */
> +static inline void on_each_cpu(smp_call_func_t func, void *info, int wait)
> +{
> + preempt_disable();
> + on_each_cpu_mask(cpu_online_mask, func, info, wait);
> + preempt_enable();
> +}

Err.. I made this change at the last minute before sending, and apparently
forgot to build it, since it does not build.

Let me know if there is anything else with this version, though.

Re: [PATCH] x86/mm: Do not split_large_page() for set_kernel_text_rw()

2019-08-26 Thread Nadav Amit
> On Aug 26, 2019, at 8:56 AM, Steven Rostedt  wrote:
> 
> On Mon, 26 Aug 2019 15:41:24 +0000
> Nadav Amit  wrote:
> 
>>> Anyway, I believe Nadav has some patches that converts ftrace to use
>>> the shadow page modification trick somewhere.  
>> 
>> For the record - here is my previous patch:
>> https://lkml.org/lkml/2018/12/5/211
> 
> FYI, when referencing older patches, please use lkml.kernel.org or
> lore.kernel.org, lkml.org is slow and obsolete.
> 
> ie. http://lkml.kernel.org/r/20181205013408.47725-9-na...@vmware.com

Will do so next time.



Re: [PATCH] x86/mm: Do not split_large_page() for set_kernel_text_rw()

2019-08-26 Thread Nadav Amit
> On Aug 26, 2019, at 4:33 AM, Steven Rostedt  wrote:
> 
> On Fri, 23 Aug 2019 11:36:37 +0200
> Peter Zijlstra  wrote:
> 
>> On Thu, Aug 22, 2019 at 10:23:35PM -0700, Song Liu wrote:
>>> As 4k pages check was removed from cpa [1], set_kernel_text_rw() leads to
>>> split_large_page() for all kernel text pages. This means a single kprobe
>>> will put all kernel text in 4k pages:
>>> 
>>>  root@ ~# grep 8100- /sys/kernel/debug/page_tables/kernel
>>>  0x8100-0x8240 20M  roPSE  x  pmd
>>> 
>>>  root@ ~# echo ONE_KPROBE >> /sys/kernel/debug/tracing/kprobe_events
>>>  root@ ~# echo 1 > /sys/kernel/debug/tracing/events/kprobes/enable
>>> 
>>>  root@ ~# grep 8100- /sys/kernel/debug/page_tables/kernel
>>>  0x8100-0x8240 20M  ro x  pte
>>> 
>>> To fix this issue, introduce CPA_FLIP_TEXT_RW to bypass "Text RO" check
>>> in static_protections().
>>> 
>>> Two helper functions set_text_rw() and set_text_ro() are added to flip
>>> _PAGE_RW bit for kernel text.
>>> 
>>> [1] commit 585948f4f695 ("x86/mm/cpa: Avoid the 4k pages check completely") 
>>>  
>> 
>> ARGH; so this is because ftrace flips the whole kernel range to RW and
>> back for giggles? I'm thinking _that_ is a bug, it's a clear W^X
>> violation.
> 
> Since ftrace did this way before text_poke existed and way before
> anybody cared (back in 2007), it's not really a bug.
> 
> Anyway, I believe Nadav has some patches that converts ftrace to use
> the shadow page modification trick somewhere.

For the record - here is my previous patch:
https://lkml.org/lkml/2018/12/5/211



[RFC PATCH v2 1/3] x86/mm/tlb: Change __flush_tlb_one_user interface

2019-08-24 Thread Nadav Amit
__flush_tlb_one_user() currently flushes a single entry, and flushes it
both in the kernel and user page-tables, when PTI is enabled.

Change __flush_tlb_one_user() and related interfaces into
__flush_tlb_range() that flushes a range and does not flush the user
page-table.

This refactoring is needed for the next patch, but regardless makes
sense and has several advantages. First, only Xen-PV, which does not
use PTI, implements the paravirtual interface of flush_tlb_one_user() so
nothing is broken by separating the user and kernel page-table flushes,
and the interface is more intuitive.

Second, INVLPG can flush unrelated mappings, and it is also a
serializing instruction. It is better to have a tight loop that flushes
the entries.

Third, currently __flush_tlb_one_kernel() also flushes the user
page-tables, which is not needed. This change allows that redundant
flush to be avoided.

Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: xen-de...@lists.xenproject.org
Signed-off-by: Nadav Amit 
---
 arch/x86/include/asm/paravirt.h   |  5 ++--
 arch/x86/include/asm/paravirt_types.h |  3 ++-
 arch/x86/include/asm/tlbflush.h   | 24 +
 arch/x86/kernel/paravirt.c|  7 ++---
 arch/x86/mm/tlb.c | 39 ++-
 arch/x86/xen/mmu_pv.c | 21 +--
 6 files changed, 62 insertions(+), 37 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index bc4829c9b3f9..7347328eacd3 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -57,9 +57,10 @@ static inline void __flush_tlb_global(void)
PVOP_VCALL0(mmu.flush_tlb_kernel);
 }
 
-static inline void __flush_tlb_one_user(unsigned long addr)
+static inline void __flush_tlb_range(unsigned long start, unsigned long end,
+u8 stride_shift)
 {
-   PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
+   PVOP_VCALL3(mmu.flush_tlb_range, start, end, stride_shift);
 }
 
 static inline void flush_tlb_multi(const struct cpumask *cpumask,
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 63fa751344bf..a87a5f236251 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -205,7 +205,8 @@ struct pv_mmu_ops {
/* TLB operations */
void (*flush_tlb_user)(void);
void (*flush_tlb_kernel)(void);
-   void (*flush_tlb_one_user)(unsigned long addr);
+   void (*flush_tlb_range)(unsigned long start, unsigned long end,
+   u8 stride_shift);
void (*flush_tlb_multi)(const struct cpumask *cpus,
const struct flush_tlb_info *info);
 
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 1f88ea410ff3..421bc82504e2 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -145,7 +145,7 @@ static inline unsigned long build_cr3_noflush(pgd_t *pgd, 
u16 asid)
 #else
 #define __flush_tlb() __native_flush_tlb()
 #define __flush_tlb_global() __native_flush_tlb_global()
-#define __flush_tlb_one_user(addr) __native_flush_tlb_one_user(addr)
+#define __flush_tlb_range(start, end, stride_shift) 
__native_flush_tlb_range(start, end, stride_shift)
 #endif
 
 struct tlb_context {
@@ -454,23 +454,13 @@ static inline void __native_flush_tlb_global(void)
 /*
  * flush one page in the user mapping
  */
-static inline void __native_flush_tlb_one_user(unsigned long addr)
+static inline void __native_flush_tlb_range(unsigned long start,
+   unsigned long end, u8 stride_shift)
 {
-   u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+   unsigned long addr;
 
-   asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
-
-   if (!static_cpu_has(X86_FEATURE_PTI))
-   return;
-
-   /*
-* Some platforms #GP if we call invpcid(type=1/2) before CR4.PCIDE=1.
-* Just use invalidate_user_asid() in case we are called early.
-*/
-   if (!this_cpu_has(X86_FEATURE_INVPCID_SINGLE))
-   invalidate_user_asid(loaded_mm_asid);
-   else
-   invpcid_flush_one(user_pcid(loaded_mm_asid), addr);
+   for (addr = start; addr < end; addr += 1ul << stride_shift)
+   asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
 }
 
 /*
@@ -512,7 +502,7 @@ static inline void __flush_tlb_one_kernel(unsigned long 
addr)
 * kernel address space and for its usermode counterpart, but it does
 * not flush it for other address spaces.
 */
-   __flush_tlb_one_user(addr);
+   __flush_tlb_range(addr, addr + PAGE_SIZE, PAGE_SHIFT);
 
if (!static_cpu_has(X86_FEATURE_PTI))
return;
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 5520a04c

[RFC PATCH v2 3/3] x86/mm/tlb: Avoid deferring PTI flushes on shootdown

2019-08-24 Thread Nadav Amit
When a shootdown is initiated, the initiating CPU has cycles to burn as
it waits for the responding CPUs to receive the IPI and acknowledge it.
In these cycles it is better to flush the user page-tables using
INVPCID, instead of deferring the TLB flush.

The best way to figure out whether there are cycles to burn is arguably
to expose from the SMP layer when an acknowledgment is received.
However, this would break some abstractions.

Instead, use a simpler solution: the initiating CPU of a TLB shootdown
would not defer PTI flushes. It is not always a win, relative to
deferring user page-table flushes, but it prevents a performance
regression.

Signed-off-by: Nadav Amit 
---
 arch/x86/include/asm/tlbflush.h |  1 +
 arch/x86/mm/tlb.c   | 10 +-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index da56aa3ccd07..066b3804f876 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -573,6 +573,7 @@ struct flush_tlb_info {
unsigned intinitiating_cpu;
u8  stride_shift;
u8  freed_tables;
+   u8  shootdown;
 };
 
 #define local_flush_tlb() __flush_tlb()
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 31260c55d597..ba50430275d4 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -592,8 +592,13 @@ static void flush_tlb_user_pt_range(u16 asid, const struct 
flush_tlb_info *f)
 
/*
 * We can defer flushes as long as page-tables were not freed.
+*
+* However, if there is a shootdown the initiating CPU has cycles to
+* spare, while it waits for the other cores to respond. In this case,
+* deferring the flushing can cause overheads, so avoid it.
 */
-   if (IS_ENABLED(CONFIG_X86_64) && !f->freed_tables) {
+   if (IS_ENABLED(CONFIG_X86_64) && !f->freed_tables &&
+   (!f->shootdown || f->initiating_cpu != smp_processor_id())) {
flush_user_tlb_deferred(asid, start, end, stride_shift);
return;
}
@@ -861,6 +866,7 @@ static struct flush_tlb_info *get_flush_tlb_info(struct 
mm_struct *mm,
info->freed_tables  = freed_tables;
info->new_tlb_gen   = new_tlb_gen;
info->initiating_cpu= smp_processor_id();
+   info->shootdown = false;
 
return info;
 }
@@ -903,6 +909,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long 
start,
 * flush_tlb_func_local() directly in this case.
 */
if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
+   info->shootdown = true;
flush_tlb_multi(mm_cpumask(mm), info);
} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
lockdep_assert_irqs_enabled();
@@ -970,6 +977,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch 
*batch)
 * flush_tlb_func_local() directly in this case.
 */
	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
+   info->shootdown = true;
		flush_tlb_multi(&batch->cpumask, info);
	} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
lockdep_assert_irqs_enabled();
-- 
2.17.1



[RFC PATCH v2 2/3] x86/mm/tlb: Defer PTI flushes

2019-08-24 Thread Nadav Amit
INVPCID is considerably slower than INVLPG of a single PTE. Using it to
flush the user page-tables when PTI is enabled therefore introduces
significant overhead.

Instead, unless page-tables are released, it is possible to defer the
flushing of the user page-tables until the time the code returns to
userspace. These page-tables are not in use, so deferring their flushes is not
a security hazard. When CR3 is loaded, as part of returning to userspace,
use INVLPG to flush the relevant PTEs. Use LFENCE to prevent speculative
executions that skip INVLPG.

There are some caveats, which sometimes require a full TLB flush of the
user page-tables. There are some (uncommon) code-paths that reload CR3
in which there is no stack. If a context-switch happens and there are
pending flushes, tracking which TLB flushes are later needed is
complicated and expensive. If there are multiple TLB flushes of
different ranges before the kernel returns to userspace, the overhead of
tracking them can exceed the benefit.

In these cases, perform a full TLB flush. It is possible to avoid them
in some cases, but the benefit in doing so is questionable.

Signed-off-by: Nadav Amit 
---
 arch/x86/entry/calling.h| 52 ++--
 arch/x86/include/asm/tlbflush.h | 30 +++---
 arch/x86/kernel/asm-offsets.c   |  3 ++
 arch/x86/mm/tlb.c   | 70 +
 4 files changed, 147 insertions(+), 8 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 515c0ceeb4a3..a4d46416853d 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
 
@@ -205,7 +206,16 @@ For 32-bit we have the following conventions - kernel is 
built with
 #define THIS_CPU_user_pcid_flush_mask   \
PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_pcid_flush_mask
 
-.macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
+#define THIS_CPU_user_flush_start  \
+   PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_flush_start
+
+#define THIS_CPU_user_flush_end\
+   PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_flush_end
+
+#define THIS_CPU_user_flush_stride_shift   \
+   PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_flush_stride_shift
+
+.macro SWITCH_TO_USER_CR3 scratch_reg:req scratch_reg2:req has_stack:req
ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
mov %cr3, \scratch_reg
 
@@ -221,9 +231,41 @@ For 32-bit we have the following conventions - kernel is 
built with
 
/* Flush needed, clear the bit */
btr \scratch_reg, THIS_CPU_user_pcid_flush_mask
+.if \has_stack
+   cmpq$(TLB_FLUSH_ALL), THIS_CPU_user_flush_end
+   jnz .Lpartial_flush_\@
+.Ldo_full_flush_\@:
+.endif
movq\scratch_reg2, \scratch_reg
jmp .Lwrcr3_pcid_\@
-
+.if \has_stack
+.Lpartial_flush_\@:
+   /* Prepare CR3 with PGD of user, and no flush set */
+   orq $(PTI_USER_PGTABLE_AND_PCID_MASK), \scratch_reg2
+   SET_NOFLUSH_BIT \scratch_reg2
+   pushq   %rsi
+   pushq   %rbx
+   pushq   %rcx
+   movbTHIS_CPU_user_flush_stride_shift, %cl
+   movq$1, %rbx
+   shl %cl, %rbx
+   movqTHIS_CPU_user_flush_start, %rsi
+   movqTHIS_CPU_user_flush_end, %rcx
+   /* Load the new cr3 and flush */
+   mov \scratch_reg2, %cr3
+.Lflush_loop_\@:
+   invlpg  (%rsi)
+   addq%rbx, %rsi
+   cmpq%rsi, %rcx
+   ja  .Lflush_loop_\@
+   /* Prevent speculatively skipping flushes */
+   lfence
+
+   popq%rcx
+   popq%rbx
+   popq%rsi
+   jmp .Lend_\@
+.endif
 .Lnoflush_\@:
movq\scratch_reg2, \scratch_reg
SET_NOFLUSH_BIT \scratch_reg
@@ -239,9 +281,13 @@ For 32-bit we have the following conventions - kernel is 
built with
 .Lend_\@:
 .endm
 
+.macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
+   SWITCH_TO_USER_CR3 scratch_reg=\scratch_reg scratch_reg2=%rax 
has_stack=0
+.endm
+
 .macro SWITCH_TO_USER_CR3_STACKscratch_reg:req
pushq   %rax
-   SWITCH_TO_USER_CR3_NOSTACK scratch_reg=\scratch_reg scratch_reg2=%rax
+   SWITCH_TO_USER_CR3 scratch_reg=\scratch_reg scratch_reg2=%rax 
has_stack=1
popq%rax
 .endm
 
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 421bc82504e2..da56aa3ccd07 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -2,6 +2,10 @@
 #ifndef _ASM_X86_TLBFLUSH_H
 #define _ASM_X86_TLBFLUSH_H
 
+#define TLB_FLUSH_ALL  -1UL
+
+#ifndef __ASSEMBLY__
+
 #include 
 #include 
 
@@ -222,6 +226,10 @@ struct tlb_state {
 * context 0.
 */
struct tlb_context ctxs[TLB_NR_DYN_ASIDS];
+
+   unsigned long user_flush_start;
+   unsigned long user_flush_end;
+   unsigned long user_flush_stride_shift;
 };
 DECLARE_PER_CPU_ALIGNED(struct tlb
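
(The arch/x86/mm/tlb.c part of this patch is cut off in the archive. The
deferral bookkeeping it describes would look roughly like the sketch below;
the function name matches the one referenced by patch 3/3, but the exact
logic shown here is an assumption, not the posted code:)

	static void flush_user_tlb_deferred(u16 asid, unsigned long start,
					    unsigned long end, u8 stride_shift)
	{
		/* Ask the return-to-userspace path to flush this user ASID. */
		invalidate_user_asid(asid);

		if (this_cpu_read(cpu_tlbstate.user_flush_end) == 0) {
			/* First deferred flush: record the exact range. */
			this_cpu_write(cpu_tlbstate.user_flush_start, start);
			this_cpu_write(cpu_tlbstate.user_flush_end, end);
			this_cpu_write(cpu_tlbstate.user_flush_stride_shift,
				       stride_shift);
			return;
		}

		/*
		 * A flush is already pending: rather than tracking multiple
		 * ranges, fall back to a full flush of the user page-tables
		 * on the next return to userspace.
		 */
		this_cpu_write(cpu_tlbstate.user_flush_end, TLB_FLUSH_ALL);
	}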

[RFC PATCH v2 0/3] x86/mm/tlb: Defer TLB flushes with PTI

2019-08-24 Thread Nadav Amit
INVPCID is considerably slower than INVLPG of a single PTE, but it is
currently used to flush PTEs in the user page-table when PTI is used.

Instead, it is possible to defer TLB flushes until after the user
page-tables are loaded. Preventing speculation over the TLB flushes
should keep the whole thing safe. In some cases, deferring TLB flushes
in such a way can result in more full TLB flushes, but arguably this
behavior is oftentimes beneficial.

These patches are based and evaluated on top of the concurrent
TLB-flushes v4 patch-set.

I will provide more results later, but it might be easier to look at the
time an isolated TLB flush takes. These numbers are from Skylake, showing the
number of cycles that running madvise(DONTNEED), which results in local TLB
flushes, takes:

n_pages		concurrent	+deferred-pti	change
-------		----------	-------------	------
1		2119		1986		-6.7%
10		6791		5417		-20%

Please let me know if I missed something that affects security or
performance.

[ Yes, I know there is another pending RFC for async TLB flushes, but I
  think it might be easier to merge this one first ]

RFC v1 -> RFC v2:
  * Wrong patches were sent before

Nadav Amit (3):
  x86/mm/tlb: Change __flush_tlb_one_user interface
  x86/mm/tlb: Defer PTI flushes
  x86/mm/tlb: Avoid deferring PTI flushes on shootdown

 arch/x86/entry/calling.h  |  52 +++-
 arch/x86/include/asm/paravirt.h   |   5 +-
 arch/x86/include/asm/paravirt_types.h |   3 +-
 arch/x86/include/asm/tlbflush.h   |  55 +++-
 arch/x86/kernel/asm-offsets.c |   3 +
 arch/x86/kernel/paravirt.c|   7 +-
 arch/x86/mm/tlb.c | 117 --
 arch/x86/xen/mmu_pv.c |  21 +++--
 8 files changed, 218 insertions(+), 45 deletions(-)

-- 
2.17.1



Re: [RFC PATCH 0/3] x86/mm/tlb: Defer TLB flushes with PTI

2019-08-24 Thread Nadav Amit
Sorry, I made a mistake and included the wrong patches. I will send
RFC v2 in a few minutes.


> On Aug 23, 2019, at 3:46 PM, Nadav Amit  wrote:
> 
> INVPCID is considerably slower than INVLPG of a single PTE, but it is
> currently used to flush PTEs in the user page-table when PTI is used.
> 
> Instead, it is possible to defer TLB flushes until after the user
> page-tables are loaded. Preventing speculation over the TLB flushes
> should keep the whole thing safe. In some cases, deferring TLB flushes
> in such a way can result in more full TLB flushes, but arguably this
> behavior is oftentimes beneficial.
> 
> These patches are based and evaluated on top of the concurrent
> TLB-flushes v4 patch-set.
> 
> I will provide more results later, but it might be easier to look at the
> time an isolated TLB flush takes. These numbers are from Skylake, showing the
> number of cycles that running madvise(DONTNEED), which results in local TLB
> flushes, takes:
> 
> n_pages		concurrent	+deferred-pti	change
> -------		----------	-------------	------
> 1		2119		1986		-6.7%
> 10		6791		5417		-20%
> 
> Please let me know if I missed something that affects security or
> performance.
> 
> [ Yes, I know there is another pending RFC for async TLB flushes, but I
>  think it might be easier to merge this one first ]
> 
> Nadav Amit (3):
>  x86/mm/tlb: Defer PTI flushes
>  x86/mm/tlb: Avoid deferring PTI flushes on shootdown
>  x86/mm/tlb: Use lockdep irq assertions
> 
> arch/x86/entry/calling.h| 52 +++--
> arch/x86/include/asm/tlbflush.h | 31 ++--
> arch/x86/kernel/asm-offsets.c   |  3 ++
> arch/x86/mm/tlb.c   | 83 +++--
> 4 files changed, 158 insertions(+), 11 deletions(-)
> 
> -- 
> 2.17.1




[RFC PATCH 1/3] x86/mm/tlb: Defer PTI flushes

2019-08-24 Thread Nadav Amit
INVPCID is considerably slower than INVLPG of a single PTE. Using it to
flush the user page-tables when PTI is enabled therefore introduces
significant overhead.

Instead, unless page-tables are released, it is possible to defer the
flushing of the user page-tables until the time the code returns to
userspace. These page-tables are not in use, so deferring their flushes is not
a security hazard. When CR3 is loaded, as part of returning to userspace,
use INVLPG to flush the relevant PTEs. Use LFENCE to prevent speculative
executions that skip INVLPG.

There are some caveats, which sometimes require a full TLB flush of the
user page-tables. There are some (uncommon) code-paths that reload CR3
in which there is no stack. If a context-switch happens and there are
pending flushes, tracking which TLB flushes are later needed is
complicated and expensive. If there are multiple TLB flushes of
different ranges before the kernel returns to userspace, the overhead of
tracking them can exceed the benefit.

In these cases, perform a full TLB flush. It is possible to avoid them
in some cases, but the benefit in doing so is questionable.

Signed-off-by: Nadav Amit 
---
 arch/x86/entry/calling.h| 52 ++--
 arch/x86/include/asm/tlbflush.h | 30 +++---
 arch/x86/kernel/asm-offsets.c   |  3 ++
 arch/x86/mm/tlb.c   | 70 +
 4 files changed, 147 insertions(+), 8 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 515c0ceeb4a3..a4d46416853d 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /*
 
@@ -205,7 +206,16 @@ For 32-bit we have the following conventions - kernel is 
built with
 #define THIS_CPU_user_pcid_flush_mask   \
PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_pcid_flush_mask
 
-.macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
+#define THIS_CPU_user_flush_start  \
+   PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_flush_start
+
+#define THIS_CPU_user_flush_end\
+   PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_flush_end
+
+#define THIS_CPU_user_flush_stride_shift   \
+   PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_flush_stride_shift
+
+.macro SWITCH_TO_USER_CR3 scratch_reg:req scratch_reg2:req has_stack:req
ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
mov %cr3, \scratch_reg
 
@@ -221,9 +231,41 @@ For 32-bit we have the following conventions - kernel is 
built with
 
/* Flush needed, clear the bit */
btr \scratch_reg, THIS_CPU_user_pcid_flush_mask
+.if \has_stack
+   cmpq$(TLB_FLUSH_ALL), THIS_CPU_user_flush_end
+   jnz .Lpartial_flush_\@
+.Ldo_full_flush_\@:
+.endif
movq\scratch_reg2, \scratch_reg
jmp .Lwrcr3_pcid_\@
-
+.if \has_stack
+.Lpartial_flush_\@:
+   /* Prepare CR3 with PGD of user, and no flush set */
+   orq $(PTI_USER_PGTABLE_AND_PCID_MASK), \scratch_reg2
+   SET_NOFLUSH_BIT \scratch_reg2
+   pushq   %rsi
+   pushq   %rbx
+   pushq   %rcx
+   movbTHIS_CPU_user_flush_stride_shift, %cl
+   movq$1, %rbx
+   shl %cl, %rbx
+   movqTHIS_CPU_user_flush_start, %rsi
+   movqTHIS_CPU_user_flush_end, %rcx
+   /* Load the new cr3 and flush */
+   mov \scratch_reg2, %cr3
+.Lflush_loop_\@:
+   invlpg  (%rsi)
+   addq%rbx, %rsi
+   cmpq%rsi, %rcx
+   ja  .Lflush_loop_\@
+   /* Prevent speculatively skipping flushes */
+   lfence
+
+   popq%rcx
+   popq%rbx
+   popq%rsi
+   jmp .Lend_\@
+.endif
 .Lnoflush_\@:
movq\scratch_reg2, \scratch_reg
SET_NOFLUSH_BIT \scratch_reg
@@ -239,9 +281,13 @@ For 32-bit we have the following conventions - kernel is 
built with
 .Lend_\@:
 .endm
 
+.macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
+   SWITCH_TO_USER_CR3 scratch_reg=\scratch_reg scratch_reg2=%rax 
has_stack=0
+.endm
+
 .macro SWITCH_TO_USER_CR3_STACKscratch_reg:req
pushq   %rax
-   SWITCH_TO_USER_CR3_NOSTACK scratch_reg=\scratch_reg scratch_reg2=%rax
+   SWITCH_TO_USER_CR3 scratch_reg=\scratch_reg scratch_reg2=%rax 
has_stack=1
popq%rax
 .endm
 
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 421bc82504e2..da56aa3ccd07 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -2,6 +2,10 @@
 #ifndef _ASM_X86_TLBFLUSH_H
 #define _ASM_X86_TLBFLUSH_H
 
+#define TLB_FLUSH_ALL  -1UL
+
+#ifndef __ASSEMBLY__
+
 #include 
 #include 
 
@@ -222,6 +226,10 @@ struct tlb_state {
 * context 0.
 */
struct tlb_context ctxs[TLB_NR_DYN_ASIDS];
+
+   unsigned long user_flush_start;
+   unsigned long user_flush_end;
+   unsigned long user_flush_stride_shift;
 };
 DECLARE_PER_CPU_ALIGNED(struct tlb

[RFC PATCH 2/3] x86/mm/tlb: Avoid deferring PTI flushes on shootdown

2019-08-24 Thread Nadav Amit
When a shootdown is initiated, the initiating CPU has cycles to burn as
it waits for the responding CPUs to receive the IPI and acknowledge it.
In these cycles it is better to flush the user page-tables using
INVPCID, instead of deferring the TLB flush.

The best way to figure out whether there are cycles to burn is arguably
to expose from the SMP layer when an acknowledgment is received.
However, this would break some abstractions.

Instead, use a simpler solution: the initiating CPU of a TLB shootdown
would not defer PTI flushes. It is not always a win, relative to
deferring user page-table flushes, but it prevents a performance
regression.

Signed-off-by: Nadav Amit 
---
 arch/x86/include/asm/tlbflush.h |  1 +
 arch/x86/mm/tlb.c   | 10 +-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index da56aa3ccd07..066b3804f876 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -573,6 +573,7 @@ struct flush_tlb_info {
unsigned intinitiating_cpu;
u8  stride_shift;
u8  freed_tables;
+   u8  shootdown;
 };
 
 #define local_flush_tlb() __flush_tlb()
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 31260c55d597..ba50430275d4 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -592,8 +592,13 @@ static void flush_tlb_user_pt_range(u16 asid, const struct 
flush_tlb_info *f)
 
/*
 * We can defer flushes as long as page-tables were not freed.
+*
+* However, if there is a shootdown the initiating CPU has cycles to
+* spare, while it waits for the other cores to respond. In this case,
+* deferring the flushing can cause overheads, so avoid it.
 */
-   if (IS_ENABLED(CONFIG_X86_64) && !f->freed_tables) {
+   if (IS_ENABLED(CONFIG_X86_64) && !f->freed_tables &&
+   (!f->shootdown || f->initiating_cpu != smp_processor_id())) {
flush_user_tlb_deferred(asid, start, end, stride_shift);
return;
}
@@ -861,6 +866,7 @@ static struct flush_tlb_info *get_flush_tlb_info(struct 
mm_struct *mm,
info->freed_tables  = freed_tables;
info->new_tlb_gen   = new_tlb_gen;
info->initiating_cpu= smp_processor_id();
+   info->shootdown = false;
 
return info;
 }
@@ -903,6 +909,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long 
start,
 * flush_tlb_func_local() directly in this case.
 */
if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) {
+   info->shootdown = true;
flush_tlb_multi(mm_cpumask(mm), info);
} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
lockdep_assert_irqs_enabled();
@@ -970,6 +977,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch 
*batch)
 * flush_tlb_func_local() directly in this case.
 */
	if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) {
+   info->shootdown = true;
		flush_tlb_multi(&batch->cpumask, info);
	} else if (cpumask_test_cpu(cpu, &batch->cpumask)) {
lockdep_assert_irqs_enabled();
-- 
2.17.1



[RFC PATCH 0/3] x86/mm/tlb: Defer TLB flushes with PTI

2019-08-24 Thread Nadav Amit
INVPCID is considerably slower than INVLPG of a single PTE, but it is
currently used to flush PTEs in the user page-table when PTI is used.

Instead, it is possible to defer TLB flushes until after the user
page-tables are loaded. Preventing speculation over the TLB flushes
should keep the whole thing safe. In some cases, deferring TLB flushes
in such a way can result in more full TLB flushes, but arguably this
behavior is oftentimes beneficial.

These patches are based and evaluated on top of the concurrent
TLB-flushes v4 patch-set.

I will provide more results later, but it might be easier to look at the
time an isolated TLB flush takes. These numbers are from Skylake, showing the
number of cycles that running madvise(DONTNEED), which results in local TLB
flushes, takes:

n_pages		concurrent	+deferred-pti	change
-------		----------	-------------	------
1		2119		1986		-6.7%
10		6791		5417		-20%

Please let me know if I missed something that affects security or
performance.

[ Yes, I know there is another pending RFC for async TLB flushes, but I
  think it might be easier to merge this one first ]

Nadav Amit (3):
  x86/mm/tlb: Defer PTI flushes
  x86/mm/tlb: Avoid deferring PTI flushes on shootdown
  x86/mm/tlb: Use lockdep irq assertions

 arch/x86/entry/calling.h| 52 +++--
 arch/x86/include/asm/tlbflush.h | 31 ++--
 arch/x86/kernel/asm-offsets.c   |  3 ++
 arch/x86/mm/tlb.c   | 83 +++--
 4 files changed, 158 insertions(+), 11 deletions(-)

-- 
2.17.1



[RFC PATCH 3/3] x86/mm/tlb: Use lockdep irq assertions

2019-08-24 Thread Nadav Amit
The assertions that check whether IRQs are disabled currently depend on
different debug features. Use lockdep_assert_irqs_disabled() instead,
which is standard, enabled by the same debug feature, and provides more
information upon failure.

Signed-off-by: Nadav Amit 
---
 arch/x86/mm/tlb.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index ba50430275d4..6f4ce02e2c5b 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -293,8 +293,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
 */
 
/* We don't want flush_tlb_func() to run concurrently with us. */
-   if (IS_ENABLED(CONFIG_PROVE_LOCKING))
-   WARN_ON_ONCE(!irqs_disabled());
+   lockdep_assert_irqs_disabled();
 
/*
 * Verify that CR3 is what we think it is.  This will catch
@@ -643,7 +642,7 @@ static void flush_tlb_func(void *info)
unsigned long nr_invalidate = 0;
 
/* This code cannot presently handle being reentered. */
-   VM_WARN_ON(!irqs_disabled());
+   lockdep_assert_irqs_disabled();
 
if (!local) {
inc_irq_stat(irq_tlb_count);
-- 
2.17.1



[PATCH 3/7] x86/percpu: Use C for percpu accesses when possible

2019-08-24 Thread Nadav Amit
The percpu code mostly uses inline assembly. Using segment qualifiers
allows C code to be used instead, which enables the compiler to perform
various optimizations (e.g., CSE). For example, in __schedule() the
following two instructions:

  mov    %gs:0x7e5f1eff(%rip),%edx        # 0x10350
  movslq %edx,%rdx

Turn with this patch into:

  movslq %gs:0x7e5f2e6e(%rip),%rax        # 0x10350

In addition, operations that have no guarantee against concurrent
interrupts or preemption, such as __this_cpu_cmpxchg(), can be further
optimized by the compiler when they are implemented in C, for example
in call_timer_fn().

Signed-off-by: Nadav Amit 
---
 arch/x86/include/asm/percpu.h | 115 +++---
 1 file changed, 105 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 1fe348884477..13987f9bc82f 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -439,13 +439,88 @@ do {  
\
  */
 #define this_cpu_read_stable(var)  percpu_stable_op("mov", var)
 
+#if USE_X86_SEG_SUPPORT
+
+#define __raw_cpu_read(qual, pcp)  \
+({ \
+   *(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp));   \
+})
+
+#define __raw_cpu_write(qual, pcp, val)
\
+   do {\
+   *(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp)) = (val); \
+   } while (0)
+
+/*
+ * Performance-wise, C operations are only more efficient than their inline
+ * assembly counterparts for non-volatile variables (__this_*) and for volatile
+ * loads and stores.
+ *
+ * Since we do not use assembly, we are free to define 64-bit operations
+ * on 32-bit architecture
+ */
+#define __raw_cpu_add(pcp, val)do { __my_cpu_var(pcp) += 
(val); } while (0)
+#define __raw_cpu_and(pcp, val)do { __my_cpu_var(pcp) &= 
(val); } while (0)
+#define __raw_cpu_or(pcp, val) do { __my_cpu_var(pcp) |= (val); } 
while (0)
+#define __raw_cpu_add_return(pcp, val) ({ __my_cpu_var(pcp) += (val); })
+
+#define __raw_cpu_xchg(pcp, val)   \
+({ \
+   typeof(pcp) pxo_ret__ = __my_cpu_var(pcp);  \
+   \
+   __my_cpu_var(pcp) = (val);  \
+   pxo_ret__;  \
+})
+
+#define __raw_cpu_cmpxchg(pcp, oval, nval) \
+({ \
+   __my_cpu_type(pcp) *__p = __my_cpu_ptr(&(pcp)); \
+   \
+   typeof(pcp) __ret = *__p;   \
+   \
+   if (__ret == (oval))\
+   *__p = nval;\
+   __ret;  \
+})
+
+#define raw_cpu_read_1(pcp)__raw_cpu_read(, pcp)
+#define raw_cpu_read_2(pcp)__raw_cpu_read(, pcp)
+#define raw_cpu_read_4(pcp)__raw_cpu_read(, pcp)
+#define raw_cpu_write_1(pcp, val)  __raw_cpu_write(, pcp, val)
+#define raw_cpu_write_2(pcp, val)  __raw_cpu_write(, pcp, val)
+#define raw_cpu_write_4(pcp, val)  __raw_cpu_write(, pcp, val)
+#define raw_cpu_add_1(pcp, val)__raw_cpu_add(pcp, val)
+#define raw_cpu_add_2(pcp, val)__raw_cpu_add(pcp, val)
+#define raw_cpu_add_4(pcp, val)__raw_cpu_add(pcp, val)
+#define raw_cpu_and_1(pcp, val)__raw_cpu_and(pcp, val)
+#define raw_cpu_and_2(pcp, val)__raw_cpu_and(pcp, val)
+#define raw_cpu_and_4(pcp, val)__raw_cpu_and(pcp, val)
+#define raw_cpu_or_1(pcp, val) __raw_cpu_or(pcp, val)
+#define raw_cpu_or_2(pcp, val) __raw_cpu_or(pcp, val)
+#define raw_cpu_or_4(pcp, val) __raw_cpu_or(pcp, val)
+#define raw_cpu_xchg_1(pcp, val)   __raw_cpu_xchg(pcp, val)
+#define raw_cpu_xchg_2(pcp, val)   __raw_cpu_xchg(pcp, val)
+#define raw_cpu_xchg_4(pcp, val)   __raw_cpu_xchg(pcp, val)
+#define raw_cpu_add_return_1(pcp, val) __raw_cpu_add_return(pcp, val)
+#define raw_cpu_add_return_2(pcp, val) __raw_cpu_add_return(pcp, val)
+#define raw_cpu_add_return_4(pcp, val) __raw_cpu_add_return(pcp, val)
+#define raw_cpu_add_return_8(pcp, val) __raw_cpu_add_return(pcp, val)
+#define raw_cpu_cmpxchg_1(pcp, oval, nval) __raw_cp

[PATCH 5/7] percpu: Assume preemption is disabled on per_cpu_ptr()

2019-08-24 Thread Nadav Amit
When per_cpu_ptr() is used, the caller should have preemption disabled,
as otherwise the pointer is meaningless. If the user wants an "unstable"
pointer he should call raw_cpu_ptr().

Add an assertion to check that indeed preemption is disabled, and
distinguish between the two cases to allow further, per-arch
optimizations.

Signed-off-by: Nadav Amit 
---
 include/asm-generic/percpu.h | 12 
 include/linux/percpu-defs.h  | 33 -
 2 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/percpu.h b/include/asm-generic/percpu.h
index c2de013b2cf4..7853605f4210 100644
--- a/include/asm-generic/percpu.h
+++ b/include/asm-generic/percpu.h
@@ -36,6 +36,14 @@ extern unsigned long __per_cpu_offset[NR_CPUS];
 #define my_cpu_offset __my_cpu_offset
 #endif
 
+/*
+ * Determine the offset of the current active processor when preemption is
+ * disabled. Can be overriden by arch code.
+ */
+#ifndef __raw_my_cpu_offset
+#define __raw_my_cpu_offset __my_cpu_offset
+#endif
+
 /*
  * Arch may define arch_raw_cpu_ptr() to provide more efficient address
  * translations for raw_cpu_ptr().
@@ -44,6 +52,10 @@ extern unsigned long __per_cpu_offset[NR_CPUS];
 #define arch_raw_cpu_ptr(ptr) SHIFT_PERCPU_PTR(ptr, __my_cpu_offset)
 #endif
 
+#ifndef arch_raw_cpu_ptr_preemptable
+#define arch_raw_cpu_ptr_preemptable(ptr) SHIFT_PERCPU_PTR(ptr, 
__raw_my_cpu_offset)
+#endif
+
 #ifdef CONFIG_HAVE_SETUP_PER_CPU_AREA
 extern void setup_per_cpu_areas(void);
 #endif
diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h
index a6fabd865211..13afca8a37e7 100644
--- a/include/linux/percpu-defs.h
+++ b/include/linux/percpu-defs.h
@@ -237,20 +237,51 @@ do {  
\
SHIFT_PERCPU_PTR((ptr), per_cpu_offset((cpu))); \
 })
 
+#ifndef arch_raw_cpu_ptr_preemption_disabled
+#define arch_raw_cpu_ptr_preemption_disabled(ptr)  \
+   arch_raw_cpu_ptr(ptr)
+#endif
+
+#define raw_cpu_ptr_preemption_disabled(ptr)   \
+({ \
+   __verify_pcpu_ptr(ptr); \
+   arch_raw_cpu_ptr_preemption_disabled(ptr);  \
+})
+
+/*
+ * If preemption is enabled, we need to read the pointer atomically on
+ * raw_cpu_ptr(). However if it is disabled, we can use the
+ * raw_cpu_ptr_nopreempt(), which is potentially more efficient. Similarly, we
+ * can use the preemption-disabled version if the kernel is non-preemptable or
+ * if voluntary preemption is used.
+ */
+#ifdef CONFIG_PREEMPT
+
 #define raw_cpu_ptr(ptr)   \
 ({ \
__verify_pcpu_ptr(ptr); \
arch_raw_cpu_ptr(ptr);  \
 })
 
+#else
+
+#define raw_cpu_ptr(ptr)   raw_cpu_ptr_preemption_disabled(ptr)
+
+#endif
+
 #ifdef CONFIG_DEBUG_PREEMPT
+/*
+ * Unlike other this_cpu_* operations, this_cpu_ptr() requires that preemption
+ * will be disabled. In contrast, raw_cpu_ptr() does not require that.
+ */
 #define this_cpu_ptr(ptr)  \
 ({ \
+   __this_cpu_preempt_check("ptr");\
__verify_pcpu_ptr(ptr); \
SHIFT_PERCPU_PTR(ptr, my_cpu_offset);   \
 })
 #else
-#define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
+#define this_cpu_ptr(ptr) raw_cpu_ptr_preemption_disabled(ptr)
 #endif
 
 #else  /* CONFIG_SMP */
-- 
2.17.1



[PATCH 1/7] compiler: Report x86 segment support

2019-08-24 Thread Nadav Amit
GCC v6+ supports x86 segment qualifiers (__seg_gs and __seg_fs). Define
COMPILER_HAS_X86_SEG_SUPPORT when it is supported.

Signed-off-by: Nadav Amit 
---
 include/linux/compiler-gcc.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
index d7ee4c6bad48..5967590a18c6 100644
--- a/include/linux/compiler-gcc.h
+++ b/include/linux/compiler-gcc.h
@@ -149,6 +149,10 @@
 #define COMPILER_HAS_GENERIC_BUILTIN_OVERFLOW 1
 #endif
 
+#if GCC_VERSION >= 6
+#define COMPILER_HAS_X86_SEG_SUPPORT 1
+#endif
+
 /*
  * Turn individual warnings and errors on and off locally, depending
  * on version.
-- 
2.17.1
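
For context, the segment qualifiers this patch tests for let plain C
expressions generate %gs- or %fs-relative accesses without inline assembly.
A minimal illustration, independent of the kernel's percpu machinery (the
type and function names below are made up):

	#ifdef __SEG_GS		/* defined by gcc when __seg_gs is usable */
	typedef __seg_gs int gs_int;

	static inline int read_gs_counter(gs_int *p)
	{
		/* Compiles to a single mov with a %gs: segment prefix. */
		return *p;
	}
	#endif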



[PATCH 4/7] x86: Fix possible caching of current_task

2019-08-24 Thread Nadav Amit
this_cpu_read_stable() is allowed and supposed to cache and return the
same value, specifically for current_task. It actually does not cache
current_task very well, which hinders possible invalid caching when the
task is switched in __switch_to().

Fix the possible caching by avoiding the use of current in
__switch_to()'s dynamic extent.

Signed-off-by: Nadav Amit 
---
 arch/x86/include/asm/fpu/internal.h|  7 ---
 arch/x86/include/asm/resctrl_sched.h   | 14 +++---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  4 ++--
 arch/x86/kernel/process_32.c   |  4 ++--
 arch/x86/kernel/process_64.c   |  4 ++--
 5 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h 
b/arch/x86/include/asm/fpu/internal.h
index 4c95c365058a..b537788600fe 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -588,9 +588,10 @@ static inline void switch_fpu_prepare(struct fpu *old_fpu, 
int cpu)
 
 /*
  * Load PKRU from the FPU context if available. Delay loading of the
- * complete FPU state until the return to userland.
+ * complete FPU state until the return to userland. Avoid reading current 
during
+ * switch.
  */
-static inline void switch_fpu_finish(struct fpu *new_fpu)
+static inline void switch_fpu_finish(struct task_struct *task, struct fpu 
*new_fpu)
 {
u32 pkru_val = init_pkru_value;
struct pkru_state *pk;
@@ -598,7 +599,7 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
if (!static_cpu_has(X86_FEATURE_FPU))
return;
 
-   set_thread_flag(TIF_NEED_FPU_LOAD);
+   set_ti_thread_flag(task_thread_info(task), TIF_NEED_FPU_LOAD);
 
if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
return;
diff --git a/arch/x86/include/asm/resctrl_sched.h 
b/arch/x86/include/asm/resctrl_sched.h
index f6b7fe2833cc..9a00d9df9d02 100644
--- a/arch/x86/include/asm/resctrl_sched.h
+++ b/arch/x86/include/asm/resctrl_sched.h
@@ -51,7 +51,7 @@ DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
  *   simple as possible.
  * Must be called with preemption disabled.
  */
-static void __resctrl_sched_in(void)
+static void __resctrl_sched_in(struct task_struct *task)
 {
	struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
u32 closid = state->default_closid;
@@ -62,13 +62,13 @@ static void __resctrl_sched_in(void)
 * Else use the closid/rmid assigned to this cpu.
 */
	if (static_branch_likely(&rdt_alloc_enable_key)) {
-   if (current->closid)
-   closid = current->closid;
+   if (task->closid)
+   closid = task->closid;
}
 
	if (static_branch_likely(&rdt_mon_enable_key)) {
-   if (current->rmid)
-   rmid = current->rmid;
+   if (task->rmid)
+   rmid = task->rmid;
}
 
if (closid != state->cur_closid || rmid != state->cur_rmid) {
@@ -78,15 +78,15 @@ static void __resctrl_sched_in(void)
}
 }
 
-static inline void resctrl_sched_in(void)
+static inline void resctrl_sched_in(struct task_struct *task)
 {
	if (static_branch_likely(&rdt_enable_key))
-   __resctrl_sched_in();
+   __resctrl_sched_in(task);
 }
 
 #else
 
-static inline void resctrl_sched_in(void) {}
+static inline void resctrl_sched_in(struct task_struct *task) {}
 
 #endif /* CONFIG_X86_CPU_RESCTRL */
 
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c 
b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index a46dee8e78db..5fcf56cbf438 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -311,7 +311,7 @@ static void update_cpu_closid_rmid(void *info)
 * executing task might have its own closid selected. Just reuse
 * the context switch code.
 */
-   resctrl_sched_in();
+   resctrl_sched_in(current);
 }
 
 /*
@@ -536,7 +536,7 @@ static void move_myself(struct callback_head *head)
 
preempt_disable();
/* update PQR_ASSOC MSR to make resource group go into effect */
-   resctrl_sched_in();
+   resctrl_sched_in(current);
preempt_enable();
 
kfree(callback);
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index b8ceec4974fe..699a4c95ab13 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -292,10 +292,10 @@ __switch_to(struct task_struct *prev_p, struct 
task_struct *next_p)
 
this_cpu_write(current_task, next_p);
 
-   switch_fpu_finish(next_fpu);
+   switch_fpu_finish(next_p, next_fpu);
 
/* Load the Intel cache allocation PQR MSR. */
-   resctrl_sched_in();
+   resctrl_sched_in(next_p);
 
return prev_p;
 }
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index af64519b2695..bb811808936d 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -565,7 +565,7 @@ __switch

[PATCH 7/7] x86/current: Aggressive caching of current

2019-08-24 Thread Nadav Amit
The current_task is supposed to be constant in each thread and therefore
does not need to be reread. There is already an attempt to cache it
using inline assembly, using this_cpu_read_stable(), which hides the
dependency on the read memory address.

However, this caching is not working very well. For example,
sync_mm_rss() still reads current_task twice for no reason.

Allow more aggressive caching by aliasing current_task to a constant,
const_current_task, and reading from the constant copy. Doing so requires
the compiler to support x86 segment qualifiers. Hide const_current_task in
a different compilation unit to prevent the compiler from assuming that the
value is constant at compile time.
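
As an illustration (charge_two_counters() is a hypothetical helper, not from
this patch): with the constant alias, both uses of current below can be
satisfied by a single %gs-relative load, whereas this_cpu_read_stable() may
leave two reads of current_task in place.

static void charge_two_counters(long file_pages, long anon_pages)
{
	add_mm_counter(current->mm, MM_FILEPAGES, file_pages);
	add_mm_counter(current->mm, MM_ANONPAGES, anon_pages);
}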

Signed-off-by: Nadav Amit 
---
 arch/x86/include/asm/current.h | 30 ++
 arch/x86/kernel/cpu/Makefile   |  1 +
 arch/x86/kernel/cpu/common.c   |  7 +--
 arch/x86/kernel/cpu/current.c  | 16 
 include/linux/compiler.h   |  2 +-
 5 files changed, 49 insertions(+), 7 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/current.c

diff --git a/arch/x86/include/asm/current.h b/arch/x86/include/asm/current.h
index 3e204e6140b5..7f093e81a647 100644
--- a/arch/x86/include/asm/current.h
+++ b/arch/x86/include/asm/current.h
@@ -10,11 +10,41 @@ struct task_struct;
 
 DECLARE_PER_CPU(struct task_struct *, current_task);
 
+#if USE_X86_SEG_SUPPORT
+
+/*
+ * Hold a constant alias of current_task, which allows the compiler to cache
+ * its value and avoid re-reading the current task.
+ *
+ * We must mark const_current_task with the segment qualifiers, as otherwise
+ * gcc would do redundant reads of const_current_task.
+ */
+DECLARE_PER_CPU(struct task_struct * const __percpu_seg_override, 
const_current_task);
+
+static __always_inline struct task_struct *get_current(void)
+{
+
+   /*
+* GCC lacks the functionality to remove segment qualifiers, which
+* interferes with the per-cpu infrastructure that holds local copies.
+* Use __raw_cpu_read() to avoid holding any copy.
+*/
+   return __raw_cpu_read(, const_current_task);
+}
+
+#else /* USE_X86_SEG_SUPPORT */
+
+/*
+ * Without segment qualifier support, the per-cpu infrastructure is not
+ * suitable for reading constants, so use this_cpu_read_stable() in this case.
+ */
 static __always_inline struct task_struct *get_current(void)
 {
return this_cpu_read_stable(current_task);
 }
 
+#endif /* USE_X86_SEG_SUPPORT */
+
 #define current get_current()
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index d7a1e5a9331c..d816f03a37d7 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -19,6 +19,7 @@ CFLAGS_common.o   := $(nostackp)
 
 obj-y  := cacheinfo.o scattered.o topology.o
 obj-y  += common.o
+obj-y  += current.o
 obj-y  += rdrand.o
 obj-y  += match.o
 obj-y  += bugs.o
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 5cc2d51cc25e..5f7c9ee57802 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1619,13 +1619,8 @@ DEFINE_PER_CPU_FIRST(struct fixed_percpu_data,
 EXPORT_PER_CPU_SYMBOL_GPL(fixed_percpu_data);
 
 /*
- * The following percpu variables are hot.  Align current_task to
- * cacheline size such that they fall in the same cacheline.
+ * The following percpu variables are hot.
  */
-DEFINE_PER_CPU(struct task_struct *, current_task) cacheline_aligned =
-	&init_task;
-EXPORT_PER_CPU_SYMBOL(current_task);
-
 DEFINE_PER_CPU(struct irq_stack *, hardirq_stack_ptr);
 DEFINE_PER_CPU(unsigned int, irq_count) __visible = -1;
 
diff --git a/arch/x86/kernel/cpu/current.c b/arch/x86/kernel/cpu/current.c
new file mode 100644
index ..3238c6e34984
--- /dev/null
+++ b/arch/x86/kernel/cpu/current.c
@@ -0,0 +1,16 @@
+#include 
+#include 
+
+/*
+ * Align current_task to cacheline size such that they fall in the same
+ * cacheline.
+ */
+DEFINE_PER_CPU(struct task_struct *, current_task) cacheline_aligned =
+	&init_task;
+EXPORT_PER_CPU_SYMBOL(current_task);
+
+#if USE_X86_SEG_SUPPORT
+DECLARE_PER_CPU(struct task_struct * const __percpu_seg_override, 
const_current_task)
+   __attribute__((alias("current_task")));
+EXPORT_PER_CPU_SYMBOL(const_current_task);
+#endif
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index f0fd5636fddb..1b6ee9ab6373 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -299,7 +299,7 @@ unsigned long read_word_at_a_time(const void *addr)
  */
 #define __ADDRESSABLE(sym) \
static void * __section(".discard.addressable") __used \
-   __PASTE(__addressable_##sym, __LINE__) = (void *)
+   __PASTE(__addressable_##sym, __LINE__) = (void 
*)(uintptr_t)
 
 /**
  * offset_to_ptr - convert a relative memory offset to an absolute pointer
-- 
2.17.1



[PATCH 0/7] x86/percpu: Use segment qualifiers

2019-08-24 Thread Nadav Amit
GCC 6+ supports segment qualifiers. Using them allows implementing
several optimizations:

1. Avoid unnecessary instructions when an operation is carried out on a
per-cpu value that is read or written, and instead allow the compiler to
emit instructions that access the per-cpu value directly.

2. Make this_cpu_ptr() more efficient and allow its value to be cached,
since preemption must be disabled when this_cpu_ptr() is used.

3. Provide a better alternative to this_cpu_read_stable() that caches
values more efficiently, using an alias attribute on a const variable.

4. Allow the compiler to perform other optimizations (e.g. CSE); see the
sketch after this list.

5. Use RIP-relative addressing in this_cpu_read_stable(), which makes it
PIE-ready.
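
A sketch of the CSE benefit mentioned in item 4 (illustrative only;
sanity_check_cpu() is a hypothetical helper, cpu_number is the existing
per-cpu variable): with the asm-based accessors each this_cpu_read() is an
opaque asm block and both reads below stay separate, whereas a __seg_gs
qualified access lets the compiler merge them into one %gs-prefixed load.

static bool sanity_check_cpu(int cpu)
{
	return this_cpu_read(cpu_number) == cpu &&
	       this_cpu_read(cpu_number) < nr_cpu_ids;
}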

"size" and Peter's compare do not seem to show the impact on code size
reduction correctly. Summing the code size according to nm on defconfig
shows a minor reduction from 11451310 to 11451310 (0.09%).

RFC->v1:
 * Fixing i386 build bug
 * Moving chunk to the right place [Peter]

Nadav Amit (7):
  compiler: Report x86 segment support
  x86/percpu: Use compiler segment prefix qualifier
  x86/percpu: Use C for percpu accesses when possible
  x86: Fix possible caching of current_task
  percpu: Assume preemption is disabled on per_cpu_ptr()
  x86/percpu: Optimized arch_raw_cpu_ptr()
  x86/current: Aggressive caching of current

 arch/x86/include/asm/current.h |  30 +++
 arch/x86/include/asm/fpu/internal.h|   7 +-
 arch/x86/include/asm/percpu.h  | 293 +++--
 arch/x86/include/asm/preempt.h |   3 +-
 arch/x86/include/asm/resctrl_sched.h   |  14 +-
 arch/x86/kernel/cpu/Makefile   |   1 +
 arch/x86/kernel/cpu/common.c   |   7 +-
 arch/x86/kernel/cpu/current.c  |  16 ++
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |   4 +-
 arch/x86/kernel/process_32.c   |   4 +-
 arch/x86/kernel/process_64.c   |   4 +-
 include/asm-generic/percpu.h   |  12 +
 include/linux/compiler-gcc.h   |   4 +
 include/linux/compiler.h   |   2 +-
 include/linux/percpu-defs.h|  33 ++-
 15 files changed, 346 insertions(+), 88 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/current.c

-- 
2.17.1



[PATCH 2/7] x86/percpu: Use compiler segment prefix qualifier

2019-08-24 Thread Nadav Amit
Using a segment prefix qualifier is cleaner than using a segment prefix
in the inline assembly, and provides the compiler with more information,
telling it that __seg_gs:[addr] is different than [addr] when it
analyzes data dependencies. It also enables various optimizations that
will be implemented in the next patches.

Use segment prefix qualifiers when they are supported. Unfortunately,
gcc does not provide a way to remove segment qualifiers, which is needed
in order to use typeof() to create local instances of a per-cpu variable.
For this reason, do not qualify the per-cpu variables themselves; instead,
cast to the segment-qualified type at the access site.
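
A minimal sketch of the casting approach (my_cpu_read() is a hypothetical
name; the patch itself uses __my_cpu_ptr()/__my_cpu_var()): the per-cpu
variable stays unqualified, so typeof() still yields a plain type for local
copies, and the qualifier is applied only at the access site.

#define my_cpu_read(var)					\
	(*(volatile typeof(var) __seg_gs *)(uintptr_t)&(var))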

Signed-off-by: Nadav Amit 
---
 arch/x86/include/asm/percpu.h  | 153 ++---
 arch/x86/include/asm/preempt.h |   3 +-
 2 files changed, 104 insertions(+), 52 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 2278797c769d..1fe348884477 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -45,8 +45,33 @@
 #include 
 #include 
 
+#if defined(COMPILER_HAS_X86_SEG_SUPPORT) && IS_ENABLED(CONFIG_SMP)
+#define USE_X86_SEG_SUPPORT	1
+#else
+#define USE_X86_SEG_SUPPORT	0
+#endif
+
 #ifdef CONFIG_SMP
+
+#if USE_X86_SEG_SUPPORT
+
+#ifdef CONFIG_X86_64
+#define __percpu_seg_override  __seg_gs
+#else
+#define __percpu_seg_override  __seg_fs
+#endif
+
+#define __percpu_prefix""
+#define __force_percpu_prefix  "%%"__stringify(__percpu_seg)":"
+
+#else /* USE_X86_SEG_SUPPORT */
+
+#define __percpu_seg_override
 #define __percpu_prefix"%%"__stringify(__percpu_seg)":"
+#define __force_percpu_prefix  __percpu_prefix
+
+#endif /* USE_X86_SEG_SUPPORT */
+
 #define __my_cpu_offsetthis_cpu_read(this_cpu_off)
 
 /*
@@ -58,14 +83,21 @@
unsigned long tcp_ptr__;\
asm volatile("add " __percpu_arg(1) ", %0"  \
 : "=r" (tcp_ptr__) \
-: "m" (this_cpu_off), "0" (ptr));  \
+: "m" (__my_cpu_var(this_cpu_off)),\
+  "0" (ptr));  \
(typeof(*(ptr)) __kernel __force *)tcp_ptr__;   \
 })
-#else
+#else /* CONFIG_SMP */
+#define __percpu_seg_override
 #define __percpu_prefix""
-#endif
+#define __force_percpu_prefix  ""
+#endif /* CONFIG_SMP */
 
+#define __my_cpu_type(var) typeof(var) __percpu_seg_override
+#define __my_cpu_ptr(ptr)  (__my_cpu_type(*ptr) *)(uintptr_t)(ptr)
+#define __my_cpu_var(var)  (*__my_cpu_ptr())
 #define __percpu_arg(x)__percpu_prefix "%" #x
+#define __force_percpu_arg(x)  __force_percpu_prefix "%" #x
 
 /*
  * Initialized pointers to per-cpu variables needed for the boot
@@ -98,22 +130,22 @@ do {   
\
switch (sizeof(var)) {  \
case 1: \
asm qual (op "b %1,"__percpu_arg(0) \
-   : "+m" (var)\
+   : "+m" (__my_cpu_var(var))  \
: "qi" ((pto_T__)(val)));   \
break;  \
case 2: \
asm qual (op "w %1,"__percpu_arg(0) \
-   : "+m" (var)\
+   : "+m" (__my_cpu_var(var))  \
: "ri" ((pto_T__)(val)));   \
break;  \
case 4: \
asm qual (op "l %1,"__percpu_arg(0) \
-   : "+m" (var)\
+   : "+m" (__my_cpu_var(var))  \
: "ri" ((pto_T__)(val)));   \
break;  \
case 8: \
asm qual (op "q %1,"__percpu_arg(0) \
-   : "+m" (var)\
+   : "+m" (__my_cpu_var(var))  \
: "re" ((pto_T__)(val)));   \
break;  \
default: __bad_percpu_size();   \
@@ -138,71 +170,79 @@ do {  
\
switch (sizeof(var)) {  \
case 1: \
   

[PATCH 6/7] x86/percpu: Optimized arch_raw_cpu_ptr()

2019-08-24 Thread Nadav Amit
Implementing arch_raw_cpu_ptr() in C allows the compiler to perform
better optimizations, such as using an appropriate base register to
compute the address instead of emitting an add instruction.

This computation is only beneficial when compiler segment qualifiers are
used. The method cannot be used when the pointer size is greater than the
maximum operand size, as is the case when building vdso32.

Distinguish between the two cases in which preemption is disabled (as
happens when this_cpu_ptr() is used) and enabled (when raw_cpu_ptr() is
used).

This allows optimizations, for instance in rcu_dynticks_eqs_exit(),
the following code:

  mov$0x2bbc0,%rax
  add%gs:0x7ef07570(%rip),%rax  # 0x10358 
  lock xadd %edx,0xd8(%rax)

Turns with this patch into:

  mov%gs:0x7ef08aa5(%rip),%rax  # 0x10358 
  lock xadd %edx,0x2bc58(%rax)

Signed-off-by: Nadav Amit 
---
 arch/x86/include/asm/percpu.h | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 13987f9bc82f..3d82df27d45c 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -73,20 +73,41 @@
 #endif /* USE_X86_SEG_SUPPORT */
 
 #define __my_cpu_offsetthis_cpu_read(this_cpu_off)
+#define __raw_my_cpu_offset__this_cpu_read(this_cpu_off)
+#define __my_cpu_ptr(ptr)  (__my_cpu_type(*ptr) *)(uintptr_t)(ptr)
 
+#if USE_X86_SEG_SUPPORT && !defined(BUILD_VDSO32) && defined(CONFIG_X86_64)
+/*
+ * Efficient implementation for cases in which the compiler supports C 
segments.
+ * Allows the compiler to perform additional optimizations that can save more
+ * instructions.
+ *
+ * This optimized version can only be used if the pointer size equals to native
+ * operand size, which does not happen when vdso32 is used.
+ */
+#define __arch_raw_cpu_ptr_qual(qual, ptr) \
+({ \
+   (qual typeof(*(ptr)) __kernel __force *)((uintptr_t)(ptr) + \
+   __my_cpu_offset);   \
+})
+#else /* USE_X86_SEG_SUPPORT && !defined(BUILD_VDSO32) && 
defined(CONFIG_X86_64) */
 /*
  * Compared to the generic __my_cpu_offset version, the following
  * saves one instruction and avoids clobbering a temp register.
  */
-#define arch_raw_cpu_ptr(ptr)  \
+#define __arch_raw_cpu_ptr_qual(qual, ptr) \
 ({ \
unsigned long tcp_ptr__;\
-   asm volatile("add " __percpu_arg(1) ", %0"  \
+   asm qual ("add " __percpu_arg(1) ", %0" \
 : "=r" (tcp_ptr__) \
 : "m" (__my_cpu_var(this_cpu_off)),\
   "0" (ptr));  \
(typeof(*(ptr)) __kernel __force *)tcp_ptr__;   \
 })
+#endif /* USE_X86_SEG_SUPPORT && !defined(BUILD_VDSO32) && 
defined(CONFIG_X86_64) */
+
+#define arch_raw_cpu_ptr(ptr)				__arch_raw_cpu_ptr_qual(volatile, ptr)
+#define arch_raw_cpu_ptr_preemption_disabled(ptr)	__arch_raw_cpu_ptr_qual( , ptr)
 #else /* CONFIG_SMP */
 #define __percpu_seg_override
 #define __percpu_prefix""
-- 
2.17.1



[PATCH v4 2/9] x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote()

2019-08-24 Thread Nadav Amit
The unification of these two functions allows using them in the updated
SMP infrastructure.

To do so, remove the reason argument from flush_tlb_func_local(), add a
member to struct flush_tlb_info that records which CPU initiated the
flush, and act accordingly. Optimize the size of flush_tlb_info while we
are at it.

Unfortunately, this prevents us from using a constant flush_tlb_info for
arch_tlbbatch_flush(), but at a later stage we will inline
flush_tlb_info into the IPI data, so it should not have an impact
eventually.

Reviewed-by: Dave Hansen 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Cc: Andy Lutomirski 
Cc: Josh Poimboeuf 
Signed-off-by: Nadav Amit 
---
 arch/x86/include/asm/tlbflush.h |  5 +-
 arch/x86/mm/tlb.c   | 85 +++--
 2 files changed, 41 insertions(+), 49 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 6f66d841262d..2f6e9be163ae 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -559,8 +559,9 @@ struct flush_tlb_info {
unsigned long   start;
unsigned long   end;
u64 new_tlb_gen;
-   unsigned intstride_shift;
-   boolfreed_tables;
+   unsigned intinitiating_cpu;
+   u8  stride_shift;
+   u8  freed_tables;
 };
 
 #define local_flush_tlb() __flush_tlb()
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index e6a9edc5baaf..2674f55ed9a1 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -292,7 +292,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
 * NB: leave_mm() calls us with prev == NULL and tsk == NULL.
 */
 
-   /* We don't want flush_tlb_func_* to run concurrently with us. */
+   /* We don't want flush_tlb_func() to run concurrently with us. */
if (IS_ENABLED(CONFIG_PROVE_LOCKING))
WARN_ON_ONCE(!irqs_disabled());
 
@@ -512,14 +512,13 @@ void initialize_tlbstate_and_flush(void)
 }
 
 /*
- * flush_tlb_func_common()'s memory ordering requirement is that any
+ * flush_tlb_func()'s memory ordering requirement is that any
  * TLB fills that happen after we flush the TLB are ordered after we
  * read active_mm's tlb_gen.  We don't need any explicit barriers
  * because all x86 flush operations are serializing and the
  * atomic64_read operation won't be reordered by the compiler.
  */
-static void flush_tlb_func_common(const struct flush_tlb_info *f,
- bool local, enum tlb_flush_reason reason)
+static void flush_tlb_func(void *info)
 {
/*
 * We have three different tlb_gen values in here.  They are:
@@ -530,14 +529,26 @@ static void flush_tlb_func_common(const struct 
flush_tlb_info *f,
 * - f->new_tlb_gen: the generation that the requester of the flush
 *   wants us to catch up to.
 */
+   const struct flush_tlb_info *f = info;
struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
u64 mm_tlb_gen = atomic64_read(_mm->context.tlb_gen);
u64 local_tlb_gen = 
this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+   bool local = smp_processor_id() == f->initiating_cpu;
+   unsigned long nr_invalidate = 0;
 
/* This code cannot presently handle being reentered. */
VM_WARN_ON(!irqs_disabled());
 
+   if (!local) {
+   inc_irq_stat(irq_tlb_count);
+   count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+
+   /* Can only happen on remote CPUs */
+   if (f->mm && f->mm != loaded_mm)
+   return;
+   }
+
	if (unlikely(loaded_mm == &init_mm))
return;
 
@@ -565,8 +576,7 @@ static void flush_tlb_func_common(const struct 
flush_tlb_info *f,
 * be handled can catch us all the way up, leaving no work for
 * the second flush.
 */
-   trace_tlb_flush(reason, 0);
-   return;
+   goto done;
}
 
WARN_ON_ONCE(local_tlb_gen > mm_tlb_gen);
@@ -613,46 +623,34 @@ static void flush_tlb_func_common(const struct 
flush_tlb_info *f,
f->new_tlb_gen == local_tlb_gen + 1 &&
f->new_tlb_gen == mm_tlb_gen) {
/* Partial flush */
-   unsigned long nr_invalidate = (f->end - f->start) >> 
f->stride_shift;
unsigned long addr = f->start;
 
+   nr_invalidate = (f->end - f->start) >> f->stride_shift;
+
while (addr < f->end) {
__flush_tlb_one_user(addr);
addr += 1UL << f->stride_shift;
}
 

[PATCH v4 5/9] x86/mm/tlb: Privatize cpu_tlbstate

2019-08-24 Thread Nadav Amit
cpu_tlbstate is mostly private and only the variable is_lazy is shared.
This causes some false-sharing when TLB flushes are performed.

Break cpu_tlbstate into cpu_tlbstate and cpu_tlbstate_shared, and mark
each one accordingly.

Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Reviewed-by: Dave Hansen 
Signed-off-by: Nadav Amit 
---
 arch/x86/include/asm/tlbflush.h | 39 ++---
 arch/x86/mm/init.c  |  2 +-
 arch/x86/mm/tlb.c   | 15 -
 3 files changed, 31 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 559195f79c2f..1f88ea410ff3 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -178,23 +178,6 @@ struct tlb_state {
u16 loaded_mm_asid;
u16 next_asid;
 
-   /*
-* We can be in one of several states:
-*
-*  - Actively using an mm.  Our CPU's bit will be set in
-*mm_cpumask(loaded_mm) and is_lazy == false;
-*
-*  - Not using a real mm.  loaded_mm == &init_mm.  Our CPU's bit
-*will not be set in mm_cpumask(&init_mm) and is_lazy == false.
-*
-*  - Lazily using a real mm.  loaded_mm != &init_mm, our bit
-*is set in mm_cpumask(loaded_mm), but is_lazy == true.
-*We're heuristically guessing that the CR3 load we
-*skipped more than makes up for the overhead added by
-*lazy mode.
-*/
-   bool is_lazy;
-
/*
 * If set we changed the page tables in such a way that we
 * needed an invalidation of all contexts (aka. PCIDs / ASIDs).
@@ -240,7 +223,27 @@ struct tlb_state {
 */
struct tlb_context ctxs[TLB_NR_DYN_ASIDS];
 };
-DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);
+DECLARE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate);
+
+struct tlb_state_shared {
+   /*
+* We can be in one of several states:
+*
+*  - Actively using an mm.  Our CPU's bit will be set in
+*mm_cpumask(loaded_mm) and is_lazy == false;
+*
+*  - Not using a real mm.  loaded_mm == &init_mm.  Our CPU's bit
+*will not be set in mm_cpumask(&init_mm) and is_lazy == false.
+*
+*  - Lazily using a real mm.  loaded_mm != &init_mm, our bit
+*is set in mm_cpumask(loaded_mm), but is_lazy == true.
+*We're heuristically guessing that the CR3 load we
+*skipped more than makes up for the overhead added by
+*lazy mode.
+*/
+   bool is_lazy;
+};
+DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared);
 
 /*
  * Blindly accessing user memory from NMI context can be dangerous
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index fd10d91a6115..34027f36a944 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -951,7 +951,7 @@ void __init zone_sizes_init(void)
free_area_init_nodes(max_zone_pfns);
 }
 
-__visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
+__visible DEFINE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate) = {
	.loaded_mm = &init_mm,
.next_asid = 1,
.cr4 = ~0UL,/* fail hard if we screw up cr4 shadow initialization */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5376a5447bd0..24c9839e3d9b 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -145,7 +145,7 @@ void leave_mm(int cpu)
return;
 
/* Warn if we're not lazy. */
-   WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy));
+   WARN_ON(!this_cpu_read(cpu_tlbstate_shared.is_lazy));
 
	switch_mm(NULL, &init_mm, NULL);
 }
@@ -277,7 +277,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
 {
struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
-   bool was_lazy = this_cpu_read(cpu_tlbstate.is_lazy);
+   bool was_lazy = this_cpu_read(cpu_tlbstate_shared.is_lazy);
unsigned cpu = smp_processor_id();
u64 next_tlb_gen;
bool need_flush;
@@ -322,7 +322,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
__flush_tlb_all();
}
 #endif
-   this_cpu_write(cpu_tlbstate.is_lazy, false);
+   this_cpu_write(cpu_tlbstate_shared.is_lazy, false);
 
/*
 * The membarrier system call requires a full memory barrier and
@@ -463,7 +463,7 @@ void enter_lazy_tlb(struct mm_struct *mm, struct 
task_struct *tsk)
	if (this_cpu_read(cpu_tlbstate.loaded_mm) == &init_mm)
return;
 
-   this_cpu_write(cpu_tlbstate.is_lazy, true);
+   this_cpu_write(cpu_tlbstate_shared.is_lazy, true);
 }
 
 /*
@@ -555,7 +555,7 @@ static void flush_tlb_func(void *info)
VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
   loaded_mm->context.ctx

[PATCH v4 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-08-24 Thread Nadav Amit
To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. Introduce
paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
and hyper-v are only compile-tested).

While the updated smp infrastructure is capable of running a function on
a single local core, it is not optimized for this case. The multiple
function calls and the indirect branch introduce some overhead, and
might make local TLB flushes slower than they were before the recent
changes.

Before calling the SMP infrastructure, check whether only a local TLB flush
is needed, in order to restore the lost performance in this common case. This
requires checking mm_cpumask() one more time, but unless this mask is
updated very frequently, it should not impact performance negatively.
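
Roughly, the added check looks like the following sketch (simplified and
illustrative, not the exact hunk):

	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) < nr_cpu_ids) {
		flush_tlb_multi(mm_cpumask(mm), info);
	} else if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
		/* Only this CPU uses the mm: flush locally, send no IPIs. */
		lockdep_assert_irqs_enabled();
		local_irq_disable();
		flush_tlb_func((void *)info);
		local_irq_enable();
	}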

Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Sasha Levin 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: Juergen Gross 
Cc: Paolo Bonzini 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Boris Ostrovsky 
Cc: linux-hyp...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: k...@vger.kernel.org
Cc: xen-de...@lists.xenproject.org
Reviewed-by: Michael Kelley  # Hyper-v parts
Reviewed-by: Juergen Gross  # Xen and paravirt parts
Reviewed-by: Dave Hansen 
Signed-off-by: Nadav Amit 
---
 arch/x86/hyperv/mmu.c | 10 +++---
 arch/x86/include/asm/paravirt.h   |  6 ++--
 arch/x86/include/asm/paravirt_types.h |  4 +--
 arch/x86/include/asm/tlbflush.h   |  8 ++---
 arch/x86/include/asm/trace/hyperv.h   |  2 +-
 arch/x86/kernel/kvm.c | 11 +--
 arch/x86/kernel/paravirt.c|  2 +-
 arch/x86/mm/tlb.c | 45 ++-
 arch/x86/xen/mmu_pv.c | 11 +++
 include/trace/events/xen.h|  2 +-
 10 files changed, 61 insertions(+), 40 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index e65d7fe6489f..8740d8b21db3 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -50,8 +50,8 @@ static inline int fill_gva_list(u64 gva_list[], int offset,
return gva_n - offset;
 }
 
-static void hyperv_flush_tlb_others(const struct cpumask *cpus,
-   const struct flush_tlb_info *info)
+static void hyperv_flush_tlb_multi(const struct cpumask *cpus,
+  const struct flush_tlb_info *info)
 {
int cpu, vcpu, gva_n, max_gvas;
struct hv_tlb_flush **flush_pcpu;
@@ -59,7 +59,7 @@ static void hyperv_flush_tlb_others(const struct cpumask 
*cpus,
u64 status = U64_MAX;
unsigned long flags;
 
-   trace_hyperv_mmu_flush_tlb_others(cpus, info);
+   trace_hyperv_mmu_flush_tlb_multi(cpus, info);
 
if (!hv_hypercall_pg)
goto do_native;
@@ -156,7 +156,7 @@ static void hyperv_flush_tlb_others(const struct cpumask 
*cpus,
if (!(status & HV_HYPERCALL_RESULT_MASK))
return;
 do_native:
-   native_flush_tlb_others(cpus, info);
+   native_flush_tlb_multi(cpus, info);
 }
 
 static u64 hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
@@ -231,6 +231,6 @@ void hyperv_setup_mmu_ops(void)
return;
 
pr_info("Using hypercall for remote TLB flush\n");
-   pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
+   pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 69089d46f128..bc4829c9b3f9 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -62,10 +62,10 @@ static inline void __flush_tlb_one_user(unsigned long addr)
PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
 }
 
-static inline void flush_tlb_others(const struct cpumask *cpumask,
-   const struct flush_tlb_info *info)
+static inline void flush_tlb_multi(const struct cpumask *cpumask,
+  const struct flush_tlb_info *info)
 {
-   PVOP_VCALL2(mmu.flush_tlb_others, cpumask, info);
+   PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
 }
 
 static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb, void 
*table)
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 70b654f3ffe5..63fa751344bf 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -206,8 +206,8 @@ struct pv_mmu_ops {
void (*flush_tlb_user)(void);
void (*flush_tlb_kernel)(void);
void (*flush_tlb_one_user)(unsigned long addr);
-   void (*flush_tlb_others)(const struct cpumask *cpus,
-const struct flush_tlb_info *info);
+   void (*flush_tlb_multi)(co

[PATCH v4 9/9] x86/mm/tlb: Remove unnecessary uses of the inline keyword

2019-08-24 Thread Nadav Amit
The compiler is smart enough without these hints.

Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Suggested-by: Dave Hansen 
Reviewed-by: Dave Hansen 
Signed-off-by: Nadav Amit 
---
 arch/x86/mm/tlb.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 3dca146edcf1..0addc6e84126 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -189,7 +189,7 @@ static void sync_current_stack_to_mm(struct mm_struct *mm)
}
 }
 
-static inline unsigned long mm_mangle_tif_spec_ib(struct task_struct *next)
+static unsigned long mm_mangle_tif_spec_ib(struct task_struct *next)
 {
unsigned long next_tif = task_thread_info(next)->flags;
unsigned long ibpb = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_IBPB;
@@ -741,7 +741,7 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct flush_tlb_info, 
flush_tlb_info);
 static DEFINE_PER_CPU(unsigned int, flush_tlb_info_idx);
 #endif
 
-static inline struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
+static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
unsigned long start, unsigned long end,
unsigned int stride_shift, bool freed_tables,
u64 new_tlb_gen)
@@ -768,7 +768,7 @@ static inline struct flush_tlb_info 
*get_flush_tlb_info(struct mm_struct *mm,
return info;
 }
 
-static inline void put_flush_tlb_info(void)
+static void put_flush_tlb_info(void)
 {
 #ifdef CONFIG_DEBUG_VM
/* Complete reentrency prevention checks */
-- 
2.17.1



[PATCH v4 1/9] smp: Run functions concurrently in smp_call_function_many()

2019-08-24 Thread Nadav Amit
Currently, on_each_cpu() and similar functions do not exploit the
potential of concurrency: the function is first executed remotely and
only then it is executed locally. Functions such as TLB flush can take
considerable time, so this provides an opportunity for performance
optimization.

To do so, introduce __smp_call_function_many(), which allows the callers
to provide local and remote functions that should be executed, and run
them concurrently. Keep the semantics of smp_call_function_many() as they
are today for backward compatibility: in this case the called function is
not executed locally.

__smp_call_function_many() does not use the optimized version for a
single remote target that smp_call_function_single() implements. For
synchronous function call, smp_call_function_single() keeps a
call_single_data (which is used for synchronization) on the stack.
Interestingly, it seems that not using this optimization provides
greater performance improvements (greater speedup with a single remote
target than with multiple ones). Presumably, holding data structures
that are intended for synchronization on the stack can introduce
overheads due to TLB misses and false-sharing when the stack is used for
other purposes.

Adding support to run the functions concurrently required removing a
micro-optimization in on_each_cpu() that disabled/enabled IRQs instead
of saving/restoring them. The benefit of running the local and remote
code concurrently is expected to be greater.
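
For illustration, a sketch (on_each_cpu_mask_sketch() is a hypothetical
wrapper; early-boot details are ignored) of how on_each_cpu_mask() can be
expressed on top of the new interface:

static inline void on_each_cpu_mask_sketch(const struct cpumask *mask,
					   smp_call_func_t func, void *info,
					   bool wait)
{
	unsigned int scf_flags = SCF_RUN_LOCAL;

	if (wait)
		scf_flags |= SCF_WAIT;

	preempt_disable();
	__smp_call_function_many(mask, func, info, scf_flags);
	preempt_enable();
}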

Reviewed-by: Dave Hansen 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Cc: Andy Lutomirski 
Cc: Josh Poimboeuf 
Signed-off-by: Nadav Amit 
---
 include/linux/smp.h |  34 ---
 kernel/smp.c| 138 +---
 2 files changed, 92 insertions(+), 80 deletions(-)

diff --git a/include/linux/smp.h b/include/linux/smp.h
index 6fc856c9eda5..d18d54199635 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -32,11 +32,6 @@ extern unsigned int total_cpus;
 int smp_call_function_single(int cpuid, smp_call_func_t func, void *info,
 int wait);
 
-/*
- * Call a function on all processors
- */
-void on_each_cpu(smp_call_func_t func, void *info, int wait);
-
 /*
  * Call a function on processors specified by mask, which might include
  * the local one.
@@ -44,6 +39,17 @@ void on_each_cpu(smp_call_func_t func, void *info, int wait);
 void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
void *info, bool wait);
 
+/*
+ * Call a function on all processors.  May be used during early boot while
+ * early_boot_irqs_disabled is set.
+ */
+static inline void on_each_cpu(smp_call_func_t func, void *info, int wait)
+{
+   preempt_disable();
+   on_each_cpu_mask(cpu_online_mask, func, info, wait);
+   preempt_enable();
+}
+
 /*
  * Call a function on each processor for which the supplied function
  * cond_func returns a positive value. This may include the local
@@ -102,8 +108,22 @@ extern void smp_cpus_done(unsigned int max_cpus);
  * Call a function on all other processors
  */
 void smp_call_function(smp_call_func_t func, void *info, int wait);
-void smp_call_function_many(const struct cpumask *mask,
-   smp_call_func_t func, void *info, bool wait);
+
+/*
+ * Flags to be used as as scf_flags argument of __smp_call_function_many().
+ */
+#define SCF_WAIT   (1U << 0)   /* Wait until function execution 
completed */
+#define SCF_RUN_LOCAL  (1U << 1)   /* Run also locally if local cpu is set 
in cpumask */
+
+void __smp_call_function_many(const struct cpumask *mask,
+ smp_call_func_t func, void *info,
+ unsigned int scf_flags);
+
+static inline void smp_call_function_many(const struct cpumask *mask,
+   smp_call_func_t func, void *info, bool wait)
+{
+   __smp_call_function_many(mask, func, info, wait ? SCF_WAIT : 0);
+}
 
 int smp_call_function_any(const struct cpumask *mask,
  smp_call_func_t func, void *info, int wait);
diff --git a/kernel/smp.c b/kernel/smp.c
index 7dbcb402c2fc..9faec688b34b 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -396,24 +396,28 @@ int smp_call_function_any(const struct cpumask *mask,
 EXPORT_SYMBOL_GPL(smp_call_function_any);
 
 /**
- * smp_call_function_many(): Run a function on a set of other CPUs.
+ * __smp_call_function_many(): Run a function on a set of CPUs.
  * @mask: The set of cpus to run on (only runs on online subset).
  * @func: The function to run. This must be fast and non-blocking.
  * @info: An arbitrary pointer to pass to the function.
- * @wait: If true, wait (atomically) until function has completed
- *on other CPUs.
- *
- * If @wait is true, then returns once @func has returned.
+ * @flags: Bitmask that controls the operation. If %SCF_WAIT is set, wait
+ *(atomically) until function has completed o

[PATCH v4 3/9] x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()

2019-08-24 Thread Nadav Amit
Open-code on_each_cpu_cond_mask() in native_flush_tlb_others() to
optimize the code. Open-coding eliminates the need for the indirect branch
that is used to call is_lazy(), and in CPUs that are vulnerable to
Spectre v2, it eliminates the retpoline. In addition, it allows using a
preallocated cpumask to compute the CPUs that should be flushed.

This would later allow us not to adapt on_each_cpu_cond_mask() to
support local and remote functions.

Note that calling tlb_is_not_lazy() for every CPU that needs to be
flushed, as done in native_flush_tlb_multi(), might look ugly, but it is
equivalent to what is currently done in on_each_cpu_cond_mask().
Actually, native_flush_tlb_multi() does it more efficiently since it
avoids using an indirect branch for that purpose.

Reviewed-by: Dave Hansen 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Cc: Andy Lutomirski 
Cc: Josh Poimboeuf 
Signed-off-by: Nadav Amit 
---
 arch/x86/mm/tlb.c | 40 +---
 1 file changed, 33 insertions(+), 7 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 2674f55ed9a1..c3ca3545d78a 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -653,11 +653,13 @@ static void flush_tlb_func(void *info)
nr_invalidate);
 }
 
-static bool tlb_is_not_lazy(int cpu, void *data)
+static bool tlb_is_not_lazy(int cpu)
 {
return !per_cpu(cpu_tlbstate.is_lazy, cpu);
 }
 
+static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask);
+
 void native_flush_tlb_others(const struct cpumask *cpumask,
 const struct flush_tlb_info *info)
 {
@@ -701,12 +703,36 @@ void native_flush_tlb_others(const struct cpumask 
*cpumask,
 * up on the new contents of what used to be page tables, while
 * doing a speculative memory access.
 */
-   if (info->freed_tables)
-   smp_call_function_many(cpumask, flush_tlb_func,
-  (void *)info, 1);
-   else
-   on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func,
-   (void *)info, 1, GFP_ATOMIC, cpumask);
+   if (info->freed_tables) {
+   smp_call_function_many(cpumask, flush_tlb_func, (void *)info, 
1);
+   } else {
+   /*
+* Although we could have used on_each_cpu_cond_mask(),
+* open-coding it has performance advantages, as it eliminates
+* the need for indirect calls or retpolines. In addition, it
+* allows to use a designated cpumask for evaluating the
+* condition, instead of allocating one.
+*
+* This code works under the assumption that there are no nested
+* TLB flushes, an assumption that is already made in
+* flush_tlb_mm_range().
+*
+* cond_cpumask is logically a stack-local variable, but it is
+* more efficient to have it off the stack and not to allocate
+* it on demand. Preemption is disabled and this code is
+* non-reentrant.
+*/
		struct cpumask *cond_cpumask = this_cpu_ptr(&flush_tlb_mask);
+   int cpu;
+
+   cpumask_clear(cond_cpumask);
+
+   for_each_cpu(cpu, cpumask) {
+   if (tlb_is_not_lazy(cpu))
+   __cpumask_set_cpu(cpu, cond_cpumask);
+   }
+   smp_call_function_many(cond_cpumask, flush_tlb_func, (void 
*)info, 1);
+   }
 }
 
 /*
-- 
2.17.1



[PATCH v4 0/9] x86/tlb: Concurrent TLB flushes

2019-08-24 Thread Nadav Amit
[ Similar cover letter to v3 with updated performance numbers on Skylake.
  Sorry for the time it took since the last version. ]

Currently, local and remote TLB flushes are not performed concurrently,
which introduces unnecessary overhead - each PTE flushing can take 100s
of cycles. This patch-set allows TLB flushes to be run concurrently:
first request the remote CPUs to initiate the flush, then run it
locally, and finally wait for the remote CPUs to finish their work.
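
Conceptually, a shootdown then boils down to a single call (a simplified
sketch; the real code is in the patches themselves) that both kicks the
remote CPUs and runs the flush locally, returning only once all CPUs have
acknowledged:

	__smp_call_function_many(cpumask, flush_tlb_func, (void *)info,
				 SCF_WAIT | SCF_RUN_LOCAL);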

In addition, there are various small optimizations to avoid, for
example, unwarranted false-sharing.

The proposed changes should also improve the performance of other
invocations of on_each_cpu(). Hopefully, no one has relied on the previous
behavior of on_each_cpu(), which invoked the function first remotely and
only then locally [Peter says he remembers someone might do so, but without
further information it is hard to know how to address it].

Running sysbench on dax/ext4 w/emulated-pmem, write-cache disabled on
2-socket, 56-logical-cores (28+SMT) Skylake, 5 repetitions:

sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
 --file-io-mode=mmap --threads=X --file-fsync-mode=fdatasync run

 Th.   tip-aug22 avg (stdev)   +patch-set avg (stdev)  change
 ---   -   --  --
 1  1152920 (7453)  1169469 (9059)  +1.4%
 2  1545832 (12555) 1584172 (10484) +2.4%
 4  2480703 (12039) 2518641 (12875) +1.5%
 8  3684486 (26007) 3840343 (44144) +4.2%
 16 4981567 (23565) 5125756 (15458) +2.8%
 32 5679542 (10116) 5887826 (6121)  +3.6%
 56 5630944 (17937) 5812514 (26264) +3.2%

(Note that on configurations with up to 28 threads numactl was used to
set all threads on socket 1, which explains the drop in performance when
going to 32 threads).

Running the same benchmark with security mitigations disabled (PTI,
Spectre, MDS):

 Th.   tip-aug22 avg (stdev)   +patch-set avg (stdev)  change
 ---   -   --  --
 1  1444119 (8524)  1469606 (10527) +1.7%
 2  1921540 (24169) 1961899 (14450) +2.1%
 4  3073716 (21786) 3199880 (16774) +4.1%
 8  4700698 (49534) 4802312 (11043) +2.1%
 16 6005180 (6366)  6006656 (31624)0%
 32 6826466 (10496) 6886622 (19110) +0.8%
 56 6832344 (13468) 6885586 (20646) +0.8%

The results are somewhat different from those previously obtained on
Haswell-X, and the maximum performance improvement is smaller. However, the
performance improvement is still significant.

v3 -> v4:
  * Merge flush_tlb_func_local and flush_tlb_func_remote() [Peter]
  * Prevent preemption on_each_cpu(). It is not needed, but it prevents
concerns. [Peter/tglx]
  * Adding acked-, review-by tags

v2 -> v3:
* Open-code the remote/local-flush decision code [Andy]
* Fix hyper-v, Xen implementations [Andrew]
* Fix redundant TLB flushes.

v1 -> v2:
* Removing the patches that Thomas took [tglx]
* Adding hyper-v, Xen compile-tested implementations [Dave]
* Removing UV [Andy]
* Adding lazy optimization, removing inline keyword [Dave]
* Restructuring patch-set

RFCv2 -> v1:
* Fix comment on flush_tlb_multi [Juergen]
* Removing async invalidation optimizations [Andy]
* Adding KVM support [Paolo]

Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Boris Ostrovsky 
Cc: Dave Hansen 
Cc: Haiyang Zhang 
Cc: Ingo Molnar 
Cc: Josh Poimboeuf 
Cc: Juergen Gross 
Cc: "K. Y. Srinivasan" 
Cc: Paolo Bonzini 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Sasha Levin 
Cc: Stephen Hemminger 
Cc: Thomas Gleixner 
Cc: k...@vger.kernel.org
Cc: linux-hyp...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: x...@kernel.org
Cc: xen-de...@lists.xenproject.org

Nadav Amit (9):
  smp: Run functions concurrently in smp_call_function_many()
  x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote()
  x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
  x86/mm/tlb: Flush remote and local TLBs concurrently
  x86/mm/tlb: Privatize cpu_tlbstate
  x86/mm/tlb: Do not make is_lazy dirty for no reason
  cpumask: Mark functions as pure
  x86/mm/tlb: Remove UV special case
  x86/mm/tlb: Remove unnecessary uses of the inline keyword

 arch/x86/hyperv/mmu.c |  10 +-
 arch/x86/include/asm/paravirt.h   |   6 +-
 arch/x86/include/asm/paravirt_types.h |   4 +-
 arch/x86/include/asm/tlbflush.h   |  52 +++
 arch/x86/include/asm/trace/hyperv.h   |   2 +-
 arch/x86/kernel/kvm.c |  11 +-
 arch/x86/kernel/paravirt.c|   2 +-
 arch/x86/mm/init.c|   2 +-
 arch/x86/mm/tlb.c | 195 ++
 arch/x86/xen/mmu_pv.c |  11 +-
 i

[PATCH v4 6/9] x86/mm/tlb: Do not make is_lazy dirty for no reason

2019-08-24 Thread Nadav Amit
Blindly writing to is_lazy when the written value is identical to the old
value needlessly dirties the cacheline. Avoid such writes to prevent
unnecessary cache-coherency traffic.

Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Suggested-by: Dave Hansen 
Reviewed-by: Dave Hansen 
Signed-off-by: Nadav Amit 
---
 arch/x86/mm/tlb.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 24c9839e3d9b..1393b3cd3697 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -322,7 +322,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
__flush_tlb_all();
}
 #endif
-   this_cpu_write(cpu_tlbstate_shared.is_lazy, false);
+   if (was_lazy)
+   this_cpu_write(cpu_tlbstate_shared.is_lazy, false);
 
/*
 * The membarrier system call requires a full memory barrier and
-- 
2.17.1



[PATCH v4 7/9] cpumask: Mark functions as pure

2019-08-24 Thread Nadav Amit
cpumask_next_and() and cpumask_any_but() are pure, and marking them as
such seems to generate different and presumably better code for
native_flush_tlb_multi().
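
For illustration (pick_other_twice() is a hypothetical helper): a pure
function's result depends only on its arguments and on global memory that
is not modified between the calls, so the compiler may fold the two
identical calls below into one.

static unsigned int pick_other_twice(const struct cpumask *mask,
				     unsigned int self)
{
	unsigned int a = cpumask_any_but(mask, self);
	unsigned int b = cpumask_any_but(mask, self);	/* may reuse 'a' */

	return a == b ? a : nr_cpu_ids;
}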

Reviewed-by: Dave Hansen 
Signed-off-by: Nadav Amit 
---
 include/linux/cpumask.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index b5a5a1ed9efd..2579700bf44a 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -225,7 +225,7 @@ static inline unsigned int cpumask_last(const struct 
cpumask *srcp)
return find_last_bit(cpumask_bits(srcp), nr_cpumask_bits);
 }
 
-unsigned int cpumask_next(int n, const struct cpumask *srcp);
+unsigned int __pure cpumask_next(int n, const struct cpumask *srcp);
 
 /**
  * cpumask_next_zero - get the next unset cpu in a cpumask
@@ -242,8 +242,8 @@ static inline unsigned int cpumask_next_zero(int n, const 
struct cpumask *srcp)
return find_next_zero_bit(cpumask_bits(srcp), nr_cpumask_bits, n+1);
 }
 
-int cpumask_next_and(int n, const struct cpumask *, const struct cpumask *);
-int cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
+int __pure cpumask_next_and(int n, const struct cpumask *, const struct 
cpumask *);
+int __pure cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
 unsigned int cpumask_local_spread(unsigned int i, int node);
 
 /**
-- 
2.17.1



[PATCH v4 8/9] x86/mm/tlb: Remove UV special case

2019-08-24 Thread Nadav Amit
The SGI UV TLB flush mechanism is outdated and will be replaced with a
compatible smp_call_function_many()-based APIC mechanism in the future.
For now, simplify the code by removing the UV special case.

Cc: Peter Zijlstra 
Suggested-by: Andy Lutomirski 
Acked-by: Mike Travis 
Reviewed-by: Dave Hansen 
Signed-off-by: Nadav Amit 
---
 arch/x86/mm/tlb.c | 25 -
 1 file changed, 25 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 1393b3cd3697..3dca146edcf1 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -679,31 +679,6 @@ void native_flush_tlb_multi(const struct cpumask *cpumask,
trace_tlb_flush(TLB_REMOTE_SEND_IPI,
(info->end - info->start) >> PAGE_SHIFT);
 
-   if (is_uv_system()) {
-   /*
-* This whole special case is confused.  UV has a "Broadcast
-* Assist Unit", which seems to be a fancy way to send IPIs.
-* Back when x86 used an explicit TLB flush IPI, UV was
-* optimized to use its own mechanism.  These days, x86 uses
-* smp_call_function_many(), but UV still uses a manual IPI,
-* and that IPI's action is out of date -- it does a manual
-* flush instead of calling flush_tlb_func().  This
-* means that the percpu tlb_gen variables won't be updated
-* and we'll do pointless flushes on future context switches.
-*
-* Rather than hooking native_flush_tlb_multi() here, I think
-* that UV should be updated so that smp_call_function_many(),
-* etc, are optimal on UV.
-*/
-   flush_tlb_func((void *)info);
-
-   cpumask = uv_flush_tlb_others(cpumask, info);
-   if (cpumask)
-   smp_call_function_many(cpumask, flush_tlb_func,
-  (void *)info, 1);
-   return;
-   }
-
/*
 * If no page tables were freed, we can skip sending IPIs to
 * CPUs in lazy TLB mode. They will flush the CPU themselves
-- 
2.17.1



Re: [PATCH] KVM: x86: Don't update RIP or do single-step on faulting emulation

2019-08-23 Thread Nadav Amit
> On Aug 23, 2019, at 1:55 PM, Sean Christopherson 
>  wrote:
> 
> Don't advance RIP or inject a single-step #DB if emulation signals a
> fault.  This logic applies to all state updates that are conditional on
> clean retirement of the emulation instruction, e.g. updating RFLAGS was
> previously handled by commit 38827dbd3fb85 ("KVM: x86: Do not update
> EFLAGS on faulting emulation").
> 
> Not advancing RIP is likely a nop, i.e. ctxt->eip isn't updated with
> ctxt->_eip until emulation "retires" anyways.  Skipping #DB injection
> fixes a bug reported by Andy Lutomirski where a #UD on SYSCALL due to
> invalid state with RFLAGS.RF=1 would loop indefinitely due to emulation
> overwriting the #UD with #DB and thus restarting the bad SYSCALL over
> and over.
> 
> Cc: Nadav Amit 
> Cc: sta...@vger.kernel.org
> Reported-by: Andy Lutomirski 
> Fixes: 663f4c61b803 ("KVM: x86: handle singlestep during emulation")
> Signed-off-by: Sean Christopherson 

Seems fine. I guess I should’ve found it before…

Consider running the relevant self-tests (e.g., single_step_syscall) to
avoid regressions.



Re: [PATCH] mm/balloon_compaction: suppress allocation warnings

2019-08-21 Thread Nadav Amit
> On Aug 21, 2019, at 12:13 PM, David Hildenbrand  wrote:
> 
> On 21.08.19 18:34, Nadav Amit wrote:
>>> On Aug 21, 2019, at 9:29 AM, David Hildenbrand  wrote:
>>> 
>>> On 21.08.19 18:23, Nadav Amit wrote:
>>>>> On Aug 21, 2019, at 9:05 AM, David Hildenbrand  wrote:
>>>>> 
>>>>> On 20.08.19 11:16, Nadav Amit wrote:
>>>>>> There is no reason to print warnings when balloon page allocation fails,
>>>>>> as they are expected and can be handled gracefully.  Since VMware
>>>>>> balloon now uses balloon-compaction infrastructure, and suppressed these
>>>>>> warnings before, it is also beneficial to suppress these warnings to
>>>>>> keep the same behavior that the balloon had before.
>>>>> 
>>>>> I am not sure if that's a good idea. The allocation warnings are usually
>>>>> the only trace of "the user/admin did something bad because he/she tried
>>>>> to inflate the balloon to an unsafe value". Believe me, I processed a
>>>>> couple of such bugreports related to virtio-balloon and the warning were
>>>>> very helpful for that.
>>>> 
>>>> Ok, so a message is needed, but does it have to be a generic frightening
>>>> warning?
>>>> 
>>>> How about using __GFP_NOWARN, and if allocation do something like:
>>>> 
>>>> pr_warn(“Balloon memory allocation failed”);
>>>> 
>>>> Or even something more informative? This would surely be less intimidating
>>>> for common users.
>>> 
>>> ratelimit would make sense :)
>>> 
>>> And yes, this would certainly be nicer.
>> 
>> Thanks. I will post v2 of the patch.
> 
> As discussed in v2, we already print a warning in virtio-balloon, so I
> am fine with this patch.
> 
> Reviewed-by: David Hildenbrand 

Michael,

If it is possible to get it to 5.3, to avoid behavioral change for VMware
balloon users, it would be great.

Thanks,
Nadav

Re: [PATCH v2] mm/balloon_compaction: Informative allocation warnings

2019-08-21 Thread Nadav Amit
> On Aug 21, 2019, at 12:12 PM, David Hildenbrand  wrote:
> 
> On 21.08.19 21:10, Nadav Amit wrote:
>>> On Aug 21, 2019, at 12:06 PM, David Hildenbrand  wrote:
>>> 
>>> On 21.08.19 20:59, Nadav Amit wrote:
>>>>> On Aug 21, 2019, at 11:57 AM, David Hildenbrand  wrote:
>>>>> 
>>>>> On 21.08.19 11:41, Nadav Amit wrote:
>>>>>> There is no reason to print generic warnings when balloon memory
>>>>>> allocation fails, as failures are expected and can be handled
>>>>>> gracefully. Since VMware balloon now uses balloon-compaction
>>>>>> infrastructure, and suppressed these warnings before, it is also
>>>>>> beneficial to suppress these warnings to keep the same behavior that the
>>>>>> balloon had before.
>>>>>> 
>>>>>> Since such warnings can still be useful to indicate that the balloon is
>>>>>> over-inflated, print more informative and less frightening warning if
>>>>>> allocation fails instead.
>>>>>> 
>>>>>> Cc: David Hildenbrand 
>>>>>> Cc: Jason Wang 
>>>>>> Signed-off-by: Nadav Amit 
>>>>>> 
>>>>>> ---
>>>>>> 
>>>>>> v1->v2:
>>>>>> * Print informative warnings instead suppressing [David]
>>>>>> ---
>>>>>> mm/balloon_compaction.c | 7 ++-
>>>>>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>>>>> 
>>>>>> diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
>>>>>> index 798275a51887..0c1d1f7689f0 100644
>>>>>> --- a/mm/balloon_compaction.c
>>>>>> +++ b/mm/balloon_compaction.c
>>>>>> @@ -124,7 +124,12 @@ EXPORT_SYMBOL_GPL(balloon_page_list_dequeue);
>>>>>> struct page *balloon_page_alloc(void)
>>>>>> {
>>>>>>  struct page *page = alloc_page(balloon_mapping_gfp_mask() |
>>>>>> -   __GFP_NOMEMALLOC | 
>>>>>> __GFP_NORETRY);
>>>>>> +   __GFP_NOMEMALLOC | __GFP_NORETRY 
>>>>>> |
>>>>>> +   __GFP_NOWARN);
>>>>>> +
>>>>>> +if (!page)
>>>>>> +pr_warn_ratelimited("memory balloon: memory allocation 
>>>>>> failed");
>>>>>> +
>>>>>>  return page;
>>>>>> }
>>>>>> EXPORT_SYMBOL_GPL(balloon_page_alloc);
>>>>> 
>>>>> Not sure if "memory balloon" is the right wording. hmmm.
>>>>> 
>>>>> Acked-by: David Hildenbrand 
>>>> 
>>>> Do you have a better suggestion?
>>> 
>>> Not really - that's why I ack'ed :)
>>> 
>>> However, thinking about it - what about moving the check + print to the
>>> caller and then using dev_warn... or sth. like simple "virtio_balloon:
>>> ..." ? You can then drop the warning for vmware balloon if you feel like
>>> not needing it.
>> 
>> Actually, there is already a warning that is printed by the virtio_balloon
>> in fill_balloon():
>> 
>>struct page *page = balloon_page_alloc();
>> 
>>if (!page) {
>>            dev_info_ratelimited(&vb->vdev->dev,
>> "Out of puff! Can't get %u 
>> pages\n",
>> VIRTIO_BALLOON_PAGES_PER_PAGE);
>>/* Sleep for at least 1/5 of a second before retry. */
>>msleep(200);
>>break;
>>}
>> 
>> So are you ok with going back to v1?
> 
> Whoops, I missed that - sorry - usually the warnings scream louder at me :D
> 
> Yes, v1 is fine with me!

Thanks, I missed this one too. This change should prevent making users
concerned for no good reason.



Re: [PATCH v2] mm/balloon_compaction: Informative allocation warnings

2019-08-21 Thread Nadav Amit
> On Aug 21, 2019, at 12:06 PM, David Hildenbrand  wrote:
> 
> On 21.08.19 20:59, Nadav Amit wrote:
>>> On Aug 21, 2019, at 11:57 AM, David Hildenbrand  wrote:
>>> 
>>> On 21.08.19 11:41, Nadav Amit wrote:
>>>> There is no reason to print generic warnings when balloon memory
>>>> allocation fails, as failures are expected and can be handled
>>>> gracefully. Since VMware balloon now uses balloon-compaction
>>>> infrastructure, and suppressed these warnings before, it is also
>>>> beneficial to suppress these warnings to keep the same behavior that the
>>>> balloon had before.
>>>> 
>>>> Since such warnings can still be useful to indicate that the balloon is
>>>> over-inflated, print more informative and less frightening warning if
>>>> allocation fails instead.
>>>> 
>>>> Cc: David Hildenbrand 
>>>> Cc: Jason Wang 
>>>> Signed-off-by: Nadav Amit 
>>>> 
>>>> ---
>>>> 
>>>> v1->v2:
>>>> * Print informative warnings instead suppressing [David]
>>>> ---
>>>> mm/balloon_compaction.c | 7 ++-
>>>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>>> 
>>>> diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
>>>> index 798275a51887..0c1d1f7689f0 100644
>>>> --- a/mm/balloon_compaction.c
>>>> +++ b/mm/balloon_compaction.c
>>>> @@ -124,7 +124,12 @@ EXPORT_SYMBOL_GPL(balloon_page_list_dequeue);
>>>> struct page *balloon_page_alloc(void)
>>>> {
>>>>struct page *page = alloc_page(balloon_mapping_gfp_mask() |
>>>> - __GFP_NOMEMALLOC | __GFP_NORETRY);
>>>> + __GFP_NOMEMALLOC | __GFP_NORETRY |
>>>> + __GFP_NOWARN);
>>>> +
>>>> +  if (!page)
>>>> +  pr_warn_ratelimited("memory balloon: memory allocation failed");
>>>> +
>>>>return page;
>>>> }
>>>> EXPORT_SYMBOL_GPL(balloon_page_alloc);
>>> 
>>> Not sure if "memory balloon" is the right wording. hmmm.
>>> 
>>> Acked-by: David Hildenbrand 
>> 
>> Do you have a better suggestion?
> 
> Not really - that's why I ack'ed :)
> 
> However, thinking about it - what about moving the check + print to the
> caller and then using dev_warn... or sth. like simple "virtio_balloon:
> ..." ? You can then drop the warning for vmware balloon if you feel like
> not needing it.

Actually, there is already a warning that is printed by the virtio_balloon
in fill_balloon():

struct page *page = balloon_page_alloc();

if (!page) {
                dev_info_ratelimited(&vb->vdev->dev,
 "Out of puff! Can't get %u 
pages\n",
 VIRTIO_BALLOON_PAGES_PER_PAGE);
/* Sleep for at least 1/5 of a second before retry. */
msleep(200);
break;
}

So are you ok with going back to v1?



Re: [PATCH v2] mm/balloon_compaction: Informative allocation warnings

2019-08-21 Thread Nadav Amit
> On Aug 21, 2019, at 11:57 AM, David Hildenbrand  wrote:
> 
> On 21.08.19 11:41, Nadav Amit wrote:
>> There is no reason to print generic warnings when balloon memory
>> allocation fails, as failures are expected and can be handled
>> gracefully. Since VMware balloon now uses balloon-compaction
>> infrastructure, and suppressed these warnings before, it is also
>> beneficial to suppress these warnings to keep the same behavior that the
>> balloon had before.
>> 
>> Since such warnings can still be useful to indicate that the balloon is
>> over-inflated, print a more informative and less frightening warning if
>> allocation fails instead.
>> 
>> Cc: David Hildenbrand 
>> Cc: Jason Wang 
>> Signed-off-by: Nadav Amit 
>> 
>> ---
>> 
>> v1->v2:
>>  * Print informative warnings instead of suppressing [David]
>> ---
>> mm/balloon_compaction.c | 7 ++-
>> 1 file changed, 6 insertions(+), 1 deletion(-)
>> 
>> diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
>> index 798275a51887..0c1d1f7689f0 100644
>> --- a/mm/balloon_compaction.c
>> +++ b/mm/balloon_compaction.c
>> @@ -124,7 +124,12 @@ EXPORT_SYMBOL_GPL(balloon_page_list_dequeue);
>> struct page *balloon_page_alloc(void)
>> {
>>  struct page *page = alloc_page(balloon_mapping_gfp_mask() |
>> -   __GFP_NOMEMALLOC | __GFP_NORETRY);
>> +   __GFP_NOMEMALLOC | __GFP_NORETRY |
>> +   __GFP_NOWARN);
>> +
>> +if (!page)
>> +pr_warn_ratelimited("memory balloon: memory allocation failed\n");
>> +
>>  return page;
>> }
>> EXPORT_SYMBOL_GPL(balloon_page_alloc);
> 
> Not sure if "memory balloon" is the right wording. hmmm.
> 
> Acked-by: David Hildenbrand 

Do you have a better suggestion?



[PATCH v2] mm/balloon_compaction: Informative allocation warnings

2019-08-21 Thread Nadav Amit
There is no reason to print generic warnings when balloon memory
allocation fails, as failures are expected and can be handled
gracefully. Since VMware balloon now uses balloon-compaction
infrastructure, and suppressed these warnings before, it is also
beneficial to suppress these warnings to keep the same behavior that the
balloon had before.

Since such warnings can still be useful to indicate that the balloon is
over-inflated, print a more informative and less frightening warning if
allocation fails instead.

Cc: David Hildenbrand 
Cc: Jason Wang 
Signed-off-by: Nadav Amit 

---

v1->v2:
  * Print informative warnings instead of suppressing [David]
---
 mm/balloon_compaction.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index 798275a51887..0c1d1f7689f0 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -124,7 +124,12 @@ EXPORT_SYMBOL_GPL(balloon_page_list_dequeue);
 struct page *balloon_page_alloc(void)
 {
struct page *page = alloc_page(balloon_mapping_gfp_mask() |
-  __GFP_NOMEMALLOC | __GFP_NORETRY);
+  __GFP_NOMEMALLOC | __GFP_NORETRY |
+  __GFP_NOWARN);
+
+   if (!page)
+   pr_warn_ratelimited("memory balloon: memory allocation failed\n");
+
return page;
 }
 EXPORT_SYMBOL_GPL(balloon_page_alloc);
-- 
2.17.1



Re: [PATCH] mm/balloon_compaction: suppress allocation warnings

2019-08-21 Thread Nadav Amit
> On Aug 21, 2019, at 9:29 AM, David Hildenbrand  wrote:
> 
> On 21.08.19 18:23, Nadav Amit wrote:
>>> On Aug 21, 2019, at 9:05 AM, David Hildenbrand  wrote:
>>> 
>>> On 20.08.19 11:16, Nadav Amit wrote:
>>>> There is no reason to print warnings when balloon page allocation fails,
>>>> as they are expected and can be handled gracefully.  Since VMware
>>>> balloon now uses balloon-compaction infrastructure, and suppressed these
>>>> warnings before, it is also beneficial to suppress these warnings to
>>>> keep the same behavior that the balloon had before.
>>> 
>>> I am not sure if that's a good idea. The allocation warnings are usually
>>> the only trace of "the user/admin did something bad because he/she tried
>>> to inflate the balloon to an unsafe value". Believe me, I processed a
>>> couple of such bug reports related to virtio-balloon and the warnings were
>>> very helpful for that.
>> 
>> Ok, so a message is needed, but does it have to be a generic frightening
>> warning?
>> 
>> How about using __GFP_NOWARN, and if allocation fails do something like:
>> 
>>  pr_warn(“Balloon memory allocation failed”);
>> 
>> Or even something more informative? This would surely be less intimidating
>> for common users.
> 
> ratelimit would make sense :)
> 
> And yes, this would certainly be nicer.

Thanks. I will post v2 of the patch.



Re: [PATCH] mm/balloon_compaction: suppress allocation warnings

2019-08-21 Thread Nadav Amit
> On Aug 21, 2019, at 9:05 AM, David Hildenbrand  wrote:
> 
> On 20.08.19 11:16, Nadav Amit wrote:
>> There is no reason to print warnings when balloon page allocation fails,
>> as they are expected and can be handled gracefully.  Since VMware
>> balloon now uses balloon-compaction infrastructure, and suppressed these
>> warnings before, it is also beneficial to suppress these warnings to
>> keep the same behavior that the balloon had before.
> 
> I am not sure if that's a good idea. The allocation warnings are usually
> the only trace of "the user/admin did something bad because he/she tried
> to inflate the balloon to an unsafe value". Believe me, I processed a
> couple of such bug reports related to virtio-balloon and the warnings were
> very helpful for that.

Ok, so a message is needed, but does it have to be a generic frightening
warning?

How about using __GFP_NOWARN, and if allocation fails do something like:

  pr_warn(“Balloon memory allocation failed”);

Or even something more informative? This would surely be less intimidating
for common users.



[PATCH] VMCI: Release resource if the work is already queued

2019-08-20 Thread Nadav Amit
Francois reported that the VMware balloon gets stuck after a balloon reset
when the VMCI doorbell is removed. A similar error, with the following
splat, can occur when the balloon driver is removed:

[ 1088.622000] INFO: task modprobe:3565 blocked for more than 120 seconds.
[ 1088.622035]   Tainted: GW 5.2.0 #4
[ 1088.622087] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 1088.622205] modprobeD0  3565   1450 0x
[ 1088.622210] Call Trace:
[ 1088.622246]  __schedule+0x2a8/0x690
[ 1088.622248]  schedule+0x2d/0x90
[ 1088.622250]  schedule_timeout+0x1d3/0x2f0
[ 1088.622252]  wait_for_completion+0xba/0x140
[ 1088.622320]  ? wake_up_q+0x80/0x80
[ 1088.622370]  vmci_resource_remove+0xb9/0xc0 [vmw_vmci]
[ 1088.622373]  vmci_doorbell_destroy+0x9e/0xd0 [vmw_vmci]
[ 1088.622379]  vmballoon_vmci_cleanup+0x6e/0xf0 [vmw_balloon]
[ 1088.622381]  vmballoon_exit+0x18/0xcc8 [vmw_balloon]
[ 1088.622394]  __x64_sys_delete_module+0x146/0x280
[ 1088.622408]  do_syscall_64+0x5a/0x130
[ 1088.622410]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1088.622415] RIP: 0033:0x7f54f62791b7
[ 1088.622421] Code: Bad RIP value.
[ 1088.622421] RSP: 002b:7fff2a949008 EFLAGS: 0206 ORIG_RAX: 
00b0
[ 1088.622426] RAX: ffda RBX: 55dff8b55d00 RCX: 7f54f62791b7
[ 1088.622426] RDX:  RSI: 0800 RDI: 55dff8b55d68
[ 1088.622427] RBP: 55dff8b55d00 R08: 7fff2a947fb1 R09: 
[ 1088.622427] R10: 7f54f62f5cc0 R11: 0206 R12: 55dff8b55d68
[ 1088.622428] R13: 0001 R14: 55dff8b55d68 R15: 7fff2a94a3f0

The cause of the bug is that when the "delayed" doorbell is invoked, it
takes a reference on the doorbell entry and schedules work that is
supposed to run the appropriate code and drop the doorbell entry
reference. The code ignores the fact that if the work is already queued,
it will not be scheduled to run one more time. As a result, one of the
references is never dropped. When the code waits for the reference count
to reach zero, during balloon reset or module removal, it gets stuck.

Fix it. Drop the reference if schedule_work() indicates that the work is
already queued.
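
To illustrate, a rough sketch of the reference/work interplay (function
and field names here are assumed for illustration and may not match the
driver exactly):

/* Work handler: expected to drop the reference taken when queueing. */
static void dbell_delayed_dispatch(struct work_struct *work)
{
	struct dbell_entry *entry =
		container_of(work, struct dbell_entry, work);

	entry->notify_cb(entry->client_data);
	vmci_resource_put(&entry->resource);
}

/* Queueing side (hypothetical helper): if the work is already pending,
 * schedule_work() returns false and the handler runs only once, so the
 * reference taken here must be dropped immediately to stay balanced. */
static void dbell_fire(struct dbell_entry *entry)
{
	vmci_resource_get(&entry->resource);
	if (!schedule_work(&entry->work))
		vmci_resource_put(&entry->resource);
}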

Note that this bug became more apparent (or apparent at all) due to
commit ce664331b248 ("vmw_balloon: VMCI_DOORBELL_SET does not check status").

Fixes: 83e2ec765be03 ("VMCI: doorbell implementation.")
Reported-by: Francois Rigault 
Cc: Jorgen Hansen 
Cc: Adit Ranadive 
Cc: Alexios Zavras 
Cc: Vishnu DASA 
Cc: sta...@vger.kernel.org
Signed-off-by: Nadav Amit 
---
 drivers/misc/vmw_vmci/vmci_doorbell.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/misc/vmw_vmci/vmci_doorbell.c 
b/drivers/misc/vmw_vmci/vmci_doorbell.c
index bad89b6e0802..345addd9306d 100644
--- a/drivers/misc/vmw_vmci/vmci_doorbell.c
+++ b/drivers/misc/vmw_vmci/vmci_doorbell.c
@@ -310,7 +310,8 @@ int vmci_dbell_host_context_notify(u32 src_cid, struct 
vmci_handle handle)
 
entry = container_of(resource, struct dbell_entry, resource);
if (entry->run_delayed) {
-   schedule_work(&entry->work);
+   if (!schedule_work(&entry->work))
+   vmci_resource_put(resource);
} else {
entry->notify_cb(entry->client_data);
vmci_resource_put(resource);
@@ -361,7 +362,8 @@ static void dbell_fire_entries(u32 notify_idx)
atomic_read(&dbell->active) == 1) {
if (dbell->run_delayed) {
vmci_resource_get(&dbell->resource);
-   schedule_work(&dbell->work);
+   if (!schedule_work(&dbell->work))
+   vmci_resource_put(&dbell->resource);
} else {
dbell->notify_cb(dbell->client_data);
}
-- 
2.19.1



[PATCH] vmw_balloon: Fix offline page marking with compaction

2019-08-20 Thread Nadav Amit
The compaction code already marks pages as offline when it enqueues
pages in the ballooned page list, and removes the mapping when the pages
are removed from the list. The VMware balloon also updates the flags,
instead of letting the balloon-compaction logic handle it, which causes
the assertion VM_BUG_ON_PAGE(!PageOffline(page)) to fire when
__ClearPageOffline() is called a second time, resulting in the following
crash.

[  487.104520] kernel BUG at include/linux/page-flags.h:749!
[  487.106364] invalid opcode:  [#1] SMP DEBUG_PAGEALLOC PTI
[  487.107681] CPU: 7 PID: 1106 Comm: kworker/7:3 Not tainted 5.3.0-rc5balloon 
#227
[  487.109196] Hardware name: VMware, Inc. VMware Virtual Platform/440BX 
Desktop Reference Platform, BIOS 6.00 12/12/2018
[  487.111452] Workqueue: events_freezable vmballoon_work [vmw_balloon]
[  487.112779] RIP: 0010:vmballoon_release_page_list+0xaa/0x100 [vmw_balloon]
[  487.114200] Code: fe 48 c1 e7 06 4c 01 c7 8b 47 30 41 89 c1 41 81 e1 00 01 
00 f0 41 81 f9 00 00 00 f0 74 d3 48 c7 c6 08 a1 a1 c0 e8 06 0d e7 ea <0f> 0b 44 
89 f6 4c 89 c7 e8 49 9c e9 ea 49 8d 75 08 49 8b 45 08 4d
[  487.118033] RSP: 0018:b82f012bbc98 EFLAGS: 00010246
[  487.119135] RAX: 0037 RBX: 0001 RCX: 0006
[  487.120601] RDX:  RSI:  RDI: 9a85b6bd7620
[  487.122071] RBP: b82f012bbcc0 R08: 0001 R09: 
[  487.123536] R10:  R11:  R12: b82f012bbd00
[  487.125002] R13: e97f4598d9c0 R14:  R15: b82f012bbd34
[  487.126463] FS:  () GS:9a85b6bc() 
knlGS:
[  487.128110] CS:  0010 DS:  ES:  CR0: 80050033
[  487.129316] CR2: 7ffe6e413ea0 CR3: 000230b18001 CR4: 003606e0
[  487.130812] DR0:  DR1:  DR2: 
[  487.132283] DR3:  DR6: fffe0ff0 DR7: 0400
[  487.133749] Call Trace:
[  487.134333]  vmballoon_deflate+0x22c/0x390 [vmw_balloon]
[  487.135468]  vmballoon_work+0x6e7/0x913 [vmw_balloon]
[  487.136711]  ? process_one_work+0x21a/0x5e0
[  487.138581]  process_one_work+0x298/0x5e0
[  487.139926]  ? vmballoon_migratepage+0x310/0x310 [vmw_balloon]
[  487.141610]  ? process_one_work+0x298/0x5e0
[  487.143053]  worker_thread+0x41/0x400
[  487.144389]  kthread+0x12b/0x150
[  487.145582]  ? process_one_work+0x5e0/0x5e0
[  487.146937]  ? kthread_create_on_node+0x60/0x60
[  487.148637]  ret_from_fork+0x3a/0x50

Fix it by updating the PageOffline indication only when a 2MB page is
enqueued and dequeued. The 4KB pages will be handled correctly by the
balloon compaction logic.

Fixes: 83a8afa72e9c ("vmw_balloon: Compaction support")
Cc: David Hildenbrand 
Reported-by: Thomas Hellstrom 
Signed-off-by: Nadav Amit 
---
 drivers/misc/vmw_balloon.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c
index 8840299420e0..5e6be1527571 100644
--- a/drivers/misc/vmw_balloon.c
+++ b/drivers/misc/vmw_balloon.c
@@ -691,7 +691,6 @@ static int vmballoon_alloc_page_list(struct vmballoon *b,
}
 
if (page) {
-   vmballoon_mark_page_offline(page, ctl->page_size);
/* Success. Add the page to the list and continue. */
list_add(&page->lru, &ctl->pages);
continue;
@@ -930,7 +929,6 @@ static void vmballoon_release_page_list(struct list_head 
*page_list,
 
list_for_each_entry_safe(page, tmp, page_list, lru) {
list_del(&page->lru);
-   vmballoon_mark_page_online(page, page_size);
__free_pages(page, vmballoon_page_order(page_size));
}
 
@@ -1005,6 +1003,7 @@ static void vmballoon_enqueue_page_list(struct vmballoon 
*b,
enum vmballoon_page_size_type page_size)
 {
unsigned long flags;
+   struct page *page;
 
if (page_size == VMW_BALLOON_4K_PAGE) {
balloon_page_list_enqueue(&b->b_dev_info, pages);
@@ -1014,6 +1013,11 @@ static void vmballoon_enqueue_page_list(struct vmballoon 
*b,
 * for the balloon compaction mechanism.
 */
spin_lock_irqsave(&b->b_dev_info.pages_lock, flags);
+
+   list_for_each_entry(page, pages, lru) {
+   vmballoon_mark_page_offline(page, VMW_BALLOON_2M_PAGE);
+   }
+
list_splice_init(pages, &b->huge_pages);
__count_vm_events(BALLOON_INFLATE, *n_pages *
  
vmballoon_page_in_frames(VMW_BALLOON_2M_PAGE));
@@ -1056,6 +1060,8 @@ static void vmballoon_dequeue_page_list(struct vmballoon 
*b,
/* 2MB pages */
spin_lock_irqsave(&b->b_dev_info.pages_lock, flags);
list_for_eac

[PATCH] mm/balloon_compaction: suppress allocation warnings

2019-08-20 Thread Nadav Amit
There is no reason to print warnings when balloon page allocation fails,
as they are expected and can be handled gracefully.  Since VMware
balloon now uses balloon-compaction infrastructure, and suppressed these
warnings before, it is also beneficial to suppress these warnings to
keep the same behavior that the balloon had before.

Cc: Jason Wang 
Signed-off-by: Nadav Amit 
---
 mm/balloon_compaction.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index 798275a51887..26de020aae7b 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -124,7 +124,8 @@ EXPORT_SYMBOL_GPL(balloon_page_list_dequeue);
 struct page *balloon_page_alloc(void)
 {
struct page *page = alloc_page(balloon_mapping_gfp_mask() |
-  __GFP_NOMEMALLOC | __GFP_NORETRY);
+  __GFP_NOMEMALLOC | __GFP_NORETRY |
+  __GFP_NOWARN);
return page;
 }
 EXPORT_SYMBOL_GPL(balloon_page_alloc);
-- 
2.19.1



[PATCH] iommu/vt-d: Fix wrong analysis whether devices share the same bus

2019-08-20 Thread Nadav Amit
set_msi_sid_cb() is used to determine whether device aliases share the
same bus, but it can provide false indications that aliases use the same
bus when in fact they do not. The reason is that set_msi_sid_cb()
assumes that pdev is fixed, while actually pci_for_each_dma_alias() can
call fn() when pdev is set to a subordinate device.

As a result, running a VM on ESX with VT-d emulation enabled can
result in log warnings such as:

  DMAR: [INTR-REMAP] Request device [00:11.0] fault index 3b [fault reason 38] 
Blocked an interrupt request due to source-id verification failure

This seems to cause additional ata errors such as:
  ata3.00: qc timeout (cmd 0xa1)
  ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)

These timeouts also make boot take much longer and cause other errors.

Fix it by comparing the alias with the previous one instead.

Fixes: 3f0c625c6ae71 ("iommu/vt-d: Allow interrupts from the entire bus for 
aliased devices")
Cc: sta...@vger.kernel.org
Cc: Logan Gunthorpe 
Cc: David Woodhouse 
Cc: Joerg Roedel 
Cc: Jacob Pan 
Signed-off-by: Nadav Amit 
---
 drivers/iommu/intel_irq_remapping.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/intel_irq_remapping.c 
b/drivers/iommu/intel_irq_remapping.c
index 4786ca061e31..81e43c1df7ec 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -376,13 +376,13 @@ static int set_msi_sid_cb(struct pci_dev *pdev, u16 
alias, void *opaque)
 {
struct set_msi_sid_data *data = opaque;
 
+   if (data->count == 0 || PCI_BUS_NUM(alias) == PCI_BUS_NUM(data->alias))
+   data->busmatch_count++;
+
data->pdev = pdev;
data->alias = alias;
data->count++;
 
-   if (PCI_BUS_NUM(alias) == pdev->bus->number)
-   data->busmatch_count++;
-
return 0;
 }
 
-- 
2.17.1



Re: [PATCH v3 8/9] x86/mm/tlb: Remove UV special case

2019-07-30 Thread Nadav Amit
> On Jul 18, 2019, at 7:25 PM, Mike Travis  wrote:
> 
> It is a fact that the UV is still the UV and SGI is now part of HPE. The
> current external product is known as SuperDome Flex. It is both up to date
> as well as very well maintained. The ACK I provided was an okay to change
> the code, but please make the description accurate.

Looking at the code again, if I remove the call to uv_flush_tlb_others(), is
there any use for tlb_uv.c?



Re: [PATCH v3 1/9] smp: Run functions concurrently in smp_call_function_many()

2019-07-25 Thread Nadav Amit
> On Jul 25, 2019, at 5:36 AM, Thomas Gleixner  wrote:
> 
> On Mon, 22 Jul 2019, Nadav Amit wrote:
>>> On Jul 22, 2019, at 11:51 AM, Thomas Gleixner  wrote:
>>> void on_each_cpu(void (*func) (void *info), void *info, int wait)
>>> {
>>>   unsigned long flags;
>>> 
>>>   preempt_disable();
>>> smp_call_function(func, info, wait);
>>> 
>>> smp_call_function() has another preempt_disable as it can be called
>>> separately and it does:
>>> 
>>>   preempt_disable();
>>>   smp_call_function_many(cpu_online_mask, func, info, wait);
>>> 
>>> Your new on_each_cpu() implementation does not. So there is a
>>> difference. Whether it matters or not is a different question, but that
>>> needs to be explained and documented.
>> 
>> Thanks for explaining - so your concern is for CPUs being offlined.
>> 
>> But unless I am missing something: on_each_cpu() calls __on_each_cpu_mask(),
>> which disables preemption and calls __smp_call_function_many().
>> 
>> Then  __smp_call_function_many() runs:
>> 
>>  cpumask_and(cfd->cpumask, mask, cpu_online_mask);
>> 
>> … before choosing which remote CPUs should run the function. So the only
>> case that I was missing is if the current CPU goes away and the function is
>> called locally.
>> 
>> Can it happen? I can add documentation and a debug assertion for this case.
> 
> I don't think it can happen:
> 
>  on_each_cpu()
>on_each_cpu_mask()
>  preempt_disable()
>__smp_call_function_many()
> 
> So if a CPU goes offline between on_each_cpu() and preempt_disable() then
> there is no damage. After the preempt_disable() it can't go away anymore
> and the task executing this cannot be migrated either.
> 
> So yes, it's safe, but please add a big fat comment so future readers won't
> be puzzled.

I will do. I will need some more time to respin the next version. I see that
what I build on top of it might require some changes, and I want to minimize
them.



Re: [RFC 7/7] x86/current: Aggressive caching of current

2019-07-22 Thread Nadav Amit
> On Jul 22, 2019, at 2:07 PM, Peter Zijlstra  wrote:
> 
> On Thu, Jul 18, 2019 at 10:41:10AM -0700, Nadav Amit wrote:
>> The current_task is supposed to be constant in each thread and therefore
>> does not need to be reread. There is already an attempt to cache it
>> using inline assembly, using this_cpu_read_stable(), which hides the
>> dependency on the read memory address.
> 
> Is that what it does?!, I never quite could understand
> percpu_stable_op().

That’s my understanding. I am not too pleased that I could not come up with
a general alternative to this_cpu_read_stable(), mainly because gcc does not
provide a way to get the type without the segment qualifier. Anyhow,
“current” seems to be the main pain-point.

I think a similar const-alias approach can also be used for stuff like
boot_cpu_has(). I have some patches for that somewhere, but the impact is
smaller. I do see some small, but measurable performance improvements with
this series. I’ll try to incorporate them in v1 once I have time.

Re: [RFC 3/7] x86/percpu: Use C for percpu accesses when possible

2019-07-22 Thread Nadav Amit
> On Jul 22, 2019, at 1:52 PM, Peter Zijlstra  wrote:
> 
> On Thu, Jul 18, 2019 at 10:41:06AM -0700, Nadav Amit wrote:
> 
>> diff --git a/arch/x86/include/asm/preempt.h b/arch/x86/include/asm/preempt.h
>> index 99a7fa9ab0a3..60f97b288004 100644
>> --- a/arch/x86/include/asm/preempt.h
>> +++ b/arch/x86/include/asm/preempt.h
>> @@ -91,7 +91,8 @@ static __always_inline void __preempt_count_sub(int val)
>>  */
>> static __always_inline bool __preempt_count_dec_and_test(void)
>> {
>> -return GEN_UNARY_RMWcc("decl", __preempt_count, e, __percpu_arg([var]));
>> +return GEN_UNARY_RMWcc("decl", __my_cpu_var(__preempt_count), e,
>> +   __percpu_arg([var]));
>> }
> 
> Should this be in the previous patch?

Yes, it should.

Re: [PATCH v3 3/9] x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()

2019-07-22 Thread Nadav Amit
> On Jul 22, 2019, at 12:47 PM, Rasmus Villemoes  
> wrote:
> 
> On 19/07/2019 02.58, Nadav Amit wrote:
> 
>> /*
>> @@ -865,7 +893,7 @@ void arch_tlbbatch_flush(struct 
>> arch_tlbflush_unmap_batch *batch)
>>  if (cpumask_test_cpu(cpu, &batch->cpumask)) {
>>  lockdep_assert_irqs_enabled();
>>  local_irq_disable();
>> -flush_tlb_func_local(&full_flush_tlb_info);
>> +flush_tlb_func_local((void *)&full_flush_tlb_info);
>>  local_irq_enable();
>>  }
> 
> I think the confusion could be cleared up if you moved this hunk to
> patch 2 where it belongs - i.e. where you change the prototype of
> flush_tlb_func_local() and hence introduce the warning.

Yes, there is a small mess here - the constification should actually go
to a different patch… I’ll fix it.

Re: [PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-07-22 Thread Nadav Amit
> On Jul 22, 2019, at 12:14 PM, Peter Zijlstra  wrote:
> 
> On Thu, Jul 18, 2019 at 05:58:32PM -0700, Nadav Amit wrote:
>> @@ -709,8 +716,9 @@ void native_flush_tlb_others(const struct cpumask 
>> *cpumask,
>>   * doing a speculative memory access.
>>   */
>>  if (info->freed_tables) {
>> -smp_call_function_many(cpumask, flush_tlb_func_remote,
>> -   (void *)info, 1);
>> +__smp_call_function_many(cpumask, flush_tlb_func_remote,
>> + flush_tlb_func_local,
>> + (void *)info, 1);
>>  } else {
>>  /*
>>   * Although we could have used on_each_cpu_cond_mask(),
>> @@ -737,7 +745,8 @@ void native_flush_tlb_others(const struct cpumask 
>> *cpumask,
>>  if (tlb_is_not_lazy(cpu))
>>  __cpumask_set_cpu(cpu, cond_cpumask);
>>  }
>> -smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
>> +__smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
>> + flush_tlb_func_local,
>>   (void *)info, 1);
>>  }
>> }
> 
> Do we really need that _local/_remote distinction? ISTR you had a patch
> that frobbed flush_tlb_info into the csd and that gave space
> constraints, but I'm not seeing that here (probably a wise, get stuff
> merged etc..).
> 
> struct __call_single_data {
>struct llist_node  llist;/* 0 8 */
>smp_call_func_tfunc; /* 8 8 */
>void * info; /*16 8 */
>unsigned int   flags;/*24 4 */
> 
>/* size: 32, cachelines: 1, members: 4 */
>/* padding: 4 */
>/* last cacheline: 32 bytes */
> };
> 
> struct flush_tlb_info {
>struct mm_struct * mm;   /* 0 8 */
>long unsigned int  start;/* 8 8 */
>long unsigned int  end;  /*16 8 */
>u64new_tlb_gen;  /*24 8 */
>unsigned int   stride_shift; /*32 4 */
>bool   freed_tables; /*36 1 */
> 
>/* size: 40, cachelines: 1, members: 6 */
>/* padding: 3 */
>/* last cacheline: 40 bytes */
> };
> 
> IIRC what you did was make void *__call_single_data::info the last
> member and a union until the full cacheline size (64). Given the above
> that would get us 24 bytes for csd, leaving us 40 for that
> flush_tlb_info.
> 
> But then we can still do something like the below, which doesn't change
> things and still gets rid of that dual function crud, simplifying
> smp_call_function_many again.
> 
> Index: linux-2.6/arch/x86/include/asm/tlbflush.h
> ===
> --- linux-2.6.orig/arch/x86/include/asm/tlbflush.h
> +++ linux-2.6/arch/x86/include/asm/tlbflush.h
> @@ -546,8 +546,9 @@ struct flush_tlb_info {
>   unsigned long   start;
>   unsigned long   end;
>   u64 new_tlb_gen;
> - unsigned intstride_shift;
> - boolfreed_tables;
> + unsigned intcpu;
> + unsigned short  stride_shift;
> + unsigned char   freed_tables;
> };
> 
> #define local_flush_tlb() __flush_tlb()
> Index: linux-2.6/arch/x86/mm/tlb.c
> ===
> --- linux-2.6.orig/arch/x86/mm/tlb.c
> +++ linux-2.6/arch/x86/mm/tlb.c
> @@ -659,6 +659,27 @@ static void flush_tlb_func_remote(void *
>   flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
> }
> 
> +static void flush_tlb_func(void *info)
> +{
> + const struct flush_tlb_info *f = info;
> + enum tlb_flush_reason reason = TLB_REMOTE_SHOOTDOWN;
> + bool local = false;
> +
> + if (f->cpu == smp_processor_id()) {
> + local = true;
> + reason = (f->mm == NULL) ? TLB_LOCAL_SHOOTDOWN : 
> TLB_LOCAL_MM_SHOOTDOWN;
> + } else {
> + inc_irq_stat(irq_tlb_count);
> +
> + if (f->mm && f->mm != this_cpu_read(cpu_tlbstate.loaded_mm))
> + return;
> +
> + count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
> + }
> +
> + flush_tlb_func_common(f, local, 

Re: [PATCH v3 1/9] smp: Run functions concurrently in smp_call_function_many()

2019-07-22 Thread Nadav Amit
> On Jul 22, 2019, at 11:51 AM, Thomas Gleixner  wrote:
> 
> On Mon, 22 Jul 2019, Nadav Amit wrote:
>>> On Jul 22, 2019, at 11:37 AM, Thomas Gleixner  wrote:
>>> 
>>> On Mon, 22 Jul 2019, Peter Zijlstra wrote:
>>> 
>>>> On Thu, Jul 18, 2019 at 05:58:29PM -0700, Nadav Amit wrote:
>>>>> +/*
>>>>> + * Call a function on all processors.  May be used during early boot 
>>>>> while
>>>>> + * early_boot_irqs_disabled is set.
>>>>> + */
>>>>> +static inline void on_each_cpu(smp_call_func_t func, void *info, int 
>>>>> wait)
>>>>> +{
>>>>> + on_each_cpu_mask(cpu_online_mask, func, info, wait);
>>>>> +}
>>>> 
>>>> I'm thinking that one if buggy, nothing protects online mask here.
>>> 
>>> The current implementation has preemption disabled before touching
>>> cpu_online_mask which at least protects against a CPU going away as that
>>> prevents the stomp machine thread from getting on the CPU. But it's not
>>> protected against a CPU coming online concurrently.
>> 
>> I still don’t understand. If you called cpu_online_mask() and did not
>> disable preemption before calling it, you are already (today) not protected
>> against another CPU coming online. Disabling preemption in on_each_cpu()
>> will not solve it.
> 
> Disabling preemption _cannot_ protect against a CPU coming online. It only
> can protect against a CPU being offlined.
> 
> The current implementation of on_each_cpu() disables preemption _before_
> touching cpu_online_mask.
> 
> void on_each_cpu(void (*func) (void *info), void *info, int wait)
> {
>unsigned long flags;
> 
>preempt_disable();
>   smp_call_function(func, info, wait);
> 
> smp_call_function() has another preempt_disable as it can be called
> separately and it does:
> 
>preempt_disable();
>smp_call_function_many(cpu_online_mask, func, info, wait);
> 
> Your new on_each_cpu() implementation does not. So there is a
> difference. Whether it matters or not is a different question, but that
> needs to be explained and documented.

Thanks for explaining - so your concern is for CPUs being offlined.

But unless I am missing something: on_each_cpu() calls __on_each_cpu_mask(),
which disables preemption and calls __smp_call_function_many().

Then  __smp_call_function_many() runs:

cpumask_and(cfd->cpumask, mask, cpu_online_mask);

… before choosing which remote CPUs should run the function. So the only
case that I was missing is if the current CPU goes away and the function is
called locally.

Can it happen? I can add documentation and a debug assertion for this case.



Re: [PATCH v3 1/9] smp: Run functions concurrently in smp_call_function_many()

2019-07-22 Thread Nadav Amit
> On Jul 22, 2019, at 11:16 AM, Peter Zijlstra  wrote:
> 
> On Fri, Jul 19, 2019 at 11:23:06AM -0700, Dave Hansen wrote:
>> On 7/18/19 5:58 PM, Nadav Amit wrote:
>>> @@ -624,16 +622,11 @@ EXPORT_SYMBOL(on_each_cpu);
>>> void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
>>> void *info, bool wait)
>>> {
>>> -   int cpu = get_cpu();
>>> +   preempt_disable();
>>> 
>>> -   smp_call_function_many(mask, func, info, wait);
>>> -   if (cpumask_test_cpu(cpu, mask)) {
>>> -   unsigned long flags;
>>> -   local_irq_save(flags);
>>> -   func(info);
>>> -   local_irq_restore(flags);
>>> -   }
>>> -   put_cpu();
>>> +   __smp_call_function_many(mask, func, func, info, wait);
>>> +
>>> +   preempt_enable();
>>> }
>> 
>> The get_cpu() was missing it too, but it would be nice to add some
>> comments about why preempt needs to be off.  I was also thinking it
>> might make sense to do:
>> 
>>  cfd = get_cpu_var(cfd_data);
>>  __smp_call_function_many(cfd, ...);
>>  put_cpu_var(cfd_data);
>>  
>> instead of the explicit preempt_enable/disable(), but I don't feel too
>> strongly about it.
> 
> It is also required for cpu hotplug.

But then smpcfd_dead_cpu() will not respect the “cpu” argument. Do you still
prefer it this way (instead of the current preempt_enable() /
preempt_disable())?

Re: [PATCH v3 1/9] smp: Run functions concurrently in smp_call_function_many()

2019-07-22 Thread Nadav Amit
> On Jul 22, 2019, at 11:37 AM, Thomas Gleixner  wrote:
> 
> On Mon, 22 Jul 2019, Peter Zijlstra wrote:
> 
>> On Thu, Jul 18, 2019 at 05:58:29PM -0700, Nadav Amit wrote:
>>> +/*
>>> + * Call a function on all processors.  May be used during early boot while
>>> + * early_boot_irqs_disabled is set.
>>> + */
>>> +static inline void on_each_cpu(smp_call_func_t func, void *info, int wait)
>>> +{
>>> +   on_each_cpu_mask(cpu_online_mask, func, info, wait);
>>> +}
>> 
>> I'm thinking that one if buggy, nothing protects online mask here.
> 
> The current implementation has preemption disabled before touching
> cpu_online_mask which at least protects against a CPU going away as that
> prevents the stomp machine thread from getting on the CPU. But it's not
> protected against a CPU coming online concurrently.

I still don’t understand. If you called cpu_online_mask() and did not
disable preemption before calling it, you are already (today) not protected
against another CPU coming online. Disabling preemption in on_each_cpu()
will not solve it.



Re: [PATCH v3 1/9] smp: Run functions concurrently in smp_call_function_many()

2019-07-22 Thread Nadav Amit
> On Jul 22, 2019, at 11:21 AM, Peter Zijlstra  wrote:
> 
> On Thu, Jul 18, 2019 at 05:58:29PM -0700, Nadav Amit wrote:
>> +/*
>> + * Call a function on all processors.  May be used during early boot while
>> + * early_boot_irqs_disabled is set.
>> + */
>> +static inline void on_each_cpu(smp_call_func_t func, void *info, int wait)
>> +{
>> +on_each_cpu_mask(cpu_online_mask, func, info, wait);
>> +}
> 
> I'm thinking that one if buggy, nothing protects online mask here.

on_each_cpu_mask() calls __on_each_cpu_mask() which would disable preemption.
The mask might change, but anyhow __smp_call_function_many() would “and” it,
after disabling preemption, with (the potentially updated) cpu_online_mask.

What is your concern?

Re: [PATCH v3 5/9] x86/mm/tlb: Privatize cpu_tlbstate

2019-07-21 Thread Nadav Amit
> On Jul 19, 2019, at 11:38 AM, Dave Hansen  wrote:
> 
> On 7/18/19 5:58 PM, Nadav Amit wrote:
>> +struct tlb_state_shared {
>> +/*
>> + * We can be in one of several states:
>> + *
>> + *  - Actively using an mm.  Our CPU's bit will be set in
>> + *mm_cpumask(loaded_mm) and is_lazy == false;
>> + *
>> + *  - Not using a real mm.  loaded_mm == &init_mm.  Our CPU's bit
>> + *will not be set in mm_cpumask(&init_mm) and is_lazy == false.
>> + *
>> + *  - Lazily using a real mm.  loaded_mm != &init_mm, our bit
>> + *is set in mm_cpumask(loaded_mm), but is_lazy == true.
>> + *We're heuristically guessing that the CR3 load we
>> + *skipped more than makes up for the overhead added by
>> + *lazy mode.
>> + */
>> +bool is_lazy;
>> +};
>> +DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state_shared, 
>> cpu_tlbstate_shared);
> 
> Could we get a comment about what "shared" means and why we need shared
> state?
> 
> Should we change 'tlb_state' to 'tlb_state_private’?

I don’t feel strongly about either one. I preferred the one that is likely
to cause fewer changes and potential conflicts. Anyhow, I would add a better
comment as you asked for.

So it is really up to you.



Re: [PATCH v3 3/9] x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()

2019-07-19 Thread Nadav Amit
> On Jul 19, 2019, at 3:44 PM, Joe Perches  wrote:
> 
> On Fri, 2019-07-19 at 18:41 +0000, Nadav Amit wrote:
>>> On Jul 19, 2019, at 11:36 AM, Dave Hansen  wrote:
>>> 
>>> On 7/18/19 5:58 PM, Nadav Amit wrote:
>>>> @@ -865,7 +893,7 @@ void arch_tlbbatch_flush(struct 
>>>> arch_tlbflush_unmap_batch *batch)
>>>>if (cpumask_test_cpu(cpu, &batch->cpumask)) {
>>>>lockdep_assert_irqs_enabled();
>>>>local_irq_disable();
>>>> -  flush_tlb_func_local(&full_flush_tlb_info);
>>>> +  flush_tlb_func_local((void *)&full_flush_tlb_info);
>>>>local_irq_enable();
>>>>}
>>> 
>>> This looks like superfluous churn.  Is it?
>> 
>> Unfortunately not, since full_flush_tlb_info is defined as const. Without it
>> you would get:
>> 
>> warning: passing argument 1 of ‘flush_tlb_func_local’ discards ‘const’ 
>> qualifier from pointer target type [-Wdiscarded-qualifiers]
>> 
>> And flush_tlb_func_local() should get (void *) argument since it is also
>> used by the SMP infrastructure.
> 
> huh?
> 
> commit 3db6d5a5ecaf0a778d721ccf9809248350d4bfaf
> Author: Nadav Amit 
> Date:   Thu Apr 25 16:01:43 2019 -0700
> 
> []
> 
> -static void flush_tlb_func_local(void *info, enum tlb_flush_reason reason)
> +static void flush_tlb_func_local(const void *info, enum tlb_flush_reason 
> reason)

Hopefully I deciphered the “huh?” correctly...

When we moved the flush_tlb_info off the stack, in the patch that you
reference, we created a global full_flush_tlb_info variable, which was set
as const and delivered to flush_tlb_func_local. I changed the signature of
the function to avoid casting. IIRC, setting the variable as constant
slightly improved the generated assembly, and anyhow made sense.

In this patch-set I need to change the first argument of
flush_tlb_func_local to be non-const to match the SMP function signature
which takes a single non-const void * argument. Yes, there is what seems to
be non-safe casting, but flush_tlb_func_common() casts it back into a const
variable.

I know that I also added the infrastructure to run a function locally in the
SMP infrastructure, but it made sense to have the same signature for local
functions and remote ones.

Does it make more sense now?



Re: [PATCH v3 5/9] x86/mm/tlb: Privatize cpu_tlbstate

2019-07-19 Thread Nadav Amit
> On Jul 19, 2019, at 11:48 AM, Dave Hansen  wrote:
> 
> On 7/19/19 11:43 AM, Nadav Amit wrote:
>> Andy said that for the lazy tlb optimizations there might soon be more
>> shared state. If you prefer, I can move is_lazy outside of tlb_state, and
>> not set it in any alternative struct.
> 
> I just wanted to make sure that we capture these rules:
> 
> 1. If the data is only ever accessed on the "owning" CPU via
>   this_cpu_*(), put it in 'tlb_state'.
> 2. If the data is read by other CPUs, put it in 'tlb_state_shared'.
> 
> I actually like the idea of having two structs.

Yes, that’s exactly the idea. In the (1) case, we may even be able to mark
the struct with __thread qualifier, which IIRC would prevent memory barriers
from causing these values to be reread.

Re: [PATCH v3 5/9] x86/mm/tlb: Privatize cpu_tlbstate

2019-07-19 Thread Nadav Amit
> On Jul 19, 2019, at 11:38 AM, Dave Hansen  wrote:
> 
> On 7/18/19 5:58 PM, Nadav Amit wrote:
>> +struct tlb_state_shared {
>> +/*
>> + * We can be in one of several states:
>> + *
>> + *  - Actively using an mm.  Our CPU's bit will be set in
>> + *mm_cpumask(loaded_mm) and is_lazy == false;
>> + *
>> + *  - Not using a real mm.  loaded_mm == &init_mm.  Our CPU's bit
>> + *will not be set in mm_cpumask(&init_mm) and is_lazy == false.
>> + *
>> + *  - Lazily using a real mm.  loaded_mm != &init_mm, our bit
>> + *is set in mm_cpumask(loaded_mm), but is_lazy == true.
>> + *We're heuristically guessing that the CR3 load we
>> + *skipped more than makes up for the overhead added by
>> + *lazy mode.
>> + */
>> +bool is_lazy;
>> +};
>> +DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state_shared, 
>> cpu_tlbstate_shared);
> 
> Could we get a comment about what "shared" means and why we need shared
> state?
> 
> Should we change 'tlb_state' to 'tlb_state_private’?

Andy said that for the lazy tlb optimizations there might soon be more
shared state. If you prefer, I can move is_lazy outside of tlb_state, and
not set it in any alternative struct.



Re: [PATCH v3 3/9] x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()

2019-07-19 Thread Nadav Amit
> On Jul 19, 2019, at 11:36 AM, Dave Hansen  wrote:
> 
> On 7/18/19 5:58 PM, Nadav Amit wrote:
>> @@ -865,7 +893,7 @@ void arch_tlbbatch_flush(struct 
>> arch_tlbflush_unmap_batch *batch)
>>  if (cpumask_test_cpu(cpu, &batch->cpumask)) {
>>  lockdep_assert_irqs_enabled();
>>  local_irq_disable();
>> -flush_tlb_func_local(&full_flush_tlb_info);
>> +flush_tlb_func_local((void *)&full_flush_tlb_info);
>>  local_irq_enable();
>>  }
> 
> This looks like superfluous churn.  Is it?

Unfortunately not, since full_flush_tlb_info is defined as const. Without it
you would get:

warning: passing argument 1 of ‘flush_tlb_func_local’ discards ‘const’ 
qualifier from pointer target type [-Wdiscarded-qualifiers]

And flush_tlb_func_local() should get (void *) argument since it is also
used by the SMP infrastructure.

Re: [PATCH v3 8/9] x86/mm/tlb: Remove UV special case

2019-07-18 Thread Nadav Amit
> On Jul 18, 2019, at 7:25 PM, Mike Travis  wrote:
> 
> It is a fact that the UV is still the UV and SGI is now part of HPE. The 
> current external product is known as SuperDome Flex.  It is both up to date 
> as well as very well maintained.  The ACK I provided was an okay to change 
> the code, but please make the description accurate.

Indeed. Sorry for that - I will update it in v4. I guess you will be ok with
me copy-pasting parts of your response into the commit log.



[RFC 6/7] x86/percpu: Optimized arch_raw_cpu_ptr()

2019-07-18 Thread Nadav Amit
Implementing arch_raw_cpu_ptr() in C allows the compiler to perform
better optimizations, such as setting an appropriate base to compute
the address instead of using an add instruction.

The benefit of this computation is only relevant when compiler segment
qualifiers are used. This method cannot be used when the address size
is greater than the maximum operand size, as is the case when building
vdso32.

Distinguish between the two cases in which preemption is disabled (as
happens when this_cpu_ptr() is used) and enabled (when raw_cpu_ptr() is
used).

This allows further optimizations. For instance, in
rcu_dynticks_eqs_exit() the following code:

  mov    $0x2bbc0,%rax
  add    %gs:0x7ef07570(%rip),%rax        # 0x10358
  lock xadd %edx,0xd8(%rax)

turns with this patch into:

  mov    %gs:0x7ef08aa5(%rip),%rax        # 0x10358
  lock xadd %edx,0x2bc58(%rax)

Signed-off-by: Nadav Amit 
---
 arch/x86/include/asm/percpu.h | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 13987f9bc82f..8bac7db397cc 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -73,20 +73,41 @@
 #endif /* USE_X86_SEG_SUPPORT */
 
 #define __my_cpu_offsetthis_cpu_read(this_cpu_off)
+#define __raw_my_cpu_offset__this_cpu_read(this_cpu_off)
+#define __my_cpu_ptr(ptr)  (__my_cpu_type(*ptr) *)(uintptr_t)(ptr)
 
+#if USE_X86_SEG_SUPPORT && (!defined(BUILD_VDSO32) || defined(CONFIG_X86_64))
+/*
+ * Efficient implementation for cases in which the compiler supports C 
segments.
+ * Allows the compiler to perform additional optimizations that can save more
+ * instructions.
+ *
+ * This optimized version can only be used if the pointer size equals to native
+ * operand size, which does not happen when vdso32 is used.
+ */
+#define __arch_raw_cpu_ptr_qual(qual, ptr) \
+({ \
+   (qual typeof(*(ptr)) __kernel __force *)((uintptr_t)(ptr) + \
+   __my_cpu_offset);   \
+})
+#else /* USE_X86_SEG_SUPPORT && (!defined(BUILD_VDSO32) || 
defined(CONFIG_X86_64)) */
 /*
  * Compared to the generic __my_cpu_offset version, the following
  * saves one instruction and avoids clobbering a temp register.
  */
-#define arch_raw_cpu_ptr(ptr)  \
+#define __arch_raw_cpu_ptr_qual(qual, ptr) \
 ({ \
unsigned long tcp_ptr__;\
-   asm volatile("add " __percpu_arg(1) ", %0"  \
+   asm qual ("add " __percpu_arg(1) ", %0" \
 : "=r" (tcp_ptr__) \
 : "m" (__my_cpu_var(this_cpu_off)),\
   "0" (ptr));  \
(typeof(*(ptr)) __kernel __force *)tcp_ptr__;   \
 })
+#endif /* USE_X86_SEG_SUPPORT && (!defined(BUILD_VDSO32) || 
defined(CONFIG_X86_64)) */
+
+#define arch_raw_cpu_ptr(ptr)  
__arch_raw_cpu_ptr_qual(volatile, ptr)
+#define arch_raw_cpu_ptr_preemption_disabled(ptr)  
__arch_raw_cpu_ptr_qual( , ptr)
 #else /* CONFIG_SMP */
 #define __percpu_seg_override
 #define __percpu_prefix""
-- 
2.17.1



[RFC 4/7] x86: Fix possible caching of current_task

2019-07-18 Thread Nadav Amit
this_cpu_read_stable() is allowed and supposed to cache and return the
same value, specifically for current_task. In practice it does not cache
current_task very well, which is the only thing that currently prevents
invalid caching when the task is switched in __switch_to().

Fix the possible invalid caching by avoiding the use of current in
__switch_to()'s dynamic extent.

Signed-off-by: Nadav Amit 
---
 arch/x86/include/asm/fpu/internal.h|  7 ---
 arch/x86/include/asm/resctrl_sched.h   | 14 +++---
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |  4 ++--
 arch/x86/kernel/process_32.c   |  4 ++--
 arch/x86/kernel/process_64.c   |  4 ++--
 5 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h 
b/arch/x86/include/asm/fpu/internal.h
index 4c95c365058a..b537788600fe 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -588,9 +588,10 @@ static inline void switch_fpu_prepare(struct fpu *old_fpu, 
int cpu)
 
 /*
  * Load PKRU from the FPU context if available. Delay loading of the
- * complete FPU state until the return to userland.
+ * complete FPU state until the return to userland. Avoid reading current 
during
+ * switch.
  */
-static inline void switch_fpu_finish(struct fpu *new_fpu)
+static inline void switch_fpu_finish(struct task_struct *task, struct fpu 
*new_fpu)
 {
u32 pkru_val = init_pkru_value;
struct pkru_state *pk;
@@ -598,7 +599,7 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
if (!static_cpu_has(X86_FEATURE_FPU))
return;
 
-   set_thread_flag(TIF_NEED_FPU_LOAD);
+   set_ti_thread_flag(task_thread_info(task), TIF_NEED_FPU_LOAD);
 
if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
return;
diff --git a/arch/x86/include/asm/resctrl_sched.h 
b/arch/x86/include/asm/resctrl_sched.h
index f6b7fe2833cc..9a00d9df9d02 100644
--- a/arch/x86/include/asm/resctrl_sched.h
+++ b/arch/x86/include/asm/resctrl_sched.h
@@ -51,7 +51,7 @@ DECLARE_STATIC_KEY_FALSE(rdt_mon_enable_key);
  *   simple as possible.
  * Must be called with preemption disabled.
  */
-static void __resctrl_sched_in(void)
+static void __resctrl_sched_in(struct task_struct *task)
 {
struct resctrl_pqr_state *state = this_cpu_ptr(&pqr_state);
u32 closid = state->default_closid;
@@ -62,13 +62,13 @@ static void __resctrl_sched_in(void)
 * Else use the closid/rmid assigned to this cpu.
 */
if (static_branch_likely(&rdt_alloc_enable_key)) {
-   if (current->closid)
+   if (task->closid)
closid = current->closid;
}
 
if (static_branch_likely(&rdt_mon_enable_key)) {
-   if (current->rmid)
-   rmid = current->rmid;
+   if (task->rmid)
+   rmid = task->rmid;
}
 
if (closid != state->cur_closid || rmid != state->cur_rmid) {
@@ -78,15 +78,15 @@ static void __resctrl_sched_in(void)
}
 }
 
-static inline void resctrl_sched_in(void)
+static inline void resctrl_sched_in(struct task_struct *task)
 {
if (static_branch_likely(&rdt_enable_key))
-   __resctrl_sched_in();
+   __resctrl_sched_in(task);
 }
 
 #else
 
-static inline void resctrl_sched_in(void) {}
+static inline void resctrl_sched_in(struct task_struct *task) {}
 
 #endif /* CONFIG_X86_CPU_RESCTRL */
 
diff --git a/arch/x86/kernel/cpu/resctrl/rdtgroup.c 
b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
index bf3034994754..71bd82a6e3c6 100644
--- a/arch/x86/kernel/cpu/resctrl/rdtgroup.c
+++ b/arch/x86/kernel/cpu/resctrl/rdtgroup.c
@@ -311,7 +311,7 @@ static void update_cpu_closid_rmid(void *info)
 * executing task might have its own closid selected. Just reuse
 * the context switch code.
 */
-   resctrl_sched_in();
+   resctrl_sched_in(current);
 }
 
 /*
@@ -536,7 +536,7 @@ static void move_myself(struct callback_head *head)
 
preempt_disable();
/* update PQR_ASSOC MSR to make resource group go into effect */
-   resctrl_sched_in();
+   resctrl_sched_in(current);
preempt_enable();
 
kfree(callback);
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index b8ceec4974fe..699a4c95ab13 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -292,10 +292,10 @@ __switch_to(struct task_struct *prev_p, struct 
task_struct *next_p)
 
this_cpu_write(current_task, next_p);
 
-   switch_fpu_finish(next_fpu);
+   switch_fpu_finish(next_p, next_fpu);
 
/* Load the Intel cache allocation PQR MSR. */
-   resctrl_sched_in();
+   resctrl_sched_in(next_p);
 
return prev_p;
 }
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 250e4c4ac6d9..e945bc744804 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -575,7 +575,7 @@ __switch

[RFC 3/7] x86/percpu: Use C for percpu accesses when possible

2019-07-18 Thread Nadav Amit
The percpu code mostly uses inline assembly. Using segment qualifiers
allows C code to be used instead, which enables the compiler to perform
various optimizations (e.g., CSE). For example, in __schedule() the
following two instructions:

  mov    %gs:0x7e5f1eff(%rip),%edx        # 0x10350
  movslq %edx,%rdx

turn with this patch into:

  movslq %gs:0x7e5f2e6e(%rip),%rax        # 0x10350

In addition, operations that have no guarantee against concurrent
interrupts or preemption, such as __this_cpu_cmpxchg(), can be further
optimized by the compiler when they are implemented in C, for example
in call_timer_fn().

Signed-off-by: Nadav Amit 
---
 arch/x86/include/asm/percpu.h  | 115 ++---
 arch/x86/include/asm/preempt.h |   3 +-
 2 files changed, 107 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 1fe348884477..13987f9bc82f 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -439,13 +439,88 @@ do {  
\
  */
 #define this_cpu_read_stable(var)  percpu_stable_op("mov", var)
 
+#if USE_X86_SEG_SUPPORT
+
+#define __raw_cpu_read(qual, pcp)  \
+({ \
+   *(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp));   \
+})
+
+#define __raw_cpu_write(qual, pcp, val)
\
+   do {\
+   *(qual __my_cpu_type(pcp) *)__my_cpu_ptr(&(pcp)) = (val); \
+   } while (0)
+
+/*
+ * Performance-wise, C operations are only more efficient than their inline
+ * assembly counterparts for non-volatile variables (__this_*) and for volatile
+ * loads and stores.
+ *
+ * Since we do not use assembly, we are free to define 64-bit operations
+ * on 32-bit architecture
+ */
+#define __raw_cpu_add(pcp, val)do { __my_cpu_var(pcp) += 
(val); } while (0)
+#define __raw_cpu_and(pcp, val)do { __my_cpu_var(pcp) &= 
(val); } while (0)
+#define __raw_cpu_or(pcp, val) do { __my_cpu_var(pcp) |= (val); } 
while (0)
+#define __raw_cpu_add_return(pcp, val) ({ __my_cpu_var(pcp) += (val); })
+
+#define __raw_cpu_xchg(pcp, val)   \
+({ \
+   typeof(pcp) pxo_ret__ = __my_cpu_var(pcp);  \
+   \
+   __my_cpu_var(pcp) = (val);  \
+   pxo_ret__;  \
+})
+
+#define __raw_cpu_cmpxchg(pcp, oval, nval) \
+({ \
+   __my_cpu_type(pcp) *__p = __my_cpu_ptr(&(pcp)); \
+   \
+   typeof(pcp) __ret = *__p;   \
+   \
+   if (__ret == (oval))\
+   *__p = nval;\
+   __ret;  \
+})
+
+#define raw_cpu_read_1(pcp)__raw_cpu_read(, pcp)
+#define raw_cpu_read_2(pcp)__raw_cpu_read(, pcp)
+#define raw_cpu_read_4(pcp)__raw_cpu_read(, pcp)
+#define raw_cpu_write_1(pcp, val)  __raw_cpu_write(, pcp, val)
+#define raw_cpu_write_2(pcp, val)  __raw_cpu_write(, pcp, val)
+#define raw_cpu_write_4(pcp, val)  __raw_cpu_write(, pcp, val)
+#define raw_cpu_add_1(pcp, val)__raw_cpu_add(pcp, val)
+#define raw_cpu_add_2(pcp, val)__raw_cpu_add(pcp, val)
+#define raw_cpu_add_4(pcp, val)__raw_cpu_add(pcp, val)
+#define raw_cpu_and_1(pcp, val)__raw_cpu_and(pcp, val)
+#define raw_cpu_and_2(pcp, val)__raw_cpu_and(pcp, val)
+#define raw_cpu_and_4(pcp, val)__raw_cpu_and(pcp, val)
+#define raw_cpu_or_1(pcp, val) __raw_cpu_or(pcp, val)
+#define raw_cpu_or_2(pcp, val) __raw_cpu_or(pcp, val)
+#define raw_cpu_or_4(pcp, val) __raw_cpu_or(pcp, val)
+#define raw_cpu_xchg_1(pcp, val)   __raw_cpu_xchg(pcp, val)
+#define raw_cpu_xchg_2(pcp, val)   __raw_cpu_xchg(pcp, val)
+#define raw_cpu_xchg_4(pcp, val)   __raw_cpu_xchg(pcp, val)
+#define raw_cpu_add_return_1(pcp, val) __raw_cpu_add_return(pcp, val)
+#define raw_cpu_add_return_2(pcp, val) __raw_cpu_add_return(pcp, val)
+#define raw_cpu_add_return_4(pcp, val) __raw_cpu_add_return(pcp, val)
+#define raw_cpu_add_return_8(pcp, val) __raw_cpu_add_return(pcp, val)
+#define raw_cpu

[RFC 1/7] compiler: Report x86 segment support

2019-07-18 Thread Nadav Amit
GCC v6+ supports x86 segment qualifiers (__seg_gs and __seg_fs). Define
COMPILER_HAS_X86_SEG_SUPPORT when they are supported.

Signed-off-by: Nadav Amit 
---
 include/linux/compiler-gcc.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
index e8579412ad21..787b5971e41f 100644
--- a/include/linux/compiler-gcc.h
+++ b/include/linux/compiler-gcc.h
@@ -149,6 +149,10 @@
 #define COMPILER_HAS_GENERIC_BUILTIN_OVERFLOW 1
 #endif
 
+#if GCC_VERSION >= 6
+#define COMPILER_HAS_X86_SEG_SUPPORT 1
+#endif
+
 /*
  * Turn individual warnings and errors on and off locally, depending
  * on version.
-- 
2.17.1



[RFC 0/7] x86/percpu: Use segment qualifiers

2019-07-18 Thread Nadav Amit
GCC 6+ supports segment qualifiers. Using them makes it possible to
implement several optimizations:

1. Avoid unnecessary instructions when an operation is carried out on a
per-cpu value that is read and written back, and instead allow the
compiler to emit instructions that access the per-cpu value directly.

2. Make this_cpu_ptr() more efficient and allow its value to be cached,
since preemption must be disabled when this_cpu_ptr() is used.

3. Provide a better alternative to this_cpu_read_stable() that caches
values more efficiently, using the alias attribute on a const variable.

4. Allow the compiler to perform other optimizations (e.g. CSE).

5. Use rip-relative addressing in per_cpu_read_stable(), which makes it
PIE-ready.

"size" and Peter's compare do not seem to show the impact of the code
size reduction correctly. Summing the code size according to nm on a
defconfig build shows a minor reduction from 11349763 to 11339840 bytes
(0.09%).
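
As a minimal user-space flavored illustration of the underlying
mechanism (GCC 6+ on x86-64 assumed; the offset and the GS base setup
are hypothetical, so this is only meant for inspecting the generated
code):

#include <stdint.h>

/* A pointer qualified with __seg_gs is dereferenced with a %gs
 * segment-override prefix, so no separate "add %gs:this_cpu_off"
 * instruction is needed and repeated accesses may be CSE'd. */
#define PCPU_PTR(type, off)	((__seg_gs type *)(uintptr_t)(off))

unsigned long read_counter(void)
{
	return *PCPU_PTR(unsigned long, 0x40);	/* hypothetical offset */
}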

Nadav Amit (7):
  compiler: Report x86 segment support
  x86/percpu: Use compiler segment prefix qualifier
  x86/percpu: Use C for percpu accesses when possible
  x86: Fix possible caching of current_task
  percpu: Assume preemption is disabled on per_cpu_ptr()
  x86/percpu: Optimized arch_raw_cpu_ptr()
  x86/current: Aggressive caching of current

 arch/x86/include/asm/current.h |  30 +++
 arch/x86/include/asm/fpu/internal.h|   7 +-
 arch/x86/include/asm/percpu.h  | 293 +++--
 arch/x86/include/asm/preempt.h |   3 +-
 arch/x86/include/asm/resctrl_sched.h   |  14 +-
 arch/x86/kernel/cpu/Makefile   |   1 +
 arch/x86/kernel/cpu/common.c   |   7 +-
 arch/x86/kernel/cpu/current.c  |  16 ++
 arch/x86/kernel/cpu/resctrl/rdtgroup.c |   4 +-
 arch/x86/kernel/process_32.c   |   4 +-
 arch/x86/kernel/process_64.c   |   4 +-
 include/asm-generic/percpu.h   |  12 +
 include/linux/compiler-gcc.h   |   4 +
 include/linux/compiler.h   |   2 +-
 include/linux/percpu-defs.h|  33 ++-
 15 files changed, 346 insertions(+), 88 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/current.c

-- 
2.17.1



[RFC 2/7] x86/percpu: Use compiler segment prefix qualifier

2019-07-18 Thread Nadav Amit
Using a segment prefix qualifier is cleaner than using a segment prefix
in the inline assembly, and provides the compiler with more information,
telling it that __seg_gs:[addr] is different from [addr] when it
analyzes data dependencies. It also enables various optimizations that
will be implemented in the next patches.

Use segment prefix qualifiers when they are supported. Unfortunately,
gcc does not provide a way to remove segment qualifiers, which is needed
in order to use typeof() to create local instances of a per-cpu
variable. For this reason, do not use the segment qualifier for per-cpu
variables, and do the casting using the segment qualifier instead.
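
A small sketch of the data-dependency point (GCC 6+ assumed; the
constant offset is hypothetical and the snippet is only meant for
inspecting the generated code):

#include <stdint.h>

int aliasing_example(int *q)
{
	__seg_gs int *p = (__seg_gs int *)(uintptr_t)0x40;

	*p = 1;
	*q = 2;
	/* Since __seg_gs is a separate address space, the compiler may
	 * assume *q cannot alias *p and fold this load to 1, which it
	 * cannot do when the %gs access is hidden in inline assembly. */
	return *p;
}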

Signed-off-by: Nadav Amit 
---
 arch/x86/include/asm/percpu.h | 153 ++
 1 file changed, 102 insertions(+), 51 deletions(-)

diff --git a/arch/x86/include/asm/percpu.h b/arch/x86/include/asm/percpu.h
index 2278797c769d..1fe348884477 100644
--- a/arch/x86/include/asm/percpu.h
+++ b/arch/x86/include/asm/percpu.h
@@ -45,8 +45,33 @@
 #include 
 #include 
 
+#if defined(COMPILER_HAS_X86_SEG_SUPPORT) && IS_ENABLED(CONFIG_SMP)
+#define USE_X86_SEG_SUPPORT1
+#else
+#define USE_X86_SEG_SUPPORT0
+#endif
+
 #ifdef CONFIG_SMP
+
+#if USE_X86_SEG_SUPPORT
+
+#ifdef CONFIG_X86_64
+#define __percpu_seg_override  __seg_gs
+#else
+#define __percpu_seg_override  __seg_fs
+#endif
+
+#define __percpu_prefix""
+#define __force_percpu_prefix  "%%"__stringify(__percpu_seg)":"
+
+#else /* USE_X86_SEG_SUPPORT */
+
+#define __percpu_seg_override
 #define __percpu_prefix"%%"__stringify(__percpu_seg)":"
+#define __force_percpu_prefix  __percpu_prefix
+
+#endif /* USE_X86_SEG_SUPPORT */
+
 #define __my_cpu_offsetthis_cpu_read(this_cpu_off)
 
 /*
@@ -58,14 +83,21 @@
unsigned long tcp_ptr__;\
asm volatile("add " __percpu_arg(1) ", %0"  \
 : "=r" (tcp_ptr__) \
-: "m" (this_cpu_off), "0" (ptr));  \
+: "m" (__my_cpu_var(this_cpu_off)),\
+  "0" (ptr));  \
(typeof(*(ptr)) __kernel __force *)tcp_ptr__;   \
 })
-#else
+#else /* CONFIG_SMP */
+#define __percpu_seg_override
 #define __percpu_prefix""
-#endif
+#define __force_percpu_prefix  ""
+#endif /* CONFIG_SMP */
 
+#define __my_cpu_type(var) typeof(var) __percpu_seg_override
+#define __my_cpu_ptr(ptr)  (__my_cpu_type(*ptr) *)(uintptr_t)(ptr)
+#define __my_cpu_var(var)  (*__my_cpu_ptr(&var))
 #define __percpu_arg(x)__percpu_prefix "%" #x
+#define __force_percpu_arg(x)  __force_percpu_prefix "%" #x
 
 /*
  * Initialized pointers to per-cpu variables needed for the boot
@@ -98,22 +130,22 @@ do {   
\
switch (sizeof(var)) {  \
case 1: \
asm qual (op "b %1,"__percpu_arg(0) \
-   : "+m" (var)\
+   : "+m" (__my_cpu_var(var))  \
: "qi" ((pto_T__)(val)));   \
break;  \
case 2: \
asm qual (op "w %1,"__percpu_arg(0) \
-   : "+m" (var)\
+   : "+m" (__my_cpu_var(var))  \
: "ri" ((pto_T__)(val)));   \
break;  \
case 4: \
asm qual (op "l %1,"__percpu_arg(0) \
-   : "+m" (var)\
+   : "+m" (__my_cpu_var(var))  \
: "ri" ((pto_T__)(val)));   \
break;  \
case 8: \
asm qual (op "q %1,"__percpu_arg(0) \
-   : "+m" (var)\
+   : "+m" (__my_cpu_var(var))  \
: "re" ((pto_T__)(val)));   \
break;  \
default: __bad_percpu_size();   \
@@ -138,71 +170,79 @@ do {  
\
switch (sizeof(var)) {  \
case 1: \
   

[RFC 5/7] percpu: Assume preemption is disabled on per_cpu_ptr()

2019-07-18 Thread Nadav Amit
When per_cpu_ptr() is used, the caller should have preemption disabled,
as otherwise the pointer is meaningless. If the user wants an "unstable"
pointer, raw_cpu_ptr() should be used instead.

Add an assertion to check that preemption is indeed disabled, and
distinguish between the two cases to allow further, per-arch
optimizations.
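
A rough kernel-style sketch of the intended contract (illustration
only; example_counter is a made-up per-cpu variable):

DEFINE_PER_CPU(int, example_counter);

static void stable_user(void)
{
	int *p;

	preempt_disable();
	p = this_cpu_ptr(&example_counter);	/* assertion is satisfied */
	(*p)++;
	preempt_enable();
}

static void unstable_user(void)
{
	/* May refer to another CPU's copy by the time it is used. */
	int *p = raw_cpu_ptr(&example_counter);

	(void)READ_ONCE(*p);
}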

Signed-off-by: Nadav Amit 
---
 include/asm-generic/percpu.h | 12 
 include/linux/percpu-defs.h  | 33 -
 2 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/percpu.h b/include/asm-generic/percpu.h
index c2de013b2cf4..7853605f4210 100644
--- a/include/asm-generic/percpu.h
+++ b/include/asm-generic/percpu.h
@@ -36,6 +36,14 @@ extern unsigned long __per_cpu_offset[NR_CPUS];
 #define my_cpu_offset __my_cpu_offset
 #endif
 
+/*
+ * Determine the offset of the current active processor when preemption is
+ * disabled. Can be overriden by arch code.
+ */
+#ifndef __raw_my_cpu_offset
+#define __raw_my_cpu_offset __my_cpu_offset
+#endif
+
 /*
  * Arch may define arch_raw_cpu_ptr() to provide more efficient address
  * translations for raw_cpu_ptr().
@@ -44,6 +52,10 @@ extern unsigned long __per_cpu_offset[NR_CPUS];
 #define arch_raw_cpu_ptr(ptr) SHIFT_PERCPU_PTR(ptr, __my_cpu_offset)
 #endif
 
+#ifndef arch_raw_cpu_ptr_preemptable
+#define arch_raw_cpu_ptr_preemptable(ptr) SHIFT_PERCPU_PTR(ptr, 
__raw_my_cpu_offset)
+#endif
+
 #ifdef CONFIG_HAVE_SETUP_PER_CPU_AREA
 extern void setup_per_cpu_areas(void);
 #endif
diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h
index a6fabd865211..13afca8a37e7 100644
--- a/include/linux/percpu-defs.h
+++ b/include/linux/percpu-defs.h
@@ -237,20 +237,51 @@ do {  
\
SHIFT_PERCPU_PTR((ptr), per_cpu_offset((cpu))); \
 })
 
+#ifndef arch_raw_cpu_ptr_preemption_disabled
+#define arch_raw_cpu_ptr_preemption_disabled(ptr)  \
+   arch_raw_cpu_ptr(ptr)
+#endif
+
+#define raw_cpu_ptr_preemption_disabled(ptr)   \
+({ \
+   __verify_pcpu_ptr(ptr); \
+   arch_raw_cpu_ptr_preemption_disabled(ptr);  \
+})
+
+/*
+ * If preemption is enabled, we need to read the pointer atomically on
+ * raw_cpu_ptr(). However, if preemption is disabled, we can use
+ * raw_cpu_ptr_preemption_disabled(), which is potentially more efficient.
+ * Similarly, we can use the preemption-disabled version if the kernel is
+ * non-preemptible or if voluntary preemption is used.
+ */
+#ifdef CONFIG_PREEMPT
+
 #define raw_cpu_ptr(ptr)   \
 ({ \
__verify_pcpu_ptr(ptr); \
arch_raw_cpu_ptr(ptr);  \
 })
 
+#else
+
+#define raw_cpu_ptr(ptr)   raw_cpu_ptr_preemption_disabled(ptr)
+
+#endif
+
 #ifdef CONFIG_DEBUG_PREEMPT
+/*
+ * Unlike other this_cpu_* operations, this_cpu_ptr() requires that preemption
+ * be disabled. In contrast, raw_cpu_ptr() does not require that.
+ */
 #define this_cpu_ptr(ptr)  \
 ({ \
+   __this_cpu_preempt_check("ptr");\
__verify_pcpu_ptr(ptr); \
SHIFT_PERCPU_PTR(ptr, my_cpu_offset);   \
 })
 #else
-#define this_cpu_ptr(ptr) raw_cpu_ptr(ptr)
+#define this_cpu_ptr(ptr) raw_cpu_ptr_preemption_disabled(ptr)
 #endif
 
 #else  /* CONFIG_SMP */
-- 
2.17.1



[RFC 7/7] x86/current: Aggressive caching of current

2019-07-18 Thread Nadav Amit
The current_task is supposed to be constant in each thread and therefore
does not need to be reread. There is already an attempt to cache it
using inline assembly, using this_cpu_read_stable(), which hides the
dependency on the read memory address.

However, this caching is not working very well. For example,
sync_mm_rss() still reads current_task twice for no reason.

Allow more aggressive caching by aliasing current_task into a constant,
const_current_task, and reading from the constant copy. Doing so requires
the compiler to support x86 segment qualifiers. Hide const_current_task in
a different compilation unit to prevent the compiler from assuming that the
value is a compile-time constant.

Signed-off-by: Nadav Amit 
---
 arch/x86/include/asm/current.h | 30 ++
 arch/x86/kernel/cpu/Makefile   |  1 +
 arch/x86/kernel/cpu/common.c   |  7 +--
 arch/x86/kernel/cpu/current.c  | 16 
 include/linux/compiler.h   |  2 +-
 5 files changed, 49 insertions(+), 7 deletions(-)
 create mode 100644 arch/x86/kernel/cpu/current.c
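
For illustration only (the function below is a made-up placeholder, not part
of the patch): the point is that repeated uses of current within one function
should compile to a single segment-based load that the compiler can keep in a
register, instead of one read per use as happens today:

	#include <linux/sched.h>

	static void touch_current_twice(void)
	{
		/*
		 * Both uses below read the same per-thread constant. With the
		 * const_current_task alias the compiler may load it once and
		 * reuse it; with this_cpu_read_stable() it is often re-read.
		 */
		if (!current->mm)
			return;

		current->nvcsw++;
	}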

diff --git a/arch/x86/include/asm/current.h b/arch/x86/include/asm/current.h
index 3e204e6140b5..7f093e81a647 100644
--- a/arch/x86/include/asm/current.h
+++ b/arch/x86/include/asm/current.h
@@ -10,11 +10,41 @@ struct task_struct;
 
 DECLARE_PER_CPU(struct task_struct *, current_task);
 
+#if USE_X86_SEG_SUPPORT
+
+/*
+ * Hold a constant alias for current_task, which allows the compiler to avoid
+ * re-reading the current task.
+ *
+ * We must mark const_current_task with the segment qualifiers, as otherwise
+ * gcc would do redundant reads of const_current_task.
+ */
+DECLARE_PER_CPU(struct task_struct * const __percpu_seg_override, 
const_current_task);
+
+static __always_inline struct task_struct *get_current(void)
+{
+
+   /*
+* GCC lacks the ability to strip segment qualifiers, which interferes
+* with the per-cpu infrastructure that holds local copies. Use
+* __raw_cpu_read() to avoid holding any copy.
+*/
+   return __raw_cpu_read(, const_current_task);
+}
+
+#else /* USE_X86_SEG_SUPPORT */
+
+/*
+ * Without segment qualifier support, the per-cpu infrastructure is not
+ * suitable for reading constants, so use this_cpu_read_stable() in this case.
+ */
 static __always_inline struct task_struct *get_current(void)
 {
return this_cpu_read_stable(current_task);
 }
 
+#endif /* USE_X86_SEG_SUPPORT */
+
 #define current get_current()
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/cpu/Makefile b/arch/x86/kernel/cpu/Makefile
index d7a1e5a9331c..d816f03a37d7 100644
--- a/arch/x86/kernel/cpu/Makefile
+++ b/arch/x86/kernel/cpu/Makefile
@@ -19,6 +19,7 @@ CFLAGS_common.o   := $(nostackp)
 
 obj-y  := cacheinfo.o scattered.o topology.o
 obj-y  += common.o
+obj-y  += current.o
 obj-y  += rdrand.o
 obj-y  += match.o
 obj-y  += bugs.o
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index e0489d2860d3..33a6b51e8059 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1607,13 +1607,8 @@ DEFINE_PER_CPU_FIRST(struct fixed_percpu_data,
 EXPORT_PER_CPU_SYMBOL_GPL(fixed_percpu_data);
 
 /*
- * The following percpu variables are hot.  Align current_task to
- * cacheline size such that they fall in the same cacheline.
+ * The following percpu variables are hot.
  */
-DEFINE_PER_CPU(struct task_struct *, current_task) cacheline_aligned =
-   _task;
-EXPORT_PER_CPU_SYMBOL(current_task);
-
 DEFINE_PER_CPU(struct irq_stack *, hardirq_stack_ptr);
 DEFINE_PER_CPU(unsigned int, irq_count) __visible = -1;
 
diff --git a/arch/x86/kernel/cpu/current.c b/arch/x86/kernel/cpu/current.c
new file mode 100644
index ..3238c6e34984
--- /dev/null
+++ b/arch/x86/kernel/cpu/current.c
@@ -0,0 +1,16 @@
+#include 
+#include 
+
+/*
+ * Align current_task to the cacheline size so that it does not share a
+ * cacheline with other hot per-cpu variables.
+ */
+DEFINE_PER_CPU(struct task_struct *, current_task) cacheline_aligned =
+   _task;
+EXPORT_PER_CPU_SYMBOL(current_task);
+
+#if USE_X86_SEG_SUPPORT
+DECLARE_PER_CPU(struct task_struct * const __percpu_seg_override, 
const_current_task)
+   __attribute__((alias("current_task")));
+EXPORT_PER_CPU_SYMBOL(const_current_task);
+#endif
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index f0fd5636fddb..1b6ee9ab6373 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -299,7 +299,7 @@ unsigned long read_word_at_a_time(const void *addr)
  */
 #define __ADDRESSABLE(sym) \
static void * __section(".discard.addressable") __used \
-   __PASTE(__addressable_##sym, __LINE__) = (void *)
+   __PASTE(__addressable_##sym, __LINE__) = (void 
*)(uintptr_t)
 
 /**
  * offset_to_ptr - convert a relative memory offset to an absolute pointer
-- 
2.17.1



[PATCH v3 2/9] x86/mm/tlb: Remove reason as argument for flush_tlb_func_local()

2019-07-18 Thread Nadav Amit
To use flush_tlb_func_local() as an argument to
__smp_call_function_many() we need it to have a single (void *)
parameter. Eliminate the second parameter and deduce the reason for the
flush.

Cc: Peter Zijlstra 
Cc: Dave Hansen 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Cc: Andy Lutomirski 
Cc: Josh Poimboeuf 
Signed-off-by: Nadav Amit 
---
 arch/x86/mm/tlb.c | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 4de9704c4aaf..233f3d8308db 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -635,9 +635,12 @@ static void flush_tlb_func_common(const struct 
flush_tlb_info *f,
this_cpu_write(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen, mm_tlb_gen);
 }
 
-static void flush_tlb_func_local(const void *info, enum tlb_flush_reason 
reason)
+static void flush_tlb_func_local(void *info)
 {
const struct flush_tlb_info *f = info;
+   enum tlb_flush_reason reason;
+
+   reason = (f->mm == NULL) ? TLB_LOCAL_SHOOTDOWN : TLB_LOCAL_MM_SHOOTDOWN;
 
flush_tlb_func_common(f, true, reason);
 }
@@ -790,7 +793,7 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long 
start,
if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
lockdep_assert_irqs_enabled();
local_irq_disable();
-   flush_tlb_func_local(info, TLB_LOCAL_MM_SHOOTDOWN);
+   flush_tlb_func_local(info);
local_irq_enable();
}
 
@@ -862,7 +865,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch 
*batch)
if (cpumask_test_cpu(cpu, >cpumask)) {
lockdep_assert_irqs_enabled();
local_irq_disable();
-   flush_tlb_func_local(_flush_tlb_info, TLB_LOCAL_SHOOTDOWN);
+   flush_tlb_func_local(_flush_tlb_info);
local_irq_enable();
}
 
-- 
2.20.1



[PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-07-18 Thread Nadav Amit
To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. Introduce
paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
and hyper-v are only compile-tested).

While the updated smp infrastructure is capable of running a function on
a single local core, it is not optimized for this case. The multiple
function calls and the indirect branch introduce some overhead, and
might make local TLB flushes slower than they were before the recent
changes.

Before calling the SMP infrastructure, check if only a local TLB flush is
needed to restore the lost performance in this common case. This requires
checking mm_cpumask() one more time, but unless this mask is updated very
frequently, it should not impact performance negatively.

Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Sasha Levin 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: Juergen Gross 
Cc: Paolo Bonzini 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Boris Ostrovsky 
Cc: linux-hyp...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: k...@vger.kernel.org
Cc: xen-de...@lists.xenproject.org
Signed-off-by: Nadav Amit 
---
 arch/x86/hyperv/mmu.c | 10 +++---
 arch/x86/include/asm/paravirt.h   |  6 ++--
 arch/x86/include/asm/paravirt_types.h |  4 +--
 arch/x86/include/asm/tlbflush.h   |  8 ++---
 arch/x86/include/asm/trace/hyperv.h   |  2 +-
 arch/x86/kernel/kvm.c | 11 +--
 arch/x86/kernel/paravirt.c|  2 +-
 arch/x86/mm/tlb.c | 47 ++-
 arch/x86/xen/mmu_pv.c | 11 +++
 include/trace/events/xen.h|  2 +-
 10 files changed, 62 insertions(+), 41 deletions(-)
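
A rough sketch of the resulting structure (simplified; PCID/lazy handling,
tracing and the paravirt dispatch are omitted, so this is not the actual
patch code):

	static void flush_tlb_multi_sketch(const struct cpumask *cpumask,
					   const struct flush_tlb_info *info)
	{
		unsigned int this_cpu = smp_processor_id();	/* caller disabled preemption */
		bool local = cpumask_test_cpu(this_cpu, cpumask);

		/* Common case: no remote CPU needs to be flushed. */
		if (cpumask_any_but(cpumask, this_cpu) >= nr_cpu_ids) {
			if (local) {
				local_irq_disable();
				flush_tlb_func_local((void *)info);
				local_irq_enable();
			}
			return;
		}

		/*
		 * Send the IPIs first, then run the local flush while waiting
		 * for the remote CPUs to acknowledge.
		 */
		__smp_call_function_many(cpumask, flush_tlb_func_remote,
					 local ? flush_tlb_func_local : NULL,
					 (void *)info, true);
	}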

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index e65d7fe6489f..8740d8b21db3 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -50,8 +50,8 @@ static inline int fill_gva_list(u64 gva_list[], int offset,
return gva_n - offset;
 }
 
-static void hyperv_flush_tlb_others(const struct cpumask *cpus,
-   const struct flush_tlb_info *info)
+static void hyperv_flush_tlb_multi(const struct cpumask *cpus,
+  const struct flush_tlb_info *info)
 {
int cpu, vcpu, gva_n, max_gvas;
struct hv_tlb_flush **flush_pcpu;
@@ -59,7 +59,7 @@ static void hyperv_flush_tlb_others(const struct cpumask 
*cpus,
u64 status = U64_MAX;
unsigned long flags;
 
-   trace_hyperv_mmu_flush_tlb_others(cpus, info);
+   trace_hyperv_mmu_flush_tlb_multi(cpus, info);
 
if (!hv_hypercall_pg)
goto do_native;
@@ -156,7 +156,7 @@ static void hyperv_flush_tlb_others(const struct cpumask 
*cpus,
if (!(status & HV_HYPERCALL_RESULT_MASK))
return;
 do_native:
-   native_flush_tlb_others(cpus, info);
+   native_flush_tlb_multi(cpus, info);
 }
 
 static u64 hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
@@ -231,6 +231,6 @@ void hyperv_setup_mmu_ops(void)
return;
 
pr_info("Using hypercall for remote TLB flush\n");
-   pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
+   pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index dce26f1d13e1..8c6c2394393b 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -62,10 +62,10 @@ static inline void __flush_tlb_one_user(unsigned long addr)
PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
 }
 
-static inline void flush_tlb_others(const struct cpumask *cpumask,
-   const struct flush_tlb_info *info)
+static inline void flush_tlb_multi(const struct cpumask *cpumask,
+  const struct flush_tlb_info *info)
 {
-   PVOP_VCALL2(mmu.flush_tlb_others, cpumask, info);
+   PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
 }
 
 static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb, void 
*table)
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 639b2df445ee..c82969f38845 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -211,8 +211,8 @@ struct pv_mmu_ops {
void (*flush_tlb_user)(void);
void (*flush_tlb_kernel)(void);
void (*flush_tlb_one_user)(unsigned long addr);
-   void (*flush_tlb_others)(const struct cpumask *cpus,
-const struct flush_tlb_info *info);
+   void (*flush_tlb_multi)(const struct cpumask *cpus,
+   const struct flush_tlb_info *info);
 
void (*tlb_remove_table)(struct mmu_

[PATCH v3 9/9] x86/mm/tlb: Remove unnecessary uses of the inline keyword

2019-07-18 Thread Nadav Amit
The compiler is smart enough without these hints.

Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Suggested-by: Dave Hansen 
Signed-off-by: Nadav Amit 
---
 arch/x86/mm/tlb.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 40daad52ec7d..2ddc32e787f3 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -189,7 +189,7 @@ static void sync_current_stack_to_mm(struct mm_struct *mm)
}
 }
 
-static inline unsigned long mm_mangle_tif_spec_ib(struct task_struct *next)
+static unsigned long mm_mangle_tif_spec_ib(struct task_struct *next)
 {
unsigned long next_tif = task_thread_info(next)->flags;
unsigned long ibpb = (next_tif >> TIF_SPEC_IB) & LAST_USER_MM_IBPB;
@@ -748,7 +748,7 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct flush_tlb_info, 
flush_tlb_info);
 static DEFINE_PER_CPU(unsigned int, flush_tlb_info_idx);
 #endif
 
-static inline struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
+static struct flush_tlb_info *get_flush_tlb_info(struct mm_struct *mm,
unsigned long start, unsigned long end,
unsigned int stride_shift, bool freed_tables,
u64 new_tlb_gen)
@@ -774,7 +774,7 @@ static inline struct flush_tlb_info 
*get_flush_tlb_info(struct mm_struct *mm,
return info;
 }
 
-static inline void put_flush_tlb_info(void)
+static void put_flush_tlb_info(void)
 {
 #ifdef CONFIG_DEBUG_VM
/* Complete reentrency prevention checks */
-- 
2.20.1



[PATCH v3 5/9] x86/mm/tlb: Privatize cpu_tlbstate

2019-07-18 Thread Nadav Amit
cpu_tlbstate is mostly private and only the variable is_lazy is shared.
This causes some false-sharing when TLB flushes are performed.

Break cpu_tlbstate into cpu_tlbstate and cpu_tlbstate_shared, and mark
each one accordingly.

Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Signed-off-by: Nadav Amit 
---
 arch/x86/include/asm/tlbflush.h | 39 ++---
 arch/x86/mm/init.c  |  2 +-
 arch/x86/mm/tlb.c   | 15 -
 3 files changed, 31 insertions(+), 25 deletions(-)
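
The general pattern, as a standalone illustration (placeholder names, not the
patch itself): state that only the owning CPU touches stays in one per-CPU
object, and the flag that remote CPUs poll moves to another, so that frequent
local writes no longer bounce the cacheline the remote readers use:

	#include <linux/percpu-defs.h>

	struct example_private {		/* written on every switch/flush */
		unsigned long loaded_gen;
		unsigned short loaded_asid;
	};

	struct example_shared {			/* read by remote CPUs */
		bool is_lazy;
	};

	DEFINE_PER_CPU_ALIGNED(struct example_private, example_private);
	DEFINE_PER_CPU_SHARED_ALIGNED(struct example_shared, example_shared);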

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 610e47dc66ef..8490591d3cdf 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -178,23 +178,6 @@ struct tlb_state {
u16 loaded_mm_asid;
u16 next_asid;
 
-   /*
-* We can be in one of several states:
-*
-*  - Actively using an mm.  Our CPU's bit will be set in
-*mm_cpumask(loaded_mm) and is_lazy == false;
-*
-*  - Not using a real mm.  loaded_mm == _mm.  Our CPU's bit
-*will not be set in mm_cpumask(_mm) and is_lazy == false.
-*
-*  - Lazily using a real mm.  loaded_mm != _mm, our bit
-*is set in mm_cpumask(loaded_mm), but is_lazy == true.
-*We're heuristically guessing that the CR3 load we
-*skipped more than makes up for the overhead added by
-*lazy mode.
-*/
-   bool is_lazy;
-
/*
 * If set we changed the page tables in such a way that we
 * needed an invalidation of all contexts (aka. PCIDs / ASIDs).
@@ -240,7 +223,27 @@ struct tlb_state {
 */
struct tlb_context ctxs[TLB_NR_DYN_ASIDS];
 };
-DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);
+DECLARE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate);
+
+struct tlb_state_shared {
+   /*
+* We can be in one of several states:
+*
+*  - Actively using an mm.  Our CPU's bit will be set in
+*mm_cpumask(loaded_mm) and is_lazy == false;
+*
+*  - Not using a real mm.  loaded_mm == _mm.  Our CPU's bit
+*will not be set in mm_cpumask(_mm) and is_lazy == false.
+*
+*  - Lazily using a real mm.  loaded_mm != _mm, our bit
+*is set in mm_cpumask(loaded_mm), but is_lazy == true.
+*We're heuristically guessing that the CR3 load we
+*skipped more than makes up for the overhead added by
+*lazy mode.
+*/
+   bool is_lazy;
+};
+DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state_shared, cpu_tlbstate_shared);
 
 /*
  * Blindly accessing user memory from NMI context can be dangerous
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index fd10d91a6115..34027f36a944 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -951,7 +951,7 @@ void __init zone_sizes_init(void)
free_area_init_nodes(max_zone_pfns);
 }
 
-__visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
+__visible DEFINE_PER_CPU_ALIGNED(struct tlb_state, cpu_tlbstate) = {
.loaded_mm = _mm,
.next_asid = 1,
.cr4 = ~0UL,/* fail hard if we screw up cr4 shadow initialization */
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 63c00908bdd9..af80c274c88d 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -145,7 +145,7 @@ void leave_mm(int cpu)
return;
 
/* Warn if we're not lazy. */
-   WARN_ON(!this_cpu_read(cpu_tlbstate.is_lazy));
+   WARN_ON(!this_cpu_read(cpu_tlbstate_shared.is_lazy));
 
switch_mm(NULL, _mm, NULL);
 }
@@ -277,7 +277,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
 {
struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
u16 prev_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
-   bool was_lazy = this_cpu_read(cpu_tlbstate.is_lazy);
+   bool was_lazy = this_cpu_read(cpu_tlbstate_shared.is_lazy);
unsigned cpu = smp_processor_id();
u64 next_tlb_gen;
bool need_flush;
@@ -322,7 +322,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
__flush_tlb_all();
}
 #endif
-   this_cpu_write(cpu_tlbstate.is_lazy, false);
+   this_cpu_write(cpu_tlbstate_shared.is_lazy, false);
 
/*
 * The membarrier system call requires a full memory barrier and
@@ -463,7 +463,7 @@ void enter_lazy_tlb(struct mm_struct *mm, struct 
task_struct *tsk)
if (this_cpu_read(cpu_tlbstate.loaded_mm) == _mm)
return;
 
-   this_cpu_write(cpu_tlbstate.is_lazy, true);
+   this_cpu_write(cpu_tlbstate_shared.is_lazy, true);
 }
 
 /*
@@ -544,7 +544,7 @@ static void flush_tlb_func_common(const struct 
flush_tlb_info *f,
VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].ctx_id) !=
   loaded_mm->context.ctx

[PATCH v3 3/9] x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()

2019-07-18 Thread Nadav Amit
Open-code on_each_cpu_cond_mask() in native_flush_tlb_others() to
optimize the code. Open-coding eliminates the need for the indirect branch
that is used to call is_lazy(), and in CPUs that are vulnerable to
Spectre v2, it eliminates the retpoline. In addition, it allows using a
preallocated cpumask to compute the CPUs that should be flushed.

This will later allow us to avoid adapting on_each_cpu_cond_mask() to
support separate local and remote functions.

Note that calling tlb_is_not_lazy() for every CPU that needs to be
flushed, as done in native_flush_tlb_multi() might look ugly, but it is
equivalent to what is currently done in on_each_cpu_cond_mask().
Actually, native_flush_tlb_multi() does it more efficiently since it
avoids using an indirect branch for the matter.

Cc: Peter Zijlstra 
Cc: Dave Hansen 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Cc: Andy Lutomirski 
Cc: Josh Poimboeuf 
Signed-off-by: Nadav Amit 
---
 arch/x86/mm/tlb.c | 40 ++--
 1 file changed, 34 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 233f3d8308db..abbf55fa8b81 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -658,11 +658,13 @@ static void flush_tlb_func_remote(void *info)
flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
 }
 
-static bool tlb_is_not_lazy(int cpu, void *data)
+static bool tlb_is_not_lazy(int cpu)
 {
return !per_cpu(cpu_tlbstate.is_lazy, cpu);
 }
 
+static DEFINE_PER_CPU(cpumask_t, flush_tlb_mask);
+
 void native_flush_tlb_others(const struct cpumask *cpumask,
 const struct flush_tlb_info *info)
 {
@@ -706,12 +708,38 @@ void native_flush_tlb_others(const struct cpumask 
*cpumask,
 * up on the new contents of what used to be page tables, while
 * doing a speculative memory access.
 */
-   if (info->freed_tables)
+   if (info->freed_tables) {
smp_call_function_many(cpumask, flush_tlb_func_remote,
   (void *)info, 1);
-   else
-   on_each_cpu_cond_mask(tlb_is_not_lazy, flush_tlb_func_remote,
-   (void *)info, 1, GFP_ATOMIC, cpumask);
+   } else {
+   /*
+* Although we could have used on_each_cpu_cond_mask(),
+* open-coding it has performance advantages, as it eliminates
+* the need for indirect calls or retpolines. In addition, it
+* allows to use a designated cpumask for evaluating the
+* condition, instead of allocating one.
+*
+* This code works under the assumption that there are no nested
+* TLB flushes, an assumption that is already made in
+* flush_tlb_mm_range().
+*
+* cond_cpumask is logically a stack-local variable, but it is
+* more efficient to have it off the stack and not to allocate
+* it on demand. Preemption is disabled and this code is
+* non-reentrant.
+*/
+   struct cpumask *cond_cpumask = this_cpu_ptr(_tlb_mask);
+   int cpu;
+
+   cpumask_clear(cond_cpumask);
+
+   for_each_cpu(cpu, cpumask) {
+   if (tlb_is_not_lazy(cpu))
+   __cpumask_set_cpu(cpu, cond_cpumask);
+   }
+   smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
+(void *)info, 1);
+   }
 }
 
 /*
@@ -865,7 +893,7 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch 
*batch)
if (cpumask_test_cpu(cpu, >cpumask)) {
lockdep_assert_irqs_enabled();
local_irq_disable();
-   flush_tlb_func_local(_flush_tlb_info);
+   flush_tlb_func_local((void *)_flush_tlb_info);
local_irq_enable();
}
 
-- 
2.20.1



[PATCH v3 8/9] x86/mm/tlb: Remove UV special case

2019-07-18 Thread Nadav Amit
SGI UV support is outdated and not maintained, and it is not clear how it
performs relative to non-UV systems. Remove the special case to simplify
the code.

Cc: Peter Zijlstra 
Cc: Dave Hansen 
Acked-by: Mike Travis 
Suggested-by: Andy Lutomirski 
Signed-off-by: Nadav Amit 
---
 arch/x86/mm/tlb.c | 25 -
 1 file changed, 25 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 89f83ad19507..40daad52ec7d 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -684,31 +684,6 @@ void native_flush_tlb_multi(const struct cpumask *cpumask,
trace_tlb_flush(TLB_REMOTE_SEND_IPI,
(info->end - info->start) >> PAGE_SHIFT);
 
-   if (is_uv_system()) {
-   /*
-* This whole special case is confused.  UV has a "Broadcast
-* Assist Unit", which seems to be a fancy way to send IPIs.
-* Back when x86 used an explicit TLB flush IPI, UV was
-* optimized to use its own mechanism.  These days, x86 uses
-* smp_call_function_many(), but UV still uses a manual IPI,
-* and that IPI's action is out of date -- it does a manual
-* flush instead of calling flush_tlb_func_remote().  This
-* means that the percpu tlb_gen variables won't be updated
-* and we'll do pointless flushes on future context switches.
-*
-* Rather than hooking native_flush_tlb_multi() here, I think
-* that UV should be updated so that smp_call_function_many(),
-* etc, are optimal on UV.
-*/
-   flush_tlb_func_local((void *)info);
-
-   cpumask = uv_flush_tlb_others(cpumask, info);
-   if (cpumask)
-   smp_call_function_many(cpumask, flush_tlb_func_remote,
-  (void *)info, 1);
-   return;
-   }
-
/*
 * If no page tables were freed, we can skip sending IPIs to
 * CPUs in lazy TLB mode. They will flush the CPU themselves
-- 
2.20.1



[PATCH v3 6/9] x86/mm/tlb: Do not make is_lazy dirty for no reason

2019-07-18 Thread Nadav Amit
Blindly writing to is_lazy when the written value is identical to the old
value makes the cacheline dirty for no reason. Avoid such writes to prevent
unnecessary cache-coherency traffic.

Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Suggested-by: Dave Hansen 
Signed-off-by: Nadav Amit 
---
 arch/x86/mm/tlb.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index af80c274c88d..89f83ad19507 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -322,7 +322,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
__flush_tlb_all();
}
 #endif
-   this_cpu_write(cpu_tlbstate_shared.is_lazy, false);
+   if (was_lazy)
+   this_cpu_write(cpu_tlbstate_shared.is_lazy, false);
 
/*
 * The membarrier system call requires a full memory barrier and
-- 
2.20.1



[PATCH v3 7/9] cpumask: Mark functions as pure

2019-07-18 Thread Nadav Amit
cpumask_next_and() and cpumask_any_but() are pure, and marking them as
such seems to generate different and presumably better code for
native_flush_tlb_multi().

Signed-off-by: Nadav Amit 
---
 include/linux/cpumask.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
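
As an illustration of what the attribute permits (hypothetical caller, not
from the patch): a __pure function has no side effects and its result depends
only on its arguments and on memory it reads, so the compiler may merge calls
with identical arguments and keep values cached across the calls, e.g. in a
mask-walking loop:

	static unsigned int count_common_cpus(const struct cpumask *a,
					      const struct cpumask *b)
	{
		unsigned int n = 0;
		int cpu;

		for (cpu = cpumask_next_and(-1, a, b);
		     cpu < nr_cpu_ids;
		     cpu = cpumask_next_and(cpu, a, b))
			n++;

		return n;
	}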

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index 0c7db5efe66c..6d57e6372b9d 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -211,7 +211,7 @@ static inline unsigned int cpumask_last(const struct 
cpumask *srcp)
return find_last_bit(cpumask_bits(srcp), nr_cpumask_bits);
 }
 
-unsigned int cpumask_next(int n, const struct cpumask *srcp);
+unsigned int __pure cpumask_next(int n, const struct cpumask *srcp);
 
 /**
  * cpumask_next_zero - get the next unset cpu in a cpumask
@@ -228,8 +228,8 @@ static inline unsigned int cpumask_next_zero(int n, const 
struct cpumask *srcp)
return find_next_zero_bit(cpumask_bits(srcp), nr_cpumask_bits, n+1);
 }
 
-int cpumask_next_and(int n, const struct cpumask *, const struct cpumask *);
-int cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
+int __pure cpumask_next_and(int n, const struct cpumask *, const struct 
cpumask *);
+int __pure cpumask_any_but(const struct cpumask *mask, unsigned int cpu);
 unsigned int cpumask_local_spread(unsigned int i, int node);
 
 /**
-- 
2.20.1



[PATCH v3 0/9] x86: Concurrent TLB flushes

2019-07-18 Thread Nadav Amit
[ Cover-letter is identical to v2, including benchmark results,
  excluding the change log. ] 

Currently, local and remote TLB flushes are not performed concurrently,
which introduces unnecessary overhead - each INVLPG can take 100s of
cycles. This patch-set allows TLB flushes to be run concurrently: first
request the remote CPUs to initiate the flush, then run it locally, and
finally wait for the remote CPUs to finish their work.

In addition, there are various small optimizations to avoid unwarranted
false-sharing and atomic operations.

The proposed changes should also improve the performance of other
invocations of on_each_cpu(). Hopefully, no one has relied on this
behavior of on_each_cpu() that invoked functions first remotely and only
then locally [Peter says he remembers someone might do so, but without
further information it is hard to know how to address it].

Running sysbench on dax/ext4 w/emulated-pmem, write-cache disabled on
2-socket, 48-logical-cores (24+SMT) Haswell-X, 5 repetitions:

 sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
  --file-io-mode=mmap --threads=X --file-fsync-mode=fdatasync run

  Th.   tip-jun28 avg (stdev)   +patch-set avg (stdev)  change
  ---   ---------------------   ----------------------  ------
  1     1267765 (14146)         1299253 (5715)          +2.4%
  2     1734644 (11936)         1799225 (19577)         +3.7%
  4     2821268 (41184)         2919132 (40149)         +3.4%
  8     4171652 (31243)         4376925 (65416)         +4.9%
  16    5590729 (24160)         5829866 (8127)          +4.2%
  24    6250212 (24481)         6522303 (28044)         +4.3%
  32    3994314 (26606)         4077543 (10685)         +2.0%
  48    4345177 (28091)         4417821 (41337)         +1.6%

(Note that on configurations with up to 24 threads numactl was used to
set all threads on socket 1, which explains the drop in performance when
going to 32 threads).

Running the same benchmark with security mitigations disabled (PTI,
Spectre, MDS):

  Th.   tip-jun28 avg (stdev)   +patch-set avg (stdev)  change
  ---   ---------------------   ----------------------  ------
  1     1598896 (5174)          1607903 (4091)          +0.5%
  2     2109472 (17827)         2224726 (4372)          +5.4%
  4     3448587 (11952)         3668551 (30219)         +6.3%
  8     5425778 (29641)         5606266 (33519)         +3.3%
  16    6931232 (34677)         7054052 (27873)         +1.7%
  24    7612473 (23482)         7783138 (13871)         +2.2%
  32    4296274 (18029)         4283279 (32323)         -0.3%
  48    4770029 (35541)         4764760 (13575)         -0.1%

Presumably, PTI requires two invalidations of each mapping, which allows
concurrency to provide higher benefits when PTI is on. At the same
time, when mitigations are on, other overheads reduce the potential
speedup.

I tried to reduce the size of the code of the main patch, which required
restructuring of the series.

v2 -> v3:
* Open-code the remote/local-flush decision code [Andy]
* Fix hyper-v, Xen implementations [Andrew]
* Fix redundant TLB flushes.

v1 -> v2:
* Removing the patches that Thomas took [tglx]
* Adding hyper-v, Xen compile-tested implementations [Dave]
* Removing UV [Andy]
* Adding lazy optimization, removing inline keyword [Dave]
* Restructuring patch-set

RFCv2 -> v1:
* Fix comment on flush_tlb_multi [Juergen]
* Removing async invalidation optimizations [Andy]
* Adding KVM support [Paolo]

Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Boris Ostrovsky 
Cc: Dave Hansen 
Cc: Haiyang Zhang 
Cc: Ingo Molnar 
Cc: Josh Poimboeuf 
Cc: Juergen Gross 
Cc: "K. Y. Srinivasan" 
Cc: Paolo Bonzini 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Sasha Levin 
Cc: Stephen Hemminger 
Cc: Thomas Gleixner 
Cc: k...@vger.kernel.org
Cc: linux-hyp...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: virtualizat...@lists.linux-foundation.org
Cc: x...@kernel.org
Cc: xen-de...@lists.xenproject.org

Nadav Amit (9):
  smp: Run functions concurrently in smp_call_function_many()
  x86/mm/tlb: Remove reason as argument for flush_tlb_func_local()
  x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
  x86/mm/tlb: Flush remote and local TLBs concurrently
  x86/mm/tlb: Privatize cpu_tlbstate
  x86/mm/tlb: Do not make is_lazy dirty for no reason
  cpumask: Mark functions as pure
  x86/mm/tlb: Remove UV special case
  x86/mm/tlb: Remove unnecessary uses of the inline keyword

 arch/x86/hyperv/mmu.c |  10 +-
 arch/x86/include/asm/paravirt.h   |   6 +-
 arch/x86/include/asm/paravirt_types.h |   4 +-
 arch/x86/include/asm/tlbflush.h   |  47 -
 arch/x86/include/asm/trace/hyperv.h   |   2 +-
 arch/x86/kernel/kvm.c |  11 ++-
 arch/x86/kernel/paravirt.c|   2 +-
 arch/x86/mm/init.c|   2 +-
 arch/x86/mm/tlb.c | 133 --
 arch/x86/xen/mmu_pv.c |  11 +--
 incl

[PATCH v3 1/9] smp: Run functions concurrently in smp_call_function_many()

2019-07-18 Thread Nadav Amit
Currently, on_each_cpu() and similar functions do not exploit the
potential of concurrency: the function is first executed remotely and
only then it is executed locally. Functions such as TLB flush can take
considerable time, so this provides an opportunity for performance
optimization.

To do so, introduce __smp_call_function_many(), which allows the callers
to provide local and remote functions that should be executed, and run
them concurrently. Keep smp_call_function_many() semantic as it is today
for backward compatibility: the called function is not executed in this
case locally.

__smp_call_function_many() does not use the optimized version for a
single remote target that smp_call_function_single() implements. For
synchronous function call, smp_call_function_single() keeps a
call_single_data (which is used for synchronization) on the stack.
Interestingly, it seems that not using this optimization provides
greater performance improvements (greater speedup with a single remote
target than with multiple ones). Presumably, holding data structures
that are intended for synchronization on the stack can introduce
overheads due to TLB misses and false-sharing when the stack is used for
other purposes.

Adding support to run the functions concurrently required removing a
micro-optimization in on_each_cpu() that disabled/enabled IRQs instead of
saving/restoring them. The benefit of running the local and remote code
concurrently is expected to be greater.

Cc: Peter Zijlstra 
Cc: Dave Hansen 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Cc: Andy Lutomirski 
Cc: Josh Poimboeuf 
Signed-off-by: Nadav Amit 
---
 include/linux/smp.h |  27 ++---
 kernel/smp.c| 133 +---
 2 files changed, 83 insertions(+), 77 deletions(-)
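
A hedged usage sketch of the new interface (the callbacks and caller below
are placeholders, not from the patch): the caller provides a remote function
and, optionally, a local one, and the two run concurrently:

	static void remote_work(void *info)
	{
		/* runs on each remote CPU in the mask, in IPI context */
	}

	static void local_work(void *info)
	{
		/* runs on the calling CPU while the IPIs are in flight */
	}

	static void run_on_mask(const struct cpumask *mask, void *info)
	{
		preempt_disable();	/* required by __smp_call_function_many() */
		__smp_call_function_many(mask, remote_work, local_work, info, true);
		preempt_enable();

		/*
		 * smp_call_function_many(mask, remote_work, info, true) keeps
		 * the old semantics: no local invocation.
		 */
	}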

diff --git a/include/linux/smp.h b/include/linux/smp.h
index 6fc856c9eda5..c31c7cde0f77 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -32,11 +32,6 @@ extern unsigned int total_cpus;
 int smp_call_function_single(int cpuid, smp_call_func_t func, void *info,
 int wait);
 
-/*
- * Call a function on all processors
- */
-void on_each_cpu(smp_call_func_t func, void *info, int wait);
-
 /*
  * Call a function on processors specified by mask, which might include
  * the local one.
@@ -44,6 +39,15 @@ void on_each_cpu(smp_call_func_t func, void *info, int wait);
 void on_each_cpu_mask(const struct cpumask *mask, smp_call_func_t func,
void *info, bool wait);
 
+/*
+ * Call a function on all processors.  May be used during early boot while
+ * early_boot_irqs_disabled is set.
+ */
+static inline void on_each_cpu(smp_call_func_t func, void *info, int wait)
+{
+   on_each_cpu_mask(cpu_online_mask, func, info, wait);
+}
+
 /*
  * Call a function on each processor for which the supplied function
  * cond_func returns a positive value. This may include the local
@@ -102,8 +106,17 @@ extern void smp_cpus_done(unsigned int max_cpus);
  * Call a function on all other processors
  */
 void smp_call_function(smp_call_func_t func, void *info, int wait);
-void smp_call_function_many(const struct cpumask *mask,
-   smp_call_func_t func, void *info, bool wait);
+
+void __smp_call_function_many(const struct cpumask *mask,
+ smp_call_func_t remote_func,
+ smp_call_func_t local_func,
+ void *info, bool wait);
+
+static inline void smp_call_function_many(const struct cpumask *mask,
+   smp_call_func_t func, void *info, bool wait)
+{
+   __smp_call_function_many(mask, func, NULL, info, wait);
+}
 
 int smp_call_function_any(const struct cpumask *mask,
  smp_call_func_t func, void *info, int wait);
diff --git a/kernel/smp.c b/kernel/smp.c
index 616d4d114847..d5d8dee8e3f3 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -388,9 +388,13 @@ int smp_call_function_any(const struct cpumask *mask,
 EXPORT_SYMBOL_GPL(smp_call_function_any);
 
 /**
- * smp_call_function_many(): Run a function on a set of other CPUs.
+ * __smp_call_function_many(): Run a function on a set of CPUs.
  * @mask: The set of cpus to run on (only runs on online subset).
- * @func: The function to run. This must be fast and non-blocking.
+ * @remote_func: The function to run on remote cores. This must be fast and
+ *  non-blocking.
+ * @local_func: The function that should be run on this CPU. This must be
+ * fast and non-blocking. If NULL is provided, no function will
+ * be executed on this CPU.
  * @info: An arbitrary pointer to pass to the function.
  * @wait: If true, wait (atomically) until function has completed
  *on other CPUs.
@@ -401,11 +405,16 @@ EXPORT_SYMBOL_GPL(smp_call_function_any);
  * hardware interrupt handler or from a bottom half handler. Preemption
  * must be disabled when calling this function.
  */
-void smp_call_function_many

Re: [x86/modules] f2c65fb322: will-it-scale.per_process_ops -2.9% regression

2019-07-18 Thread Nadav Amit
> On Jul 18, 2019, at 2:50 AM, kernel test robot  wrote:
> 
> Greeting,
> 
> FYI, we noticed a -2.9% regression of will-it-scale.per_process_ops due to 
> commit:
> 
> 
> commit: f2c65fb3221adc6b73b0549fc7ba892022db9797 ("x86/modules: Avoid 
> breaking W^X while loading modules")
> https://kernel.googlesource.com/pub/scm/linux/kernel/git/torvalds/linux.git master
> 
> in testcase: will-it-scale
> on test machine: 192 threads Intel(R) Xeon(R) CPU @ 2.20GHz with 192G memory
> with following parameters:
> 
>   nr_task: 100%
>   mode: process
>   test: poll1
>   cpufreq_governor: performance
> 
> test-description: Will It Scale takes a testcase and runs it from 1 through 
> to n parallel copies to see if the testcase will scale. It builds both a 
> process and threads based test in order to see any differences between the 
> two.
> test-url: https://github.com/antonblanchard/will-it-scale

I don’t understand how this patch has any impact on this workload.

I ran it and set a function tracer for any function that is impacted by this
patch:

  # cd /sys/kernel/debug/tracing
  # echo text_poke_early > set_ftrace_filter
  # echo module_alloc >> set_ftrace_filter
  # echo bpf_int_jit_compile >> set_ftrace_filter
  # tail -f trace

Nothing came up. Can you please check if you see any of them invoked on your
setup? Perhaps you have some bpf filters being installed, although even then
this is a one-time (small) overhead for each process invocation.



Re: [PATCH v2] mm/balloon_compaction: avoid duplicate page removal

2019-07-18 Thread Nadav Amit
> On Jul 18, 2019, at 5:26 AM, Michael S. Tsirkin  wrote:
> 
> On Thu, Jul 18, 2019 at 05:27:20PM +0800, Wei Wang wrote:
>> Fixes: 418a3ab1e778 (mm/balloon_compaction: List interfaces)
>> 
>> A #GP is reported in the guest when requesting balloon inflation via
>> virtio-balloon. The reason is that the virtio-balloon driver has
>> removed the page from its internal page list (via balloon_page_pop),
>> but balloon_page_enqueue_one also calls "list_del"  to do the removal.
>> This is necessary when it's used from balloon_page_enqueue_list, but
>> not from balloon_page_enqueue_one.
>> 
>> So remove the list_del balloon_page_enqueue_one, and update some
>> comments as a reminder.
>> 
>> Signed-off-by: Wei Wang 
> 
> 
> ok I posted v3 with typo fixes. 1/2 is this patch with comment changes. Pls 
> take a look.

Thanks (Wei, Michael) for taking care of it. Please cc me on future
iterations of the patch.

Acked-by: Nadav Amit 



Re: [PATCH 0/3] resource: find_next_iomem_res() improvements

2019-07-16 Thread Nadav Amit
> On Jul 16, 2019, at 3:20 PM, Dan Williams  wrote:
> 
> On Tue, Jul 16, 2019 at 3:13 PM Nadav Amit  wrote:
>>> On Jul 16, 2019, at 3:07 PM, Dan Williams  wrote:
>>> 
>>> On Tue, Jul 16, 2019 at 3:01 PM Andrew Morton  
>>> wrote:
>>>> On Tue, 18 Jun 2019 21:56:43 + Nadav Amit  wrote:
>>>> 
>>>>>> ...and is constant for the life of the device and all subsequent 
>>>>>> mappings.
>>>>>> 
>>>>>>> Perhaps you want to cache the cachability-mode in vma->vm_page_prot 
>>>>>>> (which I
>>>>>>> see being done in quite a few cases), but I don’t know the code well 
>>>>>>> enough
>>>>>>> to be certain that every vma should have a single protection and that it
>>>>>>> should not change afterwards.
>>>>>> 
>>>>>> No, I'm thinking this would naturally fit as a property hanging off a
>>>>>> 'struct dax_device', and then create a version of vmf_insert_mixed()
>>>>>> and vmf_insert_pfn_pmd() that bypass track_pfn_insert() to insert that
>>>>>> saved value.
>>>>> 
>>>>> Thanks for the detailed explanation. I’ll give it a try (the moment I find
>>>>> some free time). I still think that patch 2/3 is beneficial, but based on
>>>>> your feedback, patch 3/3 should be dropped.
>>>> 
>>>> It has been a while.  What should we do with
>>>> 
>>>> resource-fix-locking-in-find_next_iomem_res.patch
>>> 
>>> This one looks obviously correct to me, you can add:
>>> 
>>> Reviewed-by: Dan Williams 
>>> 
>>>> resource-avoid-unnecessary-lookups-in-find_next_iomem_res.patch
>>> 
>>> This one is a good bug report that we need to go fix pgprot lookups
>>> for dax, but I don't think we need to increase the trickiness of the
>>> core resource lookup code in the meantime.
>> 
>> I think that traversing big parts of the tree that are known to be
>> irrelevant is wasteful no matter what, and this code is used in other cases.
>> 
>> I don’t think the new code is so tricky - can you point to the part of the
>> code that you find tricky?
> 
> Given dax can be updated to avoid this abuse of find_next_iomem_res(),
> it was a general observation that the patch adds more lines than it
> removes and is not strictly necessary. I'm ambivalent as to whether it
> is worth pushing upstream. If anything the changelog is going to be
> invalidated by a change to dax to avoid find_next_iomem_res(). Can you
> update the changelog to be relevant outside of the dax case?

Well, 8 lines are comments, 4 are empty lines, so it adds 3 lines of code
according to my calculations. :)

Having said that, if you think I might have made a mistake, or you are
concerned with some bug I might have caused, please let me know. I
understand that this logic might have been lying around for some time.

I can update the commit log, emphasizing the redundant search operations as
the motivation, then mention dax as an instance that suffers from this
overhead, and say it should be handled regardless of this patch-set.
Once I find time, I am going to deal with DAX, unless you beat me to it.

Thanks,
Nadav

Re: [PATCH 0/3] resource: find_next_iomem_res() improvements

2019-07-16 Thread Nadav Amit
> On Jul 16, 2019, at 3:07 PM, Dan Williams  wrote:
> 
> On Tue, Jul 16, 2019 at 3:01 PM Andrew Morton  
> wrote:
>> On Tue, 18 Jun 2019 21:56:43 + Nadav Amit  wrote:
>> 
>>>> ...and is constant for the life of the device and all subsequent mappings.
>>>> 
>>>>> Perhaps you want to cache the cachability-mode in vma->vm_page_prot 
>>>>> (which I
>>>>> see being done in quite a few cases), but I don’t know the code well 
>>>>> enough
>>>>> to be certain that every vma should have a single protection and that it
>>>>> should not change afterwards.
>>>> 
>>>> No, I'm thinking this would naturally fit as a property hanging off a
>>>> 'struct dax_device', and then create a version of vmf_insert_mixed()
>>>> and vmf_insert_pfn_pmd() that bypass track_pfn_insert() to insert that
>>>> saved value.
>>> 
>>> Thanks for the detailed explanation. I’ll give it a try (the moment I find
>>> some free time). I still think that patch 2/3 is beneficial, but based on
>>> your feedback, patch 3/3 should be dropped.
>> 
>> It has been a while.  What should we do with
>> 
>> resource-fix-locking-in-find_next_iomem_res.patch
> 
> This one looks obviously correct to me, you can add:
> 
> Reviewed-by: Dan Williams 
> 
>> resource-avoid-unnecessary-lookups-in-find_next_iomem_res.patch
> 
> This one is a good bug report that we need to go fix pgprot lookups
> for dax, but I don't think we need to increase the trickiness of the
> core resource lookup code in the meantime.

I think that traversing big parts of the tree that are known to be
irrelevant is wasteful no matter what, and this code is used in other cases.

I don’t think the new code is so tricky - can you point to the part of the
code that you find tricky?




Re: [PATCH 0/3] resource: find_next_iomem_res() improvements

2019-07-16 Thread Nadav Amit
> On Jul 16, 2019, at 3:00 PM, Andrew Morton  wrote:
> 
> On Tue, 18 Jun 2019 21:56:43 +0000 Nadav Amit  wrote:
> 
>>> ...and is constant for the life of the device and all subsequent mappings.
>>> 
>>>> Perhaps you want to cache the cachability-mode in vma->vm_page_prot (which 
>>>> I
>>>> see being done in quite a few cases), but I don’t know the code well enough
>>>> to be certain that every vma should have a single protection and that it
>>>> should not change afterwards.
>>> 
>>> No, I'm thinking this would naturally fit as a property hanging off a
>>> 'struct dax_device', and then create a version of vmf_insert_mixed()
>>> and vmf_insert_pfn_pmd() that bypass track_pfn_insert() to insert that
>>> saved value.
>> 
>> Thanks for the detailed explanation. I’ll give it a try (the moment I find
>> some free time). I still think that patch 2/3 is beneficial, but based on
>> your feedback, patch 3/3 should be dropped.
> 
> It has been a while.  What should we do with
> 
> resource-fix-locking-in-find_next_iomem_res.patch
> resource-avoid-unnecessary-lookups-in-find_next_iomem_res.patch
> 
> ?

I didn’t get to follow Dan Williams' advice. But both of the patches are
fine in my opinion and should go upstream. The first one fixes a bug and the
second one improves performance considerably (and removes most of the
overhead). Future improvements can go on top of these patches and are not
expected to conflict.

So I think they should go upstream - the first one immediately, the second
one when possible.

Re: [PATCH v2] x86/paravirt: Drop {read,write}_cr8() hooks

2019-07-15 Thread Nadav Amit
> On Jul 15, 2019, at 4:30 PM, Andrew Cooper  wrote:
> 
> On 15/07/2019 19:17, Nadav Amit wrote:
>>> On Jul 15, 2019, at 8:16 AM, Andrew Cooper  
>>> wrote:
>>> 
>>> There is a lot of infrastructure for functionality which is used
>>> exclusively in __{save,restore}_processor_state() on the suspend/resume
>>> path.
>>> 
>>> cr8 is an alias of APIC_TASKPRI, and APIC_TASKPRI is saved/restored by
>>> lapic_{suspend,resume}().  Saving and restoring cr8 independently of the
>>> rest of the Local APIC state isn't a clever thing to be doing.
>>> 
>>> Delete the suspend/resume cr8 handling, which shrinks the size of struct
>>> saved_context, and allows for the removal of both PVOPS.
>> I think removing the interface for CR8 writes is also good to avoid
>> potential correctness issues, as the SDM says (10.8.6.1 "Interaction of Task
>> Priorities between CR8 and APIC”):
>> 
>> "Operating software should implement either direct APIC TPR updates or CR8
>> style TPR updates but not mix them. Software can use a serializing
>> instruction (for example, CPUID) to serialize updates between MOV CR8 and
>> stores to the APIC.”
>> 
>> And native_write_cr8() did not even issue a serializing instruction.
> 
> Given its location, the one write_cr8() is bounded by two serialising
> operations, so is safe in practice.

That’s what the “potential” in "potential correctness issues” means :)



Re: [PATCH] vmw_balloon: Remove Julien from the maintainers list

2019-07-15 Thread Nadav Amit
> On Jul 14, 2019, at 6:41 AM, Julien Freche  wrote:
> 
> 
> 
> On Jul 14, 2019, at 12:18 AM, Nadav Amit  wrote:
> 
>>> On Jul 13, 2019, at 7:49 AM, Greg Kroah-Hartman 
>>>  wrote:
>>> 
>>>> On Tue, Jul 02, 2019 at 03:05:19AM -0700, Nadav Amit wrote:
>>>> Julien will not be a maintainer anymore.
>>>> 
>>>> Signed-off-by: Nadav Amit 
>>>> ---
>>>> MAINTAINERS | 1 -
>>>> 1 file changed, 1 deletion(-)
>>>> 
>>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>>> index 01a52fc964da..f85874b1e653 100644
>>>> --- a/MAINTAINERS
>>>> +++ b/MAINTAINERS
>>>> @@ -16886,7 +16886,6 @@ F:drivers/vme/
>>>> F:include/linux/vme*
>>>> 
>>>> VMWARE BALLOON DRIVER
>>>> -M:Julien Freche 
>>> 
>>> I need an ack/something from Julien please.
>> 
Julien, can I please have your Acked-by?
> 
> The change looks good to me, Nadav. Thanks for updating the maintainer list.

Thanks, Julien. Greg, do you want a v2 with an Acked-by?

[ Note that Julien's email address will be different from the one in the
MAINTAINERS list ]

Re: [PATCH v2] x86/paravirt: Drop {read,write}_cr8() hooks

2019-07-15 Thread Nadav Amit
> On Jul 15, 2019, at 8:16 AM, Andrew Cooper  wrote:
> 
> There is a lot of infrastructure for functionality which is used
> exclusively in __{save,restore}_processor_state() on the suspend/resume
> path.
> 
> cr8 is an alias of APIC_TASKPRI, and APIC_TASKPRI is saved/restored by
> lapic_{suspend,resume}().  Saving and restoring cr8 independently of the
> rest of the Local APIC state isn't a clever thing to be doing.
> 
> Delete the suspend/resume cr8 handling, which shrinks the size of struct
> saved_context, and allows for the removal of both PVOPS.

I think removing the interface for CR8 writes is also good to avoid
potential correctness issues, as the SDM says (10.8.6.1 "Interaction of Task
Priorities between CR8 and APIC”):

"Operating software should implement either direct APIC TPR updates or CR8
style TPR updates but not mix them. Software can use a serializing
instruction (for example, CPUID) to serialize updates between MOV CR8 and
stores to the APIC.”

And native_write_cr8() did not even issue a serializing instruction.



Re: [PATCH] x86/apic: Initialize TPR to block interrupts 16-31

2019-07-14 Thread Nadav Amit
> On Jul 14, 2019, at 8:23 AM, Andy Lutomirski  wrote:
> 
> The APIC, per spec, is fundamentally confused and thinks that
> interrupt vectors 16-31 are valid.  This makes no sense -- the CPU
> reserves vectors 0-31 for exceptions (faults, traps, etc).
> Obviously, no device should actually produce an interrupt with
> vector 16-31, but we can improve robustness by setting the APIC TPR
> class to 1, which will prevent delivery of an interrupt with a
> vector below 32.
> 
> Note: this is *not* intended as a security measure against attackers
> who control malicious hardware.  Any PCI or similar hardware that
> can be controlled by an attacker MUST be behind a functional IOMMU
> that remaps interrupts.  The purpose of this patch is to reduce the
> chance that a certain class of device malfunctions crashes the
> kernel in hard-to-debug ways.
> 
> Cc: Nadav Amit 
> Cc: Stephane Eranian 
> Cc: Feng Tang 
> Suggested-by: Andrew Cooper 
> Signed-off-by: Andy Lutomirski 
> ---
> arch/x86/kernel/apic/apic.c | 7 +--
> 1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index 177aa8ef2afa..ff31322f8839 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -1531,11 +1531,14 @@ static void setup_local_APIC(void)
> #endif
> 
>   /*
> -  * Set Task Priority to 'accept all'. We never change this
> -  * later on.
> +  * Set Task Priority to 'accept all except vectors 0-31'.  An APIC
> +  * vector in the 16-31 range could be delivered if TPR == 0, but we
> +  * would think it's an exception and terrible things will happen.  We
> +  * never change this later on.
>*/
>   value = apic_read(APIC_TASKPRI);
>   value &= ~APIC_TPRI_MASK;
> + value |= 0x10;
>   apic_write(APIC_TASKPRI, value);
> 
>   apic_pending_intr_clear();

It looks fine, and indeed it seems that writes to APIC_TASKPRI and CR8 are
not overwriting this value.

Yet, the fact that overwriting the TPR with zero (or failing to restore it)
will not produce any warning is not that great. Not that I know what the
right course of action is (adding checks in write_cr8()? But then what
happens if APIC_TASKPRI is not restored after some APIC reset?)
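
For illustration, the kind of check alluded to above might look roughly like
this (a hypothetical sketch, not an actual proposal; the helper name is made
up):

	static inline void checked_write_cr8(unsigned long val)
	{
		/*
		 * CR8 holds the TPR priority class (bits 3:0). Writing class 0
		 * would re-allow delivery of vectors 16-31, silently undoing
		 * the setup done in setup_local_APIC().
		 */
		WARN_ON_ONCE((val & 0xf) == 0);
		native_write_cr8(val);
	}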



Re: [PATCH] vmw_balloon: Remove Julien from the maintainers list

2019-07-14 Thread Nadav Amit
> On Jul 13, 2019, at 7:49 AM, Greg Kroah-Hartman  
> wrote:
> 
> On Tue, Jul 02, 2019 at 03:05:19AM -0700, Nadav Amit wrote:
>> Julien will not be a maintainer anymore.
>> 
>> Signed-off-by: Nadav Amit 
>> ---
>> MAINTAINERS | 1 -
>> 1 file changed, 1 deletion(-)
>> 
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 01a52fc964da..f85874b1e653 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -16886,7 +16886,6 @@ F:   drivers/vme/
>> F:   include/linux/vme*
>> 
>> VMWARE BALLOON DRIVER
>> -M:  Julien Freche 
> 
> I need an ack/something from Julien please.

Julien, can I please have your Acked-by?



Re: [GIT PULL] x86/topology changes for v5.3

2019-07-11 Thread Nadav Amit
> On Jul 11, 2019, at 8:08 AM, Kees Cook  wrote:
> 
> On Thu, Jul 11, 2019 at 10:01:34AM +0200, Peter Zijlstra wrote:
>> On Thu, Jul 11, 2019 at 07:11:19AM +0000, Nadav Amit wrote:
>>>> On Jul 10, 2019, at 7:22 AM, Jiri Kosina  wrote:
>>>> 
>>>> On Wed, 10 Jul 2019, Peter Zijlstra wrote:
>>>> 
>>>>> If we mark the key as RO after init, and then try and modify the key to
>>>>> link module usage sites, things might go bang as described.
>>>>> 
>>>>> Thanks!
>>>>> 
>>>>> 
>>>>> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
>>>>> index 27d7864e7252..5bf7a8354da2 100644
>>>>> --- a/arch/x86/kernel/cpu/common.c
>>>>> +++ b/arch/x86/kernel/cpu/common.c
>>>>> @@ -366,7 +366,7 @@ static __always_inline void setup_umip(struct 
>>>>> cpuinfo_x86 *c)
>>>>>   cr4_clear_bits(X86_CR4_UMIP);
>>>>> }
>>>>> 
>>>>> -DEFINE_STATIC_KEY_FALSE_RO(cr_pinning);
>>>>> +DEFINE_STATIC_KEY_FALSE(cr_pinning);
>>>> 
>>>> Good catch, I guess that is going to fix it.
>>>> 
>>>> At the same time though, it sort of destroys the original intent of Kees' 
>>>> patch, right? The exploits will just have to call static_key_disable() 
>>>> prior to calling native_write_cr4() again, and the protection is gone.
>>> 
>>> Even with DEFINE_STATIC_KEY_FALSE_RO(), I presume you can just call
>>> set_memory_rw(), make the page that holds the key writable, and then call
>>> static_key_disable(), followed by a call to native_write_cr4().
>> 
>> Or call text_poke_bp() with the right set of arguments.
> 
> Right -- the point is to make it defended against an arbitrary write,
> not arbitrary execution. Nothing is safe from arbitrary exec, but we can
> do our due diligence on making things read-only.

I don’t understand.

In the PoC that motivated this patch [1], the attacker gained the ability
to call a function, controlled its first argument, and used it to disable
SMEP/SMAP by calling native_write_cr4(), which he did before doing an
arbitrary write (another ability he gained).

After this patch, the attacker can instead call three functions and, by
controlling their first arguments (*), disable SMEP. I didn’t see anything
in the motivating PoC that prevented calling three functions one at a time.

So it seems to me that it raised the bar for an attack by very little.

--

(*) set_memory_rw() has a second argument - the number of pages - but many
non-zero values may be fine (if not, a warning from __cpa_process_fault()
might appear).

[1] 
https://googleprojectzero.blogspot.com/2017/05/exploiting-linux-kernel-via-packet.html



Re: [GIT PULL] x86/topology changes for v5.3

2019-07-11 Thread Nadav Amit
> On Jul 10, 2019, at 7:22 AM, Jiri Kosina  wrote:
> 
> On Wed, 10 Jul 2019, Peter Zijlstra wrote:
> 
>> If we mark the key as RO after init, and then try and modify the key to
>> link module usage sites, things might go bang as described.
>> 
>> Thanks!
>> 
>> 
>> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
>> index 27d7864e7252..5bf7a8354da2 100644
>> --- a/arch/x86/kernel/cpu/common.c
>> +++ b/arch/x86/kernel/cpu/common.c
>> @@ -366,7 +366,7 @@ static __always_inline void setup_umip(struct 
>> cpuinfo_x86 *c)
>>  cr4_clear_bits(X86_CR4_UMIP);
>> }
>> 
>> -DEFINE_STATIC_KEY_FALSE_RO(cr_pinning);
>> +DEFINE_STATIC_KEY_FALSE(cr_pinning);
> 
> Good catch, I guess that is going to fix it.
> 
> At the same time though, it sort of destroys the original intent of Kees' 
> patch, right? The exploits will just have to call static_key_disable() 
> prior to calling native_write_cr4() again, and the protection is gone.

Even with DEFINE_STATIC_KEY_FALSE_RO(), I presume you can just call
set_memory_rw(), make the page that holds the key writable, and then call
static_key_disable(), followed by a call to native_write_cr4().

Re: [PATCH v2 8/9] x86/mm/tlb: Remove UV special case

2019-07-09 Thread Nadav Amit
> On Jul 9, 2019, at 1:29 PM, Mike Travis  wrote:
> 
> 
> 
> On 7/9/2019 1:09 PM, Russ Anderson wrote:
>> On Tue, Jul 09, 2019 at 09:50:27PM +0200, Thomas Gleixner wrote:
>>> On Tue, 2 Jul 2019, Nadav Amit wrote:
>>> 
>>>> SGI UV support is outdated and not maintained, and it is not clear how
>>>> it performs relatively to non-UV. Remove the code to simplify the code.
>>> 
>>> You should at least Cc the SGI/HP folks on that. They are still
>>> around. Done so.
>> Thanks Thomas.  The SGI UV is now HPE Superdome Flex and is
>> very much still supported.
>> Thanks.
>>>> Cc: Peter Zijlstra 
>>>> Cc: Dave Hansen 
>>>> Suggested-by: Andy Lutomirski 
>>>> Signed-off-by: Nadav Amit 
>>>> ---
>>>>  arch/x86/mm/tlb.c | 25 -
>>>>  1 file changed, 25 deletions(-)
>>>> 
>>>> diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
>>>> index b47a71820f35..64afe1215495 100644
>>>> --- a/arch/x86/mm/tlb.c
>>>> +++ b/arch/x86/mm/tlb.c
>>>> @@ -689,31 +689,6 @@ void native_flush_tlb_multi(const struct cpumask 
>>>> *cpumask,
>>>>trace_tlb_flush(TLB_REMOTE_SEND_IPI,
>>>>(info->end - info->start) >> PAGE_SHIFT);
>>>>  - if (is_uv_system()) {
>>>> -  /*
>>>> -   * This whole special case is confused.  UV has a "Broadcast
>>>> -   * Assist Unit", which seems to be a fancy way to send IPIs.
>>>> -   * Back when x86 used an explicit TLB flush IPI, UV was
>>>> -   * optimized to use its own mechanism.  These days, x86 uses
>>>> -   * smp_call_function_many(), but UV still uses a manual IPI,
>>>> -   * and that IPI's action is out of date -- it does a manual
>>>> -   * flush instead of calling flush_tlb_func_remote().  This
>>>> -   * means that the percpu tlb_gen variables won't be updated
>>>> -   * and we'll do pointless flushes on future context switches.
>>>> -   *
>>>> -   * Rather than hooking native_flush_tlb_multi() here, I think
>>>> -   * that UV should be updated so that smp_call_function_many(),
>>>> -   * etc, are optimal on UV.
>>>> -   */
> 
> I thought this change was already proposed and we acked it a while back.
> Also, the replacement functionality is being worked on, but it is more
> complex: smp_call_function_many() has to support all the reasons it is
> called, not just the TLB shootdowns, as in the current BAU case.

Sorry for not cc’ing you before. In the meantime, can you give an explicit
Acked-by? (I couldn’t find the previous patch you referred to.)

Thanks,
Nadav

Re: linux-next: manual merge of the char-misc tree with the driver-core tree

2019-07-08 Thread Nadav Amit
> On Jul 8, 2019, at 4:20 PM, Stephen Rothwell  wrote:
> 
> Hi all,
> 
> On Thu, 13 Jun 2019 15:53:44 +1000 Stephen Rothwell  
> wrote:
>> Today's linux-next merge of the char-misc tree got a conflict in:
>> 
>>  drivers/misc/vmw_balloon.c
>> 
>> between commit:
>> 
>>  225afca60b8a ("vmw_balloon: no need to check return value of debugfs_create 
>> functions")
>> 
>> from the driver-core tree and commits:
>> 
>>  83a8afa72e9c ("vmw_balloon: Compaction support")
>>  5d1a86ecf328 ("vmw_balloon: Add memory shrinker")
>> 
>> from the char-misc tree.
>> 
>> I fixed it up (see below) and can carry the fix as necessary. This
>> is now fixed as far as linux-next is concerned, but any non trivial
>> conflicts should be mentioned to your upstream maintainer when your tree
>> is submitted for merging.  You may also want to consider cooperating
>> with the maintainer of the conflicting tree to minimise any particularly
>> complex conflicts.
>> 
>> -- 
>> Cheers,
>> Stephen Rothwell
>> 
>> diff --cc drivers/misc/vmw_balloon.c
>> index fdf5ad757226,043eed845246..
>> --- a/drivers/misc/vmw_balloon.c
>> +++ b/drivers/misc/vmw_balloon.c
>> @@@ -1553,15 -1942,26 +1932,24 @@@ static int __init vmballoon_init(void
>>  if (x86_hyper_type != X86_HYPER_VMWARE)
>>  return -ENODEV;
>> 
>> -for (page_size = VMW_BALLOON_4K_PAGE;
>> - page_size <= VMW_BALLOON_LAST_SIZE; page_size++)
>> -INIT_LIST_HEAD(&balloon.page_sizes[page_size].pages);
>> - 
>> - 
>>  INIT_DELAYED_WORK(&balloon.dwork, vmballoon_work);
>> 
>> +error = vmballoon_register_shrinker();
>> +if (error)
>> +goto fail;
>> + 
>> -error = vmballoon_debugfs_init();
>> -if (error)
>> -goto fail;
>> +vmballoon_debugfs_init();
>> 
>> +/*
>> + * Initialization of compaction must be done after the call to
>> + * balloon_devinfo_init() .
>> + */
>> +balloon_devinfo_init(&balloon.b_dev_info);
>> +error = vmballoon_compaction_init();
>> +if (error)
>> +goto fail;
>> + 
>> +INIT_LIST_HEAD(&balloon.huge_pages);
>>  spin_lock_init(&balloon.comm_lock);
>>  init_rwsem(&balloon.conf_sem);
>>  balloon.vmci_doorbell = VMCI_INVALID_HANDLE;
> 
> I am still getting this conflict (the commit ids may have changed).
> Just a reminder in case you think Linus may need to know.

Greg,

Can you please take care of it, as you are the maintainer of char-misc
and the committer of the patch?





Re: [patch V2 04/25] x86/apic: Make apic_pending_intr_clear() more robust

2019-07-05 Thread Nadav Amit
> On Jul 5, 2019, at 8:47 AM, Andrew Cooper  wrote:
> 
> On 04/07/2019 16:51, Thomas Gleixner wrote:
>>  2) The loop termination logic is interesting at best.
>> 
>> If the machine has no TSC or cpu_khz is not known yet it tries 1
>> million times to ack stale IRR/ISR bits. What?
>> 
>> With TSC it uses the TSC to calculate the loop termination. It takes a
>> timestamp at entry and terminates the loop when:
>> 
>>(rdtsc() - start_timestamp) >= (cpu_khz << 10)
>> 
>> That's roughly one second.
>> 
>> Both methods are problematic. The APIC has 256 vectors, which means
>> that in theory max. 256 IRR/ISR bits can be set. In practice this is
>> impossible as the first 32 vectors are reserved and not affected and
>> the chance that more than a few bits are set is close to zero.
> 
> [Disclaimer.  I talked to Thomas in private first, and he asked me to
> post this publicly as the CVE is almost a decade old already.]
> 
> I'm afraid that this isn't quite true.
> 
> In terms of IDT vectors, the first 32 are reserved for exceptions, but
> only the first 16 are reserved in the LAPIC.  Vectors 16-31 are fair
> game for incoming IPIs (SDM Vol3, 10.5.2 Valid Interrupt Vectors).
> 
> In practice, this makes Linux vulnerable to CVE-2011-1898 / XSA-3, which
> I'm disappointed to see wasn't shared with other software vendors at the
> time.

IIRC (and from skimming the CVE again), the basic problem in Xen was that
when devices are assigned, MSIs can be used to generate IRQs with arbitrary
vectors. The mitigation was to require interrupt remapping to be enabled in
the IOMMU whenever the IOMMU is used for DMA remapping (i.e., device
assignment).

Are you concerned about this case, additional concrete ones, or is it about
security hardening? (or am I missing something?)

Re: [PATCH] KVM: LAPIC: ARBPRI is a reserved register for x2APIC

2019-07-05 Thread Nadav Amit
> On Jul 5, 2019, at 6:43 AM, Paolo Bonzini  wrote:
> 
> On 05/07/19 15:37, Nadav Amit wrote:
>>> On Jul 5, 2019, at 5:14 AM, Paolo Bonzini  wrote:
>>> 
>>> kvm-unit-tests were adjusted to match bare metal behavior, but KVM
>>> itself was not doing what bare metal does; fix that.
>>> 
>>> Signed-off-by: Paolo Bonzini 
>> 
>> Reported-by ?
> 
> I found it myself while running the tests, was there a report too?

Perhaps it is just in my mind. I thought that fixing a test is equivalent to
a bug report.



Re: [PATCH] KVM: LAPIC: ARBPRI is a reserved register for x2APIC

2019-07-05 Thread Nadav Amit
> On Jul 5, 2019, at 5:14 AM, Paolo Bonzini  wrote:
> 
> kvm-unit-tests were adjusted to match bare metal behavior, but KVM
> itself was not doing what bare metal does; fix that.
> 
> Signed-off-by: Paolo Bonzini 

Reported-by ?


Re: [patch V2 21/25] x86/smp: Enhance native_send_call_func_ipi()

2019-07-04 Thread Nadav Amit
> On Jul 4, 2019, at 8:52 AM, Thomas Gleixner  wrote:
> 
> Nadav noticed that the cpumask allocations in native_send_call_func_ipi()
> are noticeable in microbenchmarks.
> 
> Use the new cpumask_or_equal() function to simplify the decision whether
> the supplied target CPU mask is either equal to cpu_online_mask or equal to
> cpu_online_mask except for the CPU on which the function is invoked.
> 
> cpumask_or_equal() or's the target mask and the cpumask of the current CPU
> together and compares it to cpu_online_mask.
> 
> If the result is false, use the mask based IPI function, otherwise check
> whether the current CPU is set in the target mask and invoke either the
> send_IPI_all() or the send_IPI_allbutself() APIC callback.
> 
> Make the shorthand decision also depend on the static key which enables
> shorthand mode. That allows to remove the extra cpumask comparison with
> cpu_callout_mask.
> 
> Reported-by: Nadav Amit 
> Signed-off-by: Thomas Gleixner 
> ---
> V2: New patch
> ---
> arch/x86/kernel/apic/ipi.c |   24 +++-
> 1 file changed, 11 insertions(+), 13 deletions(-)
> 
> --- a/arch/x86/kernel/apic/ipi.c
> +++ b/arch/x86/kernel/apic/ipi.c
> @@ -83,23 +83,21 @@ void native_send_call_func_single_ipi(in
> 
> void native_send_call_func_ipi(const struct cpumask *mask)
> {
> - cpumask_var_t allbutself;
> + if (static_branch_likely(&apic_use_ipi_shorthand)) {
> + unsigned int cpu = smp_processor_id();
> 
> - if (!alloc_cpumask_var(&allbutself, GFP_ATOMIC)) {
> - apic->send_IPI_mask(mask, CALL_FUNCTION_VECTOR);
> + if (!cpumask_or_equal(mask, cpumask_of(cpu), cpu_online_mask))
> + goto sendmask;
> +
> + if (cpumask_test_cpu(cpu, mask))
> + apic->send_IPI_all(CALL_FUNCTION_VECTOR);
> + else if (num_online_cpus() > 1)
> + apic->send_IPI_allbutself(CALL_FUNCTION_VECTOR);
>   return;
>   }
> 
> - cpumask_copy(allbutself, cpu_online_mask);
> - cpumask_clear_cpu(smp_processor_id(), allbutself);
> -
> - if (cpumask_equal(mask, allbutself) &&
> - cpumask_equal(cpu_online_mask, cpu_callout_mask))
> - apic->send_IPI_allbutself(CALL_FUNCTION_VECTOR);
> - else
> - apic->send_IPI_mask(mask, CALL_FUNCTION_VECTOR);
> -
> - free_cpumask_var(allbutself);
> +sendmask:
> + apic->send_IPI_mask(mask, CALL_FUNCTION_VECTOR);
> }
> 
> #endif /* CONFIG_SMP */

It does look better and simpler than my solution.
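
For anyone reading along: cpumask_or_equal() tests exactly the property that
the old code paid a temporary cpumask allocation for. A sketch of the
equivalent check written with the long-standing cpumask helpers (illustrative
only, not part of the patch):

	static bool ipi_covers_all_but_maybe_self(const struct cpumask *mask,
						  unsigned int cpu)
	{
		cpumask_var_t tmp;
		bool ret;

		if (!zalloc_cpumask_var(&tmp, GFP_KERNEL))
			return false;

		/* (*mask | *cpumask_of(cpu)) == *cpu_online_mask ? */
		cpumask_or(tmp, mask, cpumask_of(cpu));
		ret = cpumask_equal(tmp, cpu_online_mask);

		free_cpumask_var(tmp);
		return ret;
	}

With cpumask_or_equal() the same answer is obtained without the allocation,
which is what showed up in the microbenchmarks in the first place.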


