Re: [PATCH 2/2] x86/paravirt: Make the struct paravirt_patch_site packed

2023-06-09 Thread Nadav Amit via Virtualization


> On Jun 9, 2023, at 2:45 AM, Hou Wenlong  wrote:
> 
> Similar to struct alt_instr, make the struct paravirt_patch_site packed
> and get rid of all the .align directives. This could save 2 bytes for
> one entry on X86_64.

Thanks for taking care of it.
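
For illustration, the 2 bytes come from the tail padding of a 4-byte-aligned
entry; a rough sketch of the layouts being compared (the exact kernel
definition may differ):

struct paravirt_patch_site_aligned {		/* 4-byte aligned (sketch) */
	s32 instr_offset;	/* original instruction, relative reference */
	u8  type;		/* type of this instruction */
	u8  len;		/* length of original instruction */
};				/* sizeof == 8 on x86-64: 2 bytes of tail padding */

struct paravirt_patch_site_packed {		/* proposed (sketch) */
	s32 instr_offset;
	u8  type;
	u8  len;
} __packed;			/* sizeof == 6: saves 2 bytes per entry */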



Re: [PATCH RFC 13/43] x86/paravirt: Use relative reference for original instruction

2023-06-05 Thread Nadav Amit via Virtualization



> On Jun 1, 2023, at 2:29 AM, Juergen Gross  wrote:
> 
> On 28.04.23 11:50, Hou Wenlong wrote:
>> Similar to the alternative patching, use relative reference for original
>> instruction rather than absolute one, which saves 8 bytes for one entry
>> on x86_64.  And it could generate R_X86_64_PC32 relocation instead of
>> R_X86_64_64 relocation, which also reduces relocation metadata on
>> relocatable builds. And the alignment could be hard coded to be 4 now.
>> Signed-off-by: Hou Wenlong 
>> Cc: Thomas Garnier 
>> Cc: Lai Jiangshan 
>> Cc: Kees Cook 
> 
> Reviewed-by: Juergen Gross 
> 
> I think this patch should be taken even without the series.

It looks good to me; I am just not sure why the alignment is needed
at all.

Why not make the struct __packed (like struct alt_instr) and get rid
of all the .align directives? Am I missing something?
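
For reference, a small sketch of what the relative reference means
(illustrative C, not the patch itself): a self-relative s32 replaces an
absolute 64-bit pointer, so the entry needs only a 32-bit PC-relative
relocation, and the target is recovered from the field's own address (the
kernel's offset_to_ptr() helper implements this pattern):

struct patch_site_rel {			/* illustrative, not the kernel struct */
	s32 instr_offset;	/* target == (void *)&instr_offset + instr_offset */
	u8  type;
	u8  len;
};

static inline void *patch_site_instr(struct patch_site_rel *p)
{
	/* same idea as offset_to_ptr() */
	return (void *)&p->instr_offset + p->instr_offset;
}

With the absolute pointer the entry was 8-byte aligned and 16 bytes in total,
so the relative form roughly halves it, which is where the 8 bytes per entry
mentioned in the commit message come from.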



Re: [PATCH 2/3] MAINTAINERS: Change status of some VMware drivers

2022-09-06 Thread Nadav Amit via Virtualization
On Sep 6, 2022, at 10:27 AM, Vishnu Dasa  wrote:

> From: Vishnu Dasa 
> 
> Change the status from 'Maintained' to 'Supported' for VMWARE
> BALLOON DRIVER, VMWARE PVRDMA DRIVER, VMWARE PVSCSI driver,
> VMWARE VMCI DRIVER, VMWARE VMMOUSE SUBDRIVER and VMWARE VMXNET3
> ETHERNET DRIVER.
> 
> This needs to be done to conform to the guidelines in [1].
> Maintainers for these drivers are VMware employees.
> 
> [1] https://docs.kernel.org/process/maintainers.html
> 
> Signed-off-by: Vishnu Dasa 
> ---
> MAINTAINERS | 12 ++--
> 1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index b75eb23a099b..5a634b5d6f6c 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -21812,7 +21812,7 @@ VMWARE BALLOON DRIVER
> M:Nadav Amit 
> R:VMware PV-Drivers Reviewers 
> L:linux-ker...@vger.kernel.org
> -S:   Maintained
> +S:   Supported
> F:drivers/misc/vmw_balloon.c

Acked-by: Nadav Amit 



Re: [PATCH v3 1/4] Make place for common balloon code

2022-08-22 Thread Nadav Amit via Virtualization
On Aug 22, 2022, at 4:37 AM, Alexander Atanasov wrote:

> mm/balloon_compaction.c -> mm/balloon.c
> File already contains code that is common along balloon
> drivers so rename it to reflect its contents.
> 
> include/linux/balloon_compaction.h -> include/linux/balloon.h
> Remove it from files which do not actually use it.
> Drop externs from function declarations.
> 
> Signed-off-by: Alexander Atanasov 

Makes so much sense.

Acked-by: Nadav Amit 




Re: Potential deadlock in Linux VMCI driver

2022-08-16 Thread Nadav Amit via Virtualization
On Aug 16, 2022, at 11:23 AM, Nadav Amit  wrote:

> During my development, I enabled some Linux kernel checkers, specifically
> the “sleep in atomic” checker.
> 
> I ran into an unrelated issue that appears to be a result of commit
> 463713eb6164b6 ("VMCI: dma dg: add support for DMA datagrams receive”).
> IIUC, vmci_read_data() calls wait_event(), which is not allowed while IRQs
> are disabled, which they are during IRQ handling.

Just a minor correction of myself: IRQs are not disabled; preemption is
disabled.

> 
> [   22.653012] Preemption disabled at:
> [   22.653017] __do_softirq (kernel/softirq.c:504 kernel/softirq.c:548) 


Potential deadlock in Linux VMCI driver

2022-08-16 Thread Nadav Amit via Virtualization
During my development, I enabled some Linux kernel checkers, specifically
the “sleep in atomic” checker.

I ran into an unrelated issue that appears to be a result of commit
463713eb6164b6 ("VMCI: dma dg: add support for DMA datagrams receive”).
IIUC, vmci_read_data() calls wait_event(), which is not allowed while IRQs
are disabled, which they are during IRQ handling.

I think "CONFIG_DEBUG_ATOMIC_SLEEP=y" is the one that triggers the warning
below, which indicates a deadlock is possible.

The splat below (after decoding) was experienced on Linux 5.19. Let me know
if you need me to open a bug in bugzilla or whether this issue is already
known.


[   22.629691] BUG: sleeping function called from invalid context at drivers/misc/vmw_vmci/vmci_guest.c:145
[   22.633894] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 775, name: cloud-init
[   22.638232] preempt_count: 100, expected: 0
[   22.641887] RCU nest depth: 0, expected: 0
[   22.645461] 1 lock held by cloud-init/775:
[   22.649013] #0: 88810e057200 (>i_mutex_dir_key#6){}-{3:3}, at: iterate_dir (fs/readdir.c:46)
[   22.653012] Preemption disabled at:
[   22.653017] __do_softirq (kernel/softirq.c:504 kernel/softirq.c:548)
[   22.660264] CPU: 3 PID: 775 Comm: cloud-init Not tainted 5.19.0+ #3
[   22.664004] Hardware name: VMware, Inc. VMware20,1/440BX Desktop Reference Platform, BIOS VMW201.00V.20253199.B64.2208081742 08/08/2022
[   22.671600] Call Trace:
[   22.675165]
[   22.678681] dump_stack_lvl (lib/dump_stack.c:107 (discriminator 4))
[   22.682303] dump_stack (lib/dump_stack.c:114)
[   22.685883] __might_resched.cold (kernel/sched/core.c:9822)
[   22.689500] __might_sleep (kernel/sched/core.c:9751 (discriminator 14))
[   22.692961] vmci_read_data (./include/linux/kernel.h:110 drivers/misc/vmw_vmci/vmci_guest.c:145) vmw_vmci
[   22.696461] ? vmci_interrupt_bm (drivers/misc/vmw_vmci/vmci_guest.c:121) vmw_vmci
[   22.699920] ? __this_cpu_preempt_check (lib/smp_processor_id.c:67)
[   22.703305] ? wake_up_var (./include/linux/list.h:292 ./include/linux/wait.h:129 kernel/sched/wait_bit.c:125 kernel/sched/wait_bit.c:193)
[   22.706526] ? cpuusage_read (kernel/sched/wait_bit.c:192)
[   22.709682] ? mark_held_locks (kernel/locking/lockdep.c:4234)
[   22.712779] vmci_dispatch_dgs (drivers/misc/vmw_vmci/vmci_guest.c:332) vmw_vmci
[   22.715923] tasklet_action_common.constprop.0 (kernel/softirq.c:799)
[   22.719008] ? vmci_read_data (drivers/misc/vmw_vmci/vmci_guest.c:308) vmw_vmci
[   22.722018] tasklet_action (kernel/softirq.c:819)
[   22.724865] __do_softirq (kernel/softirq.c:571)
[   22.727650] __irq_exit_rcu (kernel/softirq.c:445 kernel/softirq.c:650)
[   22.730348] irq_exit_rcu (kernel/softirq.c:664)
[   22.732947] common_interrupt (arch/x86/kernel/irq.c:240 (discriminator 14))
[   22.735513]
[   22.737879]
[   22.740141] asm_common_interrupt (./arch/x86/include/asm/idtentry.h:640)
[   22.742498] RIP: 0010:stack_trace_consume_entry (kernel/stacktrace.c:83)
[   22.744891] Code: be 80 01 00 00 48 c7 c7 40 82 cd 82 48 89 e5 e8 7d 38 53 00 5d c3 cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 55 48 89 e5 <41> 55 49 89 f5 41 54 53 48 89 fb 48 83 c7 10 e8 23 e0 36 00 48 8d
All code

   0:   be 80 01 00 00  mov$0x180,%esi
   5:   48 c7 c7 40 82 cd 82mov$0x82cd8240,%rdi
   c:   48 89 e5mov%rsp,%rbp
   f:   e8 7d 38 53 00  call   0x533891
  14:   5d  pop%rbp
  15:   c3  ret
  16:   cc  int3   
  17:   cc  int3   
  18:   cc  int3   
  19:   cc  int3   
  1a:   cc  int3   
  1b:   cc  int3   
  1c:   cc  int3   
  1d:   cc  int3   
  1e:   cc  int3   
  1f:   cc  int3   
  20:   cc  int3   
  21:   0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)
  26:   55  push   %rbp
  27:   48 89 e5mov%rsp,%rbp
  2a:*  41 55   push   %r13 <-- trapping instruction
  2c:   49 89 f5mov%rsi,%r13
  2f:   41 54   push   %r12
  31:   53  push   %rbx
  32:   48 89 fbmov%rdi,%rbx
  35:   48 83 c7 10 add$0x10,%rdi
  39:   e8 23 e0 36 00  call   0x36e061
  3e:   48  rex.W
  3f:   8d  .byte 0x8d

Code starting with the faulting instruction
===
   0:   41 55   push   %r13
   2:   49 89 f5mov%rsi,%r13
   5:   41 54   push   %r12
   7:   53  push   %rbx
   8:   48 89 fbmov%rdi,%rbx
   b:   48 83 c7 10 add$0x10,%rdi
   f:   e8 23 e0 36 00  call   0x36e037
  14:   48 
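
For readers unfamiliar with the rule being violated: tasklets run in softirq
context with preemption disabled, so any function that can sleep is forbidden
there. A minimal, hypothetical reproducer of the same class of bug (not the
VMCI code):

#include <linux/interrupt.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(wq);
static bool data_ready;

/* Tasklets run in atomic (softirq) context: preemption is disabled. */
static void bad_tasklet_fn(struct tasklet_struct *t)
{
	/*
	 * wait_event() may sleep, so with CONFIG_DEBUG_ATOMIC_SLEEP=y this
	 * triggers "BUG: sleeping function called from invalid context";
	 * without the checker it can simply stall or deadlock the CPU.
	 */
	wait_event(wq, data_ready);
}

static DECLARE_TASKLET(bad_tasklet, bad_tasklet_fn);

The usual fix for this pattern is to move the sleeping part to process
context (e.g. a workqueue or threaded handler), or to avoid sleeping on the
atomic path altogether.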

Re: [PATCH v1 2/2] Drivers: virtio: balloon: Report inflated memory

2022-08-15 Thread Nadav Amit via Virtualization
On Aug 15, 2022, at 5:52 AM, Alexander Atanasov wrote:

> 
> Hi,
> 
> On 9.08.22 20:44, Nadav Amit wrote:
>> On Aug 9, 2022, at 2:53 AM, Alexander Atanasov wrote:
>> 
>>> Update the value in page_alloc on balloon fill/leak.
>> 
>> Some general comments if this patch goes forward.
>> 
>> Please cc pv-driv...@vmware.com in the future.
> 
> Ok.
> 
>>> Cc: David Hildenbrand 
>>> Cc: Wei Liu 
>>> Cc: Nadav Amit 
>>> 
>>> Signed-off-by: Alexander Atanasov 
>>> ---
>>> drivers/virtio/virtio_balloon.c | 13 +
>>> 1 file changed, 13 insertions(+)
>>> 
>>> First user; other balloons I will do if it is accepted, to avoid too many emails.
>>> 
>>> 
>>> diff --git a/drivers/virtio/virtio_balloon.c 
>>> b/drivers/virtio/virtio_balloon.c
>>> index b9737da6c4dd..e2693ffbd48b 100644
>>> --- a/drivers/virtio/virtio_balloon.c
>>> +++ b/drivers/virtio/virtio_balloon.c
>>> @@ -208,6 +208,16 @@ static void set_page_pfns(struct virtio_balloon *vb,
>>>   page_to_balloon_pfn(page) + i);
>>> }
>>> 
>>> +static void update_meminfo(struct virtio_balloon *vb)
>> 
>> Putting aside the less-than-optimal function name, I would like to ask that
> 
> Right, i will think of a better one.
> 
>> any new generic balloon logic would go into balloon_compaction.[hc] as much
> 
> If it is going to be a place for generic logic, maybe it should be
> renamed to balloon_common.[ch]?
> 
>> as possible. I made the effort to reuse this infrastructure (which is now
>> used by both VMware and virtio), and would prefer to share as much code as
>> possible.
>> 
>> For instance, I would appreciate if the update upon inflate would go into
>> balloon_page_list_enqueue() and balloon_page_enqueue(). VMware's 2MB pages
>> logic is not shared, so it would require a change that is specific for
>> VMware code.
> 
> I looked at the code and I do not see how I can reuse it, since
> BALLOON_COMPACTION can be disabled and, as you say, even for VMware it
> would require updates in other places. It looks like if I do so I would
> have to handle the update from each driver for both cases. I think it is
> better to clearly mark the spots where drivers do their internal
> recalculations and report to the core. I see only VMware is using
> balloon_page_list_enqueue; the virtio balloon is using only migration, and
> I don't see how to hook it there - I haven't checked the rest of the
> balloons, but I guess it would be similar. I agree it is good to have
> a common place for such logic, but it might be better off as separate
> work in the future.

Fine. I would live with whatever you do and if needed change it later.

Thanks for considering.

Regards,
Nadav

Re: [PATCH v1 2/2] Drivers: virtio: balloon: Report inflated memory

2022-08-09 Thread Nadav Amit via Virtualization


On Aug 9, 2022, at 2:53 AM, Alexander Atanasov wrote:

> Update the value in page_alloc on balloon fill/leak.

Some general comments if this patch goes forward.

Please cc pv-driv...@vmware.com in the future.

> 
> Cc: David Hildenbrand 
> Cc: Wei Liu 
> Cc: Nadav Amit 
> 
> Signed-off-by: Alexander Atanasov 
> ---
> drivers/virtio/virtio_balloon.c | 13 +
> 1 file changed, 13 insertions(+)
> 
> First user; other balloons I will do if it is accepted, to avoid too many emails.
> 
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index b9737da6c4dd..e2693ffbd48b 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -208,6 +208,16 @@ static void set_page_pfns(struct virtio_balloon *vb,
> page_to_balloon_pfn(page) + i);
> }
> 
> +static void update_meminfo(struct virtio_balloon *vb)

Putting aside the less-than-optimal function name, I would like to ask that
any new generic balloon logic would go into balloon_compaction.[hc] as much
as possible. I made the effort to reuse this infrastructure (which is now
used by both VMware and virtio), and would prefer to share as much code as
possible.

For instance, I would appreciate if the update upon inflate would go into
balloon_page_list_enqueue() and balloon_page_enqueue(). VMware's 2MB pages
logic is not shared, so it would require a change that is specific for
VMware code.
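
To make the suggestion concrete, a sketch of the kind of hook meant here
(balloon_page_enqueue() and pages_lock are the existing balloon-compaction
infrastructure; the accounting helper is hypothetical):

#include <linux/balloon_compaction.h>

/* Sketch only, as if modifying mm/balloon_compaction.c: account inflated
 * pages centrally whenever a page enters the balloon. */
void balloon_page_enqueue(struct balloon_dev_info *b_dev_info,
			  struct page *page)
{
	unsigned long flags;

	spin_lock_irqsave(&b_dev_info->pages_lock, flags);
	balloon_page_enqueue_one(b_dev_info, page);
	balloon_account_inflated(b_dev_info, 1);	/* hypothetical helper */
	spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
}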



Re: [Pv-drivers] [PATCH 29/36] cpuidle, xenpv: Make more PARAVIRT_XXL noinstr clean

2022-06-13 Thread Nadav Amit via Virtualization
On Jun 13, 2022, at 11:48 AM, Srivatsa S. Bhat  wrote:

> 
> On 6/8/22 4:27 PM, Peter Zijlstra wrote:
>> vmlinux.o: warning: objtool: acpi_idle_enter_s2idle+0xde: call to wbinvd() 
>> leaves .noinstr.text section
>> vmlinux.o: warning: objtool: default_idle+0x4: call to arch_safe_halt() 
>> leaves .noinstr.text section
>> vmlinux.o: warning: objtool: xen_safe_halt+0xa: call to 
>> HYPERVISOR_sched_op.constprop.0() leaves .noinstr.text section
>> 
>> Signed-off-by: Peter Zijlstra (Intel) 
> 
> Reviewed-by: Srivatsa S. Bhat (VMware) 
> 
>> 
>> -static inline void wbinvd(void)
>> +extern noinstr void pv_native_wbinvd(void);
>> +
>> +static __always_inline void wbinvd(void)
>> {
>>  PVOP_ALT_VCALL0(cpu.wbinvd, "wbinvd", ALT_NOT(X86_FEATURE_XENPV));
>> }

I guess it is yet another instance of GCC wrongly accounting for the
assembly blocks’ weight. It is not a solution for older GCCs, but
presumably PVOP_ALT_CALL() and friends should have used asm_inline or
some new “asm_volatile_inline” variant.
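
For reference, a small sketch of the difference (asm_inline is the existing
macro from <linux/compiler_types.h>; the wrapper name is made up):

/*
 * With plain asm(), GCC estimates the instruction count from the size of
 * the template string, so a large ALTERNATIVE/paravirt template can make a
 * trivial wrapper look too big to inline. asm_inline tells GCC to treat the
 * block as minimum size for inlining decisions.
 */
static __always_inline void wbinvd_sketch(void)
{
	asm_inline volatile ("wbinvd" ::: "memory");
}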


Re: [RFC for Linux] virtio_balloon: Add VIRTIO_BALLOON_F_THP_ORDER to handle THP spilt issue

2020-03-31 Thread Nadav Amit via Virtualization
> On Mar 31, 2020, at 7:09 AM, David Hildenbrand  wrote:
> 
> On 31.03.20 16:07, Michael S. Tsirkin wrote:
>> On Tue, Mar 31, 2020 at 04:03:18PM +0200, David Hildenbrand wrote:
>>> On 31.03.20 15:37, Michael S. Tsirkin wrote:
 On Tue, Mar 31, 2020 at 03:32:05PM +0200, David Hildenbrand wrote:
> On 31.03.20 15:24, Michael S. Tsirkin wrote:
>> On Tue, Mar 31, 2020 at 12:35:24PM +0200, David Hildenbrand wrote:
>>> On 26.03.20 10:49, Michael S. Tsirkin wrote:
 On Thu, Mar 26, 2020 at 08:54:04AM +0100, David Hildenbrand wrote:
>> On 26.03.2020 at 08:21, Michael S. Tsirkin wrote:
>> 
>> On Thu, Mar 12, 2020 at 09:51:25AM +0100, David Hildenbrand wrote:
 On 12.03.20 09:47, Michael S. Tsirkin wrote:
 On Thu, Mar 12, 2020 at 09:37:32AM +0100, David Hildenbrand wrote:
> 2. You are essentially stealing THPs in the guest. So the fastest
> mapping (THP in guest and host) is gone. The guest won't be able 
> to make
> use of THP where it previously was able to. I can imagine this 
> implies a
> performance degradation for some workloads. This needs a proper
> performance evaluation.
 
 I think the problem is more with the alloc_pages API.
 That gives you exactly the given order, and if there's
 a larger chunk available, it will split it up.
 
 But for balloon - I suspect lots of other users,
 we do not want to stress the system but if a large
 chunk is available anyway, then we could handle
 that more optimally by getting it all in one go.
 
 
 So if we want to address this, IMHO this calls for a new API.
 Along the lines of
 
   struct page *alloc_page_range(gfp_t gfp, unsigned int min_order,
   unsigned int max_order, unsigned int *order)
 
 the idea would then be to return a number of pages in the given
 range.
 
 What do you think? Want to try implementing that?
>>> 
>>> You can just start with the highest order and decrement the order 
>>> until
>>> your allocation succeeds using alloc_pages(), which would be enough 
>>> for
>>> a first version. At least I don't see the immediate need for a new
>>> kernel API.
>> 
>> OK I remember now.  The problem is with reclaim. Unless reclaim is
>> completely disabled, any of these calls can sleep. After it wakes up,
>> we would like to get the larger order that has become available
>> meanwhile.
> 
> Yes, but that’s a pure optimization IMHO.
> So I think we should do a trivial implementation first and then see 
> what we gain from a new allocator API. Then we might also be able to 
> justify it using real numbers.
 
 Well how do you propose implement the necessary semantics?
 I think we are both agreed that alloc_page_range is more or
 less what's necessary anyway - so how would you approximate it
 on top of existing APIs?
>>> diff --git a/include/linux/balloon_compaction.h 
>>> b/include/linux/balloon_compaction.h
 
 .
 
 
>>> diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
>>> index 26de020aae7b..067810b32813 100644
>>> --- a/mm/balloon_compaction.c
>>> +++ b/mm/balloon_compaction.c
>>> @@ -112,23 +112,35 @@ size_t balloon_page_list_dequeue(struct 
>>> balloon_dev_info *b_dev_info,
>>> EXPORT_SYMBOL_GPL(balloon_page_list_dequeue);
>>> 
>>> /*
>>> - * balloon_page_alloc - allocates a new page for insertion into the 
>>> balloon
>>> - * page list.
>>> + * balloon_pages_alloc - allocates a new page (of at most the given 
>>> order)
>>> + *  for insertion into the balloon page list.
>>>  *
>>>  * Driver must call this function to properly allocate a new balloon 
>>> page.
>>>  * Driver must call balloon_page_enqueue before definitively removing 
>>> the page
>>>  * from the guest system.
>>>  *
>>> + * Will fall back to smaller orders if allocation fails. The order of 
>>> the
>>> + * allocated page is stored in page->private.
>>> + *
>>>  * Return: struct page for the allocated page or NULL on allocation 
>>> failure.
>>>  */
>>> -struct page *balloon_page_alloc(void)
>>> +struct page *balloon_pages_alloc(int order)
>>> {
>>> -   struct page *page = alloc_page(balloon_mapping_gfp_mask() |
>>> -  __GFP_NOMEMALLOC | __GFP_NORETRY 
>>> |
>>> -  __GFP_NOWARN);
>>> -   return page;
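
A minimal sketch of the “start at the highest order and step down” fallback
discussed in this thread (the function name is illustrative, not an existing
kernel API):

/* Illustrative only: try the largest order first and fall back to smaller
 * orders until an allocation succeeds or order 0 fails as well. */
static struct page *balloon_alloc_order_fallback(gfp_t gfp,
						 unsigned int max_order,
						 unsigned int *order)
{
	unsigned int o = max_order;

	for (;;) {
		struct page *page = alloc_pages(gfp | __GFP_NOWARN, o);

		if (page) {
			*order = o;
			return page;
		}
		if (o-- == 0)
			return NULL;
	}
}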

Re: [RFC for Linux] virtio_balloon: Add VIRTIO_BALLOON_F_THP_ORDER to handle THP spilt issue

2020-03-31 Thread Nadav Amit via Virtualization
> On Mar 31, 2020, at 6:32 AM, David Hildenbrand  wrote:
> 
> On 31.03.20 15:24, Michael S. Tsirkin wrote:
>> On Tue, Mar 31, 2020 at 12:35:24PM +0200, David Hildenbrand wrote:
>>> On 26.03.20 10:49, Michael S. Tsirkin wrote:
 On Thu, Mar 26, 2020 at 08:54:04AM +0100, David Hildenbrand wrote:
>> On 26.03.2020 at 08:21, Michael S. Tsirkin wrote:
>> 
>> On Thu, Mar 12, 2020 at 09:51:25AM +0100, David Hildenbrand wrote:
 On 12.03.20 09:47, Michael S. Tsirkin wrote:
 On Thu, Mar 12, 2020 at 09:37:32AM +0100, David Hildenbrand wrote:
> 2. You are essentially stealing THPs in the guest. So the fastest
> mapping (THP in guest and host) is gone. The guest won't be able to 
> make
> use of THP where it previously was able to. I can imagine this 
> implies a
> performance degradation for some workloads. This needs a proper
> performance evaluation.
 
 I think the problem is more with the alloc_pages API.
 That gives you exactly the given order, and if there's
 a larger chunk available, it will split it up.
 
 But for balloon - I suspect lots of other users,
 we do not want to stress the system but if a large
 chunk is available anyway, then we could handle
 that more optimally by getting it all in one go.
 
 
 So if we want to address this, IMHO this calls for a new API.
 Along the lines of
 
   struct page *alloc_page_range(gfp_t gfp, unsigned int min_order,
   unsigned int max_order, unsigned int *order)
 
 the idea would then be to return a number of pages in the given
 range.
 
 What do you think? Want to try implementing that?
>>> 
>>> You can just start with the highest order and decrement the order until
>>> your allocation succeeds using alloc_pages(), which would be enough for
>>> a first version. At least I don't see the immediate need for a new
>>> kernel API.
>> 
>> OK I remember now.  The problem is with reclaim. Unless reclaim is
>> completely disabled, any of these calls can sleep. After it wakes up,
>> we would like to get the larger order that has become available
>> meanwhile.
> 
> Yes, but that’s a pure optimization IMHO.
> So I think we should do a trivial implementation first and then see what 
> we gain from a new allocator API. Then we might also be able to justify 
> it using real numbers.
 
 Well how do you propose implement the necessary semantics?
 I think we are both agreed that alloc_page_range is more or
 less what's necessary anyway - so how would you approximate it
 on top of existing APIs?
>>> 
>>> Looking at drivers/misc/vmw_balloon.c:vmballoon_inflate(), it first
>>> tries to allocate huge pages using
>>> 
>>> alloc_pages(__GFP_HIGHMEM|__GFP_NOWARN| __GFP_NOMEMALLOC, 
>>>VMW_BALLOON_2M_ORDER)
>>> 
>>> And then falls back to 4k allocations (balloon_page_alloc()) in case
>>> allocation fails.
>>> 
>>> I'm roughly thinking of something like the following, but with an
>>> optimized reporting interface/bigger pfn array so we can report >
>>> 1MB at a time. Also, it might make sense to remember the order that
>>> succeeded across some fill_balloon() calls.
>>> 
>>> Don't even expect it to compile ...
>>> 
>>> 
>>> 
 From 4305f989672ccca4be9293e6d4167e929f3e299b Mon Sep 17 00:00:00 2001
>>> From: David Hildenbrand 
>>> Date: Tue, 31 Mar 2020 12:28:07 +0200
>>> Subject: [PATCH RFC] tmp
>>> 
>>> Signed-off-by: David Hildenbrand 
>>> ---
>>> drivers/virtio/virtio_balloon.c| 38 ++
>>> include/linux/balloon_compaction.h |  7 -
>>> mm/balloon_compaction.c| 43 +++---
>>> 3 files changed, 67 insertions(+), 21 deletions(-)
>>> 
>>> diff --git a/drivers/virtio/virtio_balloon.c 
>>> b/drivers/virtio/virtio_balloon.c
>>> index 8511d258dbb4..0660b1b988f0 100644
>>> --- a/drivers/virtio/virtio_balloon.c
>>> +++ b/drivers/virtio/virtio_balloon.c
>>> @@ -187,7 +187,7 @@ int virtballoon_free_page_report(struct 
>>> page_reporting_dev_info *pr_dev_info,
>>> }
>>> 
>>> static void set_page_pfns(struct virtio_balloon *vb,
>>> - __virtio32 pfns[], struct page *page)
>>> + __virtio32 pfns[], struct page *page, int order)
>>> {
>>> unsigned int i;
>>> 
>>> @@ -197,7 +197,7 @@ static void set_page_pfns(struct virtio_balloon *vb,
>>>  * Set balloon pfns pointing at this page.
>>>  * Note that the first pfn points at start of the page.
>>>  */
>>> -   for (i = 0; i < VIRTIO_BALLOON_PAGES_PER_PAGE; i++)
>>> +   for (i = 0; i < VIRTIO_BALLOON_PAGES_PER_PAGE * (1 << order); i++)
>>> pfns[i] = cpu_to_virtio32(vb->vdev,
>>>   page_to_balloon_pfn(page) + i);
>>> }
>>> @@ 

Re: Balloon pressuring page cache

2020-02-04 Thread Nadav Amit via Virtualization
> On Feb 3, 2020, at 2:50 PM, Nadav Amit  wrote:
> 
>> On Feb 3, 2020, at 8:34 AM, David Hildenbrand  wrote:
>> 
>> On 03.02.20 17:18, Alexander Duyck wrote:
>>> On Mon, 2020-02-03 at 08:11 -0500, Michael S. Tsirkin wrote:
 On Thu, Jan 30, 2020 at 11:59:46AM -0800, Tyler Sanderson wrote:
> On Thu, Jan 30, 2020 at 7:31 AM Wang, Wei W  wrote:
> 
>   On Thursday, January 30, 2020 11:03 PM, David Hildenbrand wrote:
>> On 29.01.20 20:11, Tyler Sanderson wrote:
>>> On Wed, Jan 29, 2020 at 2:31 AM David Hildenbrand >> > wrote:
>>> 
>>>   On 29.01.20 01:22, Tyler Sanderson via Virtualization wrote:
 A primary advantage of virtio balloon over other memory reclaim
 mechanisms is that it can pressure the guest's page cache into
>>>   shrinking.
 However, since the balloon driver changed to using the shrinker
>   API
>> > e99a28e355255a#diff-fd202acf694d9eba19c8c64da3e480c9> this
 use case has become a bit more tricky. I'm wondering what the
>> intended
 device implementation is.
 
 When inflating the balloon against page cache (i.e. no free
>   memory
 remains) vmscan.c will both shrink page cache, but also invoke
>   the
 shrinkers -- including the balloon's shrinker. So the balloon
>   driver
 allocates memory which requires reclaim, vmscan gets this memory
>> by
 shrinking the balloon, and then the driver adds the memory back
>   to
>> the
 balloon. Basically a busy no-op.
> 
>   Per my understanding, the balloon allocation won’t invoke shrinker as
>   __GFP_DIRECT_RECLAIM isn't set, no?
> 
> I could be wrong about the mechanism, but the device sees lots of 
> activity on
> the deflate queue. The balloon is being shrunk. And this only starts once 
> all
> free memory is depleted and we're inflating into page cache.
 
 So given this looks like a regression, maybe we should revert the
 patch in question 71994620bb25 ("virtio_balloon: replace oom notifier with 
 shrinker")
 Besides, with VIRTIO_BALLOON_F_FREE_PAGE_HINT
 shrinker also ignores VIRTIO_BALLOON_F_MUST_TELL_HOST which isn't nice
 at all.
 
 So it looks like all this rework introduced more issues than it
 addressed ...
 
 I also CC Alex Duyck for an opinion on this.
 Alex, what do you use to put pressure on page cache?
>>> 
>>> I would say reverting probably makes sense. I'm not sure there is much
>>> value to having a shrinker running deflation when you are actively trying
>>> to increase the balloon. It would make more sense to wait until you are
>>> actually about to start hitting oom.
>> 
>> I think the shrinker makes sense for free page hinting feature
>> (everything on free_page_list).
>> 
>> So instead of only reverting, I think we should split it up and always
>> register the shrinker for VIRTIO_BALLOON_F_FREE_PAGE_HINT and the OOM
>> notifier (as before) for VIRTIO_BALLOON_F_MUST_TELL_HOST.
>> 
>> (Of course, adapting what is being done in the shrinker and in the OOM
>> notifier)
> 
> David,
> 
> Please keep me posted. I decided to adapt the same solution as the virtio
> balloon for the VMware balloon. If the verdict is that this is damaging and
> the OOM notifier should be used instead, I will submit patches to move to
> OOM notifier as well.

Adding some information for the record (if someone googles this thread):

In the VMware balloon driver, the shrinker is disabled by default since we
encountered a performance degradation in testing. I tried to avoid rapid
inflation/shrinker-deflation cycles by adding a timeout, but apparently it
did not help in avoiding the performance regression.

So there is no such issue in VMware balloon driver, unless someone
intentionally enables the shrinker through a module parameter.
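
For the record as well, a sketch of the OOM-notifier alternative discussed
above (register_oom_notifier() is the existing API; the deflate helper and
its page count are hypothetical):

#include <linux/oom.h>
#include <linux/notifier.h>

/* Sketch only: deflate on OOM instead of from a shrinker. */
static int balloon_oom_notify(struct notifier_block *nb,
			      unsigned long dummy, void *parm)
{
	unsigned long *freed = parm;

	*freed += balloon_deflate_pages(OOM_DEFLATE_PAGES);	/* hypothetical */
	return NOTIFY_OK;
}

static struct notifier_block balloon_oom_nb = {
	.notifier_call = balloon_oom_notify,
};

/* probe:  register_oom_notifier(&balloon_oom_nb);
 * remove: unregister_oom_notifier(&balloon_oom_nb); */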


Re: Balloon pressuring page cache

2020-02-03 Thread Nadav Amit via Virtualization
> On Feb 3, 2020, at 8:34 AM, David Hildenbrand  wrote:
> 
> On 03.02.20 17:18, Alexander Duyck wrote:
>> On Mon, 2020-02-03 at 08:11 -0500, Michael S. Tsirkin wrote:
>>> On Thu, Jan 30, 2020 at 11:59:46AM -0800, Tyler Sanderson wrote:
 On Thu, Jan 30, 2020 at 7:31 AM Wang, Wei W  wrote:
 
On Thursday, January 30, 2020 11:03 PM, David Hildenbrand wrote:
> On 29.01.20 20:11, Tyler Sanderson wrote:
>> On Wed, Jan 29, 2020 at 2:31 AM David Hildenbrand > > wrote:
>> 
>>On 29.01.20 01:22, Tyler Sanderson via Virtualization wrote:
>>> A primary advantage of virtio balloon over other memory reclaim
>>> mechanisms is that it can pressure the guest's page cache into
>>shrinking.
>>> However, since the balloon driver changed to using the shrinker
API
>  e99a28e355255a#diff-fd202acf694d9eba19c8c64da3e480c9> this
>>> use case has become a bit more tricky. I'm wondering what the
> intended
>>> device implementation is.
>>> 
>>> When inflating the balloon against page cache (i.e. no free
memory
>>> remains) vmscan.c will both shrink page cache, but also invoke
the
>>> shrinkers -- including the balloon's shrinker. So the balloon
driver
>>> allocates memory which requires reclaim, vmscan gets this memory
> by
>>> shrinking the balloon, and then the driver adds the memory back
to
> the
>>> balloon. Basically a busy no-op.
 
Per my understanding, the balloon allocation won’t invoke shrinker as
__GFP_DIRECT_RECLAIM isn't set, no?
 
 I could be wrong about the mechanism, but the device sees lots of activity 
 on
 the deflate queue. The balloon is being shrunk. And this only starts once 
 all
 free memory is depleted and we're inflating into page cache.
>>> 
>>> So given this looks like a regression, maybe we should revert the
>>> patch in question 71994620bb25 ("virtio_balloon: replace oom notifier with 
>>> shrinker")
>>> Besides, with VIRTIO_BALLOON_F_FREE_PAGE_HINT
>>> shrinker also ignores VIRTIO_BALLOON_F_MUST_TELL_HOST which isn't nice
>>> at all.
>>> 
>>> So it looks like all this rework introduced more issues than it
>>> addressed ...
>>> 
>>> I also CC Alex Duyck for an opinion on this.
>>> Alex, what do you use to put pressure on page cache?
>> 
>> I would say reverting probably makes sense. I'm not sure there is much
>> value to having a shrinker running deflation when you are actively trying
>> to increase the balloon. It would make more sense to wait until you are
>> actually about to start hitting oom.
> 
> I think the shrinker makes sense for free page hinting feature
> (everything on free_page_list).
> 
> So instead of only reverting, I think we should split it up and always
> register the shrinker for VIRTIO_BALLOON_F_FREE_PAGE_HINT and the OOM
> notifier (as before) for VIRTIO_BALLOON_F_MUST_TELL_HOST.
> 
> (Of course, adapting what is being done in the shrinker and in the OOM
> notifier)

David,

Please keep me posted. I decided to adapt the same solution as the virtio
balloon for the VMware balloon. If the verdict is that this is damaging and
the OOM notifier should be used instead, I will submit patches to move to
OOM notifier as well.

Regards,
Nadav


[PATCH v4 0/9] x86/tlb: Concurrent TLB flushes

2019-08-24 Thread Nadav Amit via Virtualization
[ Similar cover letter to v3 with updated performance numbers on Skylake.
  Sorry for the time it took since the last version. ]

Currently, local and remote TLB flushes are not performed concurrently,
which introduces unnecessary overhead - each PTE flushing can take 100s
of cycles. This patch-set allows TLB flushes to be run concurrently:
first request the remote CPUs to initiate the flush, then run it
locally, and finally wait for the remote CPUs to finish their work.
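
In code terms, the change in ordering looks roughly like this (the helper
names are hypothetical stand-ins; the series implements the logic inside the
smp_call_function infrastructure rather than as a separate function):

/* Conceptual sketch only: overlap the local flush with the remote ones. */
static void flush_tlb_multi_sketch(const struct cpumask *cpus,
				   const struct flush_tlb_info *info)
{
	send_flush_ipis(cpus, info);	/* hypothetical: queue csd + send IPIs, no wait */
	local_flush_tlb_sketch(info);	/* hypothetical: local flush overlaps remote work */
	wait_for_flush_ipis(cpus);	/* hypothetical: wait for each csd to complete */
}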

In addition, there are various small optimizations to avoid, for
example, unwarranted false-sharing.

The proposed changes should also improve the performance of other
invocations of on_each_cpu(). Hopefully, no one has relied on this
behavior of on_each_cpu() that invoked functions first remotely and only
then locally [Peter says he remembers someone might do so, but without
further information it is hard to know how to address it].

Running sysbench on dax/ext4 w/emulated-pmem, write-cache disabled on
2-socket, 56-logical-cores (28+SMT) Skylake, 5 repetitions:

sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
 --file-io-mode=mmap --threads=X --file-fsync-mode=fdatasync run

 Th.   tip-aug22 avg (stdev)   +patch-set avg (stdev)  change
 ---   -   --  --
 1  1152920 (7453)  1169469 (9059)  +1.4%
 2  1545832 (12555) 1584172 (10484) +2.4%
 4  2480703 (12039) 2518641 (12875) +1.5%
 8  3684486 (26007) 3840343 (44144) +4.2%
 16 4981567 (23565) 5125756 (15458) +2.8%
 32 5679542 (10116) 5887826 (6121)  +3.6%
 56 5630944 (17937) 5812514 (26264) +3.2%

(Note that on configurations with up to 28 threads numactl was used to
set all threads on socket 1, which explains the drop in performance when
going to 32 threads).

Running the same benchmark with security mitigations disabled (PTI,
Spectre, MDS):

 Th.   tip-aug22 avg (stdev)   +patch-set avg (stdev)  change
 ---   -   --  --
 1  1444119 (8524)  1469606 (10527) +1.7%
 2  1921540 (24169) 1961899 (14450) +2.1%
 4  3073716 (21786) 3199880 (16774) +4.1%
 8  4700698 (49534) 4802312 (11043) +2.1%
 16 6005180 (6366)  6006656 (31624)0%
 32 6826466 (10496) 6886622 (19110) +0.8%
 56 6832344 (13468) 6885586 (20646) +0.8%

The results are somewhat different from those obtained earlier on Haswell-X
(sent before), and the maximum performance improvement is smaller. However,
the performance improvement is still significant.

v3 -> v4:
  * Merge flush_tlb_func_local and flush_tlb_func_remote() [Peter]
  * Prevent preemption on_each_cpu(). It is not needed, but it prevents
concerns. [Peter/tglx]
  * Adding acked-, review-by tags

v2 -> v3:
* Open-code the remote/local-flush decision code [Andy]
* Fix hyper-v, Xen implementations [Andrew]
* Fix redundant TLB flushes.

v1 -> v2:
* Removing the patches that Thomas took [tglx]
* Adding hyper-v, Xen compile-tested implementations [Dave]
* Removing UV [Andy]
* Adding lazy optimization, removing inline keyword [Dave]
* Restructuring patch-set

RFCv2 -> v1:
* Fix comment on flush_tlb_multi [Juergen]
* Removing async invalidation optimizations [Andy]
* Adding KVM support [Paolo]

Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Boris Ostrovsky 
Cc: Dave Hansen 
Cc: Haiyang Zhang 
Cc: Ingo Molnar 
Cc: Josh Poimboeuf 
Cc: Juergen Gross 
Cc: "K. Y. Srinivasan" 
Cc: Paolo Bonzini 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Sasha Levin 
Cc: Stephen Hemminger 
Cc: Thomas Gleixner 
Cc: k...@vger.kernel.org
Cc: linux-hyp...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: x...@kernel.org
Cc: xen-de...@lists.xenproject.org

Nadav Amit (9):
  smp: Run functions concurrently in smp_call_function_many()
  x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote()
  x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
  x86/mm/tlb: Flush remote and local TLBs concurrently
  x86/mm/tlb: Privatize cpu_tlbstate
  x86/mm/tlb: Do not make is_lazy dirty for no reason
  cpumask: Mark functions as pure
  x86/mm/tlb: Remove UV special case
  x86/mm/tlb: Remove unnecessary uses of the inline keyword

 arch/x86/hyperv/mmu.c |  10 +-
 arch/x86/include/asm/paravirt.h   |   6 +-
 arch/x86/include/asm/paravirt_types.h |   4 +-
 arch/x86/include/asm/tlbflush.h   |  52 +++
 arch/x86/include/asm/trace/hyperv.h   |   2 +-
 arch/x86/kernel/kvm.c |  11 +-
 arch/x86/kernel/paravirt.c|   2 +-
 arch/x86/mm/init.c|   2 +-
 arch/x86/mm/tlb.c | 195 ++
 arch/x86/xen/mmu_pv.c |  11 +-
 include/linux/cpumask.h 

[PATCH v4 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-08-24 Thread Nadav Amit via Virtualization
To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. Introduce
paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
and hyper-v are only compile-tested).

While the updated smp infrastructure is capable of running a function on
a single local core, it is not optimized for this case. The multiple
function calls and the indirect branch introduce some overhead, and
might make local TLB flushes slower than they were before the recent
changes.

Before calling the SMP infrastructure, check if only a local TLB flush
is needed to restore the lost performance in this common case. This
requires checking mm_cpumask() one more time, but unless this mask is
updated very frequently, this should not impact performance negatively.
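
As a sketch, the fast-path check described here amounts to the following
(simplified; cpumask_any_but() and mm_cpumask() are existing kernel APIs,
and the two flush calls stand in for what this series introduces):

/* Simplified sketch of the local-only fast path. */
static void flush_tlb_sketch(struct mm_struct *mm, struct flush_tlb_info *info)
{
	/* If no other CPU has this mm in its cpumask, skip the SMP machinery. */
	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) >= nr_cpu_ids) {
		flush_tlb_func_local(info);
		return;
	}
	flush_tlb_multi(mm_cpumask(mm), info);
}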

Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Sasha Levin 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: Juergen Gross 
Cc: Paolo Bonzini 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Boris Ostrovsky 
Cc: linux-hyp...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: k...@vger.kernel.org
Cc: xen-de...@lists.xenproject.org
Reviewed-by: Michael Kelley  # Hyper-v parts
Reviewed-by: Juergen Gross  # Xen and paravirt parts
Reviewed-by: Dave Hansen 
Signed-off-by: Nadav Amit 
---
 arch/x86/hyperv/mmu.c | 10 +++---
 arch/x86/include/asm/paravirt.h   |  6 ++--
 arch/x86/include/asm/paravirt_types.h |  4 +--
 arch/x86/include/asm/tlbflush.h   |  8 ++---
 arch/x86/include/asm/trace/hyperv.h   |  2 +-
 arch/x86/kernel/kvm.c | 11 +--
 arch/x86/kernel/paravirt.c|  2 +-
 arch/x86/mm/tlb.c | 45 ++-
 arch/x86/xen/mmu_pv.c | 11 +++
 include/trace/events/xen.h|  2 +-
 10 files changed, 61 insertions(+), 40 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index e65d7fe6489f..8740d8b21db3 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -50,8 +50,8 @@ static inline int fill_gva_list(u64 gva_list[], int offset,
return gva_n - offset;
 }
 
-static void hyperv_flush_tlb_others(const struct cpumask *cpus,
-   const struct flush_tlb_info *info)
+static void hyperv_flush_tlb_multi(const struct cpumask *cpus,
+  const struct flush_tlb_info *info)
 {
int cpu, vcpu, gva_n, max_gvas;
struct hv_tlb_flush **flush_pcpu;
@@ -59,7 +59,7 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
u64 status = U64_MAX;
unsigned long flags;
 
-   trace_hyperv_mmu_flush_tlb_others(cpus, info);
+   trace_hyperv_mmu_flush_tlb_multi(cpus, info);
 
if (!hv_hypercall_pg)
goto do_native;
@@ -156,7 +156,7 @@ static void hyperv_flush_tlb_others(const struct cpumask *cpus,
if (!(status & HV_HYPERCALL_RESULT_MASK))
return;
 do_native:
-   native_flush_tlb_others(cpus, info);
+   native_flush_tlb_multi(cpus, info);
 }
 
 static u64 hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
@@ -231,6 +231,6 @@ void hyperv_setup_mmu_ops(void)
return;
 
pr_info("Using hypercall for remote TLB flush\n");
-   pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
+   pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 69089d46f128..bc4829c9b3f9 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -62,10 +62,10 @@ static inline void __flush_tlb_one_user(unsigned long addr)
PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
 }
 
-static inline void flush_tlb_others(const struct cpumask *cpumask,
-   const struct flush_tlb_info *info)
+static inline void flush_tlb_multi(const struct cpumask *cpumask,
+  const struct flush_tlb_info *info)
 {
-   PVOP_VCALL2(mmu.flush_tlb_others, cpumask, info);
+   PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
 }
 
 static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb, void *table)
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 70b654f3ffe5..63fa751344bf 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -206,8 +206,8 @@ struct pv_mmu_ops {
void (*flush_tlb_user)(void);
void (*flush_tlb_kernel)(void);
void (*flush_tlb_one_user)(unsigned long addr);
-   void (*flush_tlb_others)(const struct cpumask *cpus,
-const struct flush_tlb_info *info);
+   void (*flush_tlb_multi)(const struct cpumask *cpus,
+   

Re: [PATCH] mm/balloon_compaction: suppress allocation warnings

2019-08-21 Thread Nadav Amit via Virtualization
> On Aug 21, 2019, at 12:13 PM, David Hildenbrand  wrote:
> 
> On 21.08.19 18:34, Nadav Amit wrote:
>>> On Aug 21, 2019, at 9:29 AM, David Hildenbrand  wrote:
>>> 
>>> On 21.08.19 18:23, Nadav Amit wrote:
> On Aug 21, 2019, at 9:05 AM, David Hildenbrand  wrote:
> 
> On 20.08.19 11:16, Nadav Amit wrote:
>> There is no reason to print warnings when balloon page allocation fails,
>> as they are expected and can be handled gracefully.  Since VMware
>> balloon now uses balloon-compaction infrastructure, and suppressed these
>> warnings before, it is also beneficial to suppress these warnings to
>> keep the same behavior that the balloon had before.
> 
> I am not sure if that's a good idea. The allocation warnings are usually
> the only trace of "the user/admin did something bad because he/she tried
> to inflate the balloon to an unsafe value". Believe me, I processed a
> couple of such bugreports related to virtio-balloon and the warning were
> very helpful for that.
 
 Ok, so a message is needed, but does it have to be a generic frightening
 warning?
 
 How about using __GFP_NOWARN, and if allocation fails, do something like:
 
 pr_warn(“Balloon memory allocation failed”);
 
 Or even something more informative? This would surely be less intimidating
 for common users.
>>> 
>>> ratelimit would make sense :)
>>> 
>>> And yes, this would certainly be nicer.
>> 
>> Thanks. I will post v2 of the patch.
> 
> As discussed in v2, we already print a warning in virtio-balloon, so I
> am fine with this patch.
> 
> Reviewed-by: David Hildenbrand 

Michael,

If it is possible to get it to 5.3, to avoid behavioral change for VMware
balloon users, it would be great.

Thanks,
Nadav

Re: [PATCH v2] mm/balloon_compaction: Informative allocation warnings

2019-08-21 Thread Nadav Amit via Virtualization
> On Aug 21, 2019, at 12:12 PM, David Hildenbrand  wrote:
> 
> On 21.08.19 21:10, Nadav Amit wrote:
>>> On Aug 21, 2019, at 12:06 PM, David Hildenbrand  wrote:
>>> 
>>> On 21.08.19 20:59, Nadav Amit wrote:
> On Aug 21, 2019, at 11:57 AM, David Hildenbrand  wrote:
> 
> On 21.08.19 11:41, Nadav Amit wrote:
>> There is no reason to print generic warnings when balloon memory
>> allocation fails, as failures are expected and can be handled
>> gracefully. Since VMware balloon now uses balloon-compaction
>> infrastructure, and suppressed these warnings before, it is also
>> beneficial to suppress these warnings to keep the same behavior that the
>> balloon had before.
>> 
>> Since such warnings can still be useful to indicate that the balloon is
>> over-inflated, print a more informative and less frightening warning if
>> allocation fails instead.
>> 
>> Cc: David Hildenbrand 
>> Cc: Jason Wang 
>> Signed-off-by: Nadav Amit 
>> 
>> ---
>> 
>> v1->v2:
>> * Print informative warnings instead suppressing [David]
>> ---
>> mm/balloon_compaction.c | 7 ++-
>> 1 file changed, 6 insertions(+), 1 deletion(-)
>> 
>> diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
>> index 798275a51887..0c1d1f7689f0 100644
>> --- a/mm/balloon_compaction.c
>> +++ b/mm/balloon_compaction.c
>> @@ -124,7 +124,12 @@ EXPORT_SYMBOL_GPL(balloon_page_list_dequeue);
>> struct page *balloon_page_alloc(void)
>> {
>>  struct page *page = alloc_page(balloon_mapping_gfp_mask() |
>> -   __GFP_NOMEMALLOC | 
>> __GFP_NORETRY);
>> +   __GFP_NOMEMALLOC | __GFP_NORETRY 
>> |
>> +   __GFP_NOWARN);
>> +
>> +if (!page)
>> +pr_warn_ratelimited("memory balloon: memory allocation 
>> failed");
>> +
>>  return page;
>> }
>> EXPORT_SYMBOL_GPL(balloon_page_alloc);
> 
> Not sure if "memory balloon" is the right wording. hmmm.
> 
> Acked-by: David Hildenbrand 
 
 Do you have a better suggestion?
>>> 
>>> Not really - that's why I ack'ed :)
>>> 
>>> However, thinking about it - what about moving the check + print to the
>>> caller and then using dev_warn... or sth. like simple "virtio_balloon:
>>> ..." ? You can then drop the warning for vmware balloon if you feel like
>>> not needing it.
>> 
>> Actually, there is already a warning that is printed by the virtio_balloon
>> in fill_balloon():
>> 
>>struct page *page = balloon_page_alloc();
>> 
>>if (!page) {
>>dev_info_ratelimited(&vb->vdev->dev,
>> "Out of puff! Can't get %u 
>> pages\n",
>> VIRTIO_BALLOON_PAGES_PER_PAGE);
>>/* Sleep for at least 1/5 of a second before retry. */
>>msleep(200);
>>break;
>>}
>> 
>> So are you ok with going back to v1?
> 
> Whoops, I missed that - sorry - usually the warnings scream louder at me :D
> 
> Yes, v1 is fine with me!

Thanks, I missed this one too. This change should prevent making users
concerned for no good reason.



Re: [PATCH v2] mm/balloon_compaction: Informative allocation warnings

2019-08-21 Thread Nadav Amit via Virtualization
> On Aug 21, 2019, at 12:06 PM, David Hildenbrand  wrote:
> 
> On 21.08.19 20:59, Nadav Amit wrote:
>>> On Aug 21, 2019, at 11:57 AM, David Hildenbrand  wrote:
>>> 
>>> On 21.08.19 11:41, Nadav Amit wrote:
 There is no reason to print generic warnings when balloon memory
 allocation fails, as failures are expected and can be handled
 gracefully. Since VMware balloon now uses balloon-compaction
 infrastructure, and suppressed these warnings before, it is also
 beneficial to suppress these warnings to keep the same behavior that the
 balloon had before.
 
 Since such warnings can still be useful to indicate that the balloon is
 over-inflated, print a more informative and less frightening warning if
 allocation fails instead.
 
 Cc: David Hildenbrand 
 Cc: Jason Wang 
 Signed-off-by: Nadav Amit 
 
 ---
 
 v1->v2:
 * Print informative warnings instead suppressing [David]
 ---
 mm/balloon_compaction.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)
 
 diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
 index 798275a51887..0c1d1f7689f0 100644
 --- a/mm/balloon_compaction.c
 +++ b/mm/balloon_compaction.c
 @@ -124,7 +124,12 @@ EXPORT_SYMBOL_GPL(balloon_page_list_dequeue);
 struct page *balloon_page_alloc(void)
 {
struct page *page = alloc_page(balloon_mapping_gfp_mask() |
 - __GFP_NOMEMALLOC | __GFP_NORETRY);
 + __GFP_NOMEMALLOC | __GFP_NORETRY |
 + __GFP_NOWARN);
 +
 +  if (!page)
 +  pr_warn_ratelimited("memory balloon: memory allocation failed");
 +
return page;
 }
 EXPORT_SYMBOL_GPL(balloon_page_alloc);
>>> 
>>> Not sure if "memory balloon" is the right wording. hmmm.
>>> 
>>> Acked-by: David Hildenbrand 
>> 
>> Do you have a better suggestion?
> 
> Not really - that's why I ack'ed :)
> 
> However, thinking about it - what about moving the check + print to the
> caller and then using dev_warn... or sth. like simple "virtio_balloon:
> ..." ? You can then drop the warning for vmware balloon if you feel like
> not needing it.

Actually, there is already a warning that is printed by the virtio_balloon
in fill_balloon():

struct page *page = balloon_page_alloc();

if (!page) {
dev_info_ratelimited(&vb->vdev->dev,
 "Out of puff! Can't get %u 
pages\n",
 VIRTIO_BALLOON_PAGES_PER_PAGE);
/* Sleep for at least 1/5 of a second before retry. */
msleep(200);
break;
}

So are you ok with going back to v1?



Re: [PATCH v2] mm/balloon_compaction: Informative allocation warnings

2019-08-21 Thread Nadav Amit via Virtualization
> On Aug 21, 2019, at 11:57 AM, David Hildenbrand  wrote:
> 
> On 21.08.19 11:41, Nadav Amit wrote:
>> There is no reason to print generic warnings when balloon memory
>> allocation fails, as failures are expected and can be handled
>> gracefully. Since VMware balloon now uses balloon-compaction
>> infrastructure, and suppressed these warnings before, it is also
>> beneficial to suppress these warnings to keep the same behavior that the
>> balloon had before.
>> 
>> Since such warnings can still be useful to indicate that the balloon is
>> over-inflated, print a more informative and less frightening warning if
>> allocation fails instead.
>> 
>> Cc: David Hildenbrand 
>> Cc: Jason Wang 
>> Signed-off-by: Nadav Amit 
>> 
>> ---
>> 
>> v1->v2:
>>  * Print informative warnings instead suppressing [David]
>> ---
>> mm/balloon_compaction.c | 7 ++-
>> 1 file changed, 6 insertions(+), 1 deletion(-)
>> 
>> diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
>> index 798275a51887..0c1d1f7689f0 100644
>> --- a/mm/balloon_compaction.c
>> +++ b/mm/balloon_compaction.c
>> @@ -124,7 +124,12 @@ EXPORT_SYMBOL_GPL(balloon_page_list_dequeue);
>> struct page *balloon_page_alloc(void)
>> {
>>  struct page *page = alloc_page(balloon_mapping_gfp_mask() |
>> -   __GFP_NOMEMALLOC | __GFP_NORETRY);
>> +   __GFP_NOMEMALLOC | __GFP_NORETRY |
>> +   __GFP_NOWARN);
>> +
>> +if (!page)
>> +pr_warn_ratelimited("memory balloon: memory allocation failed");
>> +
>>  return page;
>> }
>> EXPORT_SYMBOL_GPL(balloon_page_alloc);
> 
> Not sure if "memory balloon" is the right wording. hmmm.
> 
> Acked-by: David Hildenbrand 

Do you have a better suggestion?



[PATCH v2] mm/balloon_compaction: Informative allocation warnings

2019-08-21 Thread Nadav Amit via Virtualization
There is no reason to print generic warnings when balloon memory
allocation fails, as failures are expected and can be handled
gracefully. Since VMware balloon now uses balloon-compaction
infrastructure, and suppressed these warnings before, it is also
beneficial to suppress these warnings to keep the same behavior that the
balloon had before.

Since such warnings can still be useful to indicate that the balloon is
over-inflated, print a more informative and less frightening warning if
allocation fails instead.

Cc: David Hildenbrand 
Cc: Jason Wang 
Signed-off-by: Nadav Amit 

---

v1->v2:
  * Print informative warnings instead suppressing [David]
---
 mm/balloon_compaction.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index 798275a51887..0c1d1f7689f0 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -124,7 +124,12 @@ EXPORT_SYMBOL_GPL(balloon_page_list_dequeue);
 struct page *balloon_page_alloc(void)
 {
struct page *page = alloc_page(balloon_mapping_gfp_mask() |
-  __GFP_NOMEMALLOC | __GFP_NORETRY);
+  __GFP_NOMEMALLOC | __GFP_NORETRY |
+  __GFP_NOWARN);
+
+   if (!page)
+   pr_warn_ratelimited("memory balloon: memory allocation failed");
+
return page;
 }
 EXPORT_SYMBOL_GPL(balloon_page_alloc);
-- 
2.17.1



Re: [PATCH] mm/balloon_compaction: suppress allocation warnings

2019-08-21 Thread Nadav Amit via Virtualization
> On Aug 21, 2019, at 9:29 AM, David Hildenbrand  wrote:
> 
> On 21.08.19 18:23, Nadav Amit wrote:
>>> On Aug 21, 2019, at 9:05 AM, David Hildenbrand  wrote:
>>> 
>>> On 20.08.19 11:16, Nadav Amit wrote:
 There is no reason to print warnings when balloon page allocation fails,
 as they are expected and can be handled gracefully.  Since VMware
 balloon now uses balloon-compaction infrastructure, and suppressed these
 warnings before, it is also beneficial to suppress these warnings to
 keep the same behavior that the balloon had before.
>>> 
>>> I am not sure if that's a good idea. The allocation warnings are usually
>>> the only trace of "the user/admin did something bad because he/she tried
>>> to inflate the balloon to an unsafe value". Believe me, I processed a
>>> couple of such bugreports related to virtio-balloon and the warning were
>>> very helpful for that.
>> 
>> Ok, so a message is needed, but does it have to be a generic frightening
>> warning?
>> 
>> How about using __GFP_NOWARN, and if allocation fails, do something like:
>> 
>>  pr_warn(“Balloon memory allocation failed”);
>> 
>> Or even something more informative? This would surely be less intimidating
>> for common users.
> 
> ratelimit would make sense :)
> 
> And yes, this would certainly be nicer.

Thanks. I will post v2 of the patch.


Re: [PATCH] mm/balloon_compaction: suppress allocation warnings

2019-08-21 Thread Nadav Amit via Virtualization
> On Aug 21, 2019, at 9:05 AM, David Hildenbrand  wrote:
> 
> On 20.08.19 11:16, Nadav Amit wrote:
>> There is no reason to print warnings when balloon page allocation fails,
>> as they are expected and can be handled gracefully.  Since VMware
>> balloon now uses balloon-compaction infrastructure, and suppressed these
>> warnings before, it is also beneficial to suppress these warnings to
>> keep the same behavior that the balloon had before.
> 
> I am not sure if that's a good idea. The allocation warnings are usually
> the only trace of "the user/admin did something bad because he/she tried
> to inflate the balloon to an unsafe value". Believe me, I processed a
> couple of such bugreports related to virtio-balloon and the warning were
> very helpful for that.

Ok, so a message is needed, but does it have to be a generic frightening
warning?

How about using __GFP_NOWARN, and if allocation fails, do something like:

  pr_warn(“Balloon memory allocation failed”);

Or even something more informative? This would surely be less intimidating
for common users.


[PATCH] mm/balloon_compaction: suppress allocation warnings

2019-08-20 Thread Nadav Amit via Virtualization
There is no reason to print warnings when balloon page allocation fails,
as they are expected and can be handled gracefully.  Since VMware
balloon now uses balloon-compaction infrastructure, and suppressed these
warnings before, it is also beneficial to suppress these warnings to
keep the same behavior that the balloon had before.

Cc: Jason Wang 
Signed-off-by: Nadav Amit 
---
 mm/balloon_compaction.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index 798275a51887..26de020aae7b 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -124,7 +124,8 @@ EXPORT_SYMBOL_GPL(balloon_page_list_dequeue);
 struct page *balloon_page_alloc(void)
 {
struct page *page = alloc_page(balloon_mapping_gfp_mask() |
-  __GFP_NOMEMALLOC | __GFP_NORETRY);
+  __GFP_NOMEMALLOC | __GFP_NORETRY |
+  __GFP_NOWARN);
return page;
 }
 EXPORT_SYMBOL_GPL(balloon_page_alloc);
-- 
2.19.1



Re: [PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-07-22 Thread Nadav Amit via Virtualization
> On Jul 22, 2019, at 12:14 PM, Peter Zijlstra  wrote:
> 
> On Thu, Jul 18, 2019 at 05:58:32PM -0700, Nadav Amit wrote:
>> @@ -709,8 +716,9 @@ void native_flush_tlb_others(const struct cpumask 
>> *cpumask,
>>   * doing a speculative memory access.
>>   */
>>  if (info->freed_tables) {
>> -smp_call_function_many(cpumask, flush_tlb_func_remote,
>> -   (void *)info, 1);
>> +__smp_call_function_many(cpumask, flush_tlb_func_remote,
>> + flush_tlb_func_local,
>> + (void *)info, 1);
>>  } else {
>>  /*
>>   * Although we could have used on_each_cpu_cond_mask(),
>> @@ -737,7 +745,8 @@ void native_flush_tlb_others(const struct cpumask 
>> *cpumask,
>>  if (tlb_is_not_lazy(cpu))
>>  __cpumask_set_cpu(cpu, cond_cpumask);
>>  }
>> -smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
>> +__smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
>> + flush_tlb_func_local,
>>   (void *)info, 1);
>>  }
>> }
> 
> Do we really need that _local/_remote distinction? ISTR you had a patch
> that frobbed flush_tlb_info into the csd and that gave space
> constraints, but I'm not seeing that here (probably a wise, get stuff
> merged etc..).
> 
> struct __call_single_data {
>struct llist_node  llist;/* 0 8 */
>smp_call_func_tfunc; /* 8 8 */
>void * info; /*16 8 */
>unsigned int   flags;/*24 4 */
> 
>/* size: 32, cachelines: 1, members: 4 */
>/* padding: 4 */
>/* last cacheline: 32 bytes */
> };
> 
> struct flush_tlb_info {
>struct mm_struct * mm;   /* 0 8 */
>long unsigned int  start;/* 8 8 */
>long unsigned int  end;  /*16 8 */
>u64new_tlb_gen;  /*24 8 */
>unsigned int   stride_shift; /*32 4 */
>bool   freed_tables; /*36 1 */
> 
>/* size: 40, cachelines: 1, members: 6 */
>/* padding: 3 */
>/* last cacheline: 40 bytes */
> };
> 
> IIRC what you did was make void *__call_single_data::info the last
> member and a union until the full cacheline size (64). Given the above
> that would get us 24 bytes for csd, leaving us 40 for that
> flush_tlb_info.
> 
> But then we can still do something like the below, which doesn't change
> things and still gets rid of that dual function crud, simplifying
> smp_call_function_many again.
> 
> Index: linux-2.6/arch/x86/include/asm/tlbflush.h
> ===
> --- linux-2.6.orig/arch/x86/include/asm/tlbflush.h
> +++ linux-2.6/arch/x86/include/asm/tlbflush.h
> @@ -546,8 +546,9 @@ struct flush_tlb_info {
>   unsigned long   start;
>   unsigned long   end;
>   u64 new_tlb_gen;
> - unsigned intstride_shift;
> - boolfreed_tables;
> + unsigned intcpu;
> + unsigned short  stride_shift;
> + unsigned char   freed_tables;
> };
> 
> #define local_flush_tlb() __flush_tlb()
> Index: linux-2.6/arch/x86/mm/tlb.c
> ===
> --- linux-2.6.orig/arch/x86/mm/tlb.c
> +++ linux-2.6/arch/x86/mm/tlb.c
> @@ -659,6 +659,27 @@ static void flush_tlb_func_remote(void *
>   flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
> }
> 
> +static void flush_tlb_func(void *info)
> +{
> + const struct flush_tlb_info *f = info;
> + enum tlb_flush_reason reason = TLB_REMOTE_SHOOTDOWN;
> + bool local = false;
> +
> + if (f->cpu == smp_processor_id()) {
> + local = true;
> + reason = (f->mm == NULL) ? TLB_LOCAL_SHOOTDOWN : 
> TLB_LOCAL_MM_SHOOTDOWN;
> + } else {
> + inc_irq_stat(irq_tlb_count);
> +
> + if (f->mm && f->mm != this_cpu_read(cpu_tlbstate.loaded_mm))
> + return;
> +
> + count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
> + }
> +
> + flush_tlb_func_common(f, local, reason);
> +}
> +
> static bool tlb_is_not_lazy(int cpu)
> {
>   return !per_cpu(cpu_tlbstate_shared.is_lazy, cpu);

Nice! I will add it on top, if you don’t mind (instead of squashing it).

The original decision to have local/remote functions was mostly to provide
the generality.

I would change the last argument of __smp_call_function_many() from “wait”
to “flags” that would indicate whether to run the function locally, 
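
Presumably something along these lines (the SCF_* names and the exact
signature are assumptions, not code from this series):

/* Sketch: replace the boolean "wait" argument with a flags word, so a
 * caller can also request that the function run on the local CPU. */
#define SCF_WAIT	(1U << 0)	/* wait for remote CPUs to finish */
#define SCF_RUN_LOCAL	(1U << 1)	/* also run func on this CPU */

static void __smp_call_function_many(const struct cpumask *mask,
				     smp_call_func_t func, void *info,
				     unsigned int flags)
{
	/* ... queue csd entries and send IPIs to the remote CPUs ... */

	if (flags & SCF_RUN_LOCAL) {
		unsigned long irqflags;

		local_irq_save(irqflags);
		func(info);
		local_irq_restore(irqflags);
	}

	if (flags & SCF_WAIT) {
		/* ... wait for every queued csd to complete ... */
	}
}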

[PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-07-18 Thread Nadav Amit via Virtualization
To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. Introduce
paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
and hyper-v are only compile-tested).

While the updated smp infrastructure is capable of running a function on
a single local core, it is not optimized for this case. The multiple
function calls and the indirect branch introduce some overhead, and
might make local TLB flushes slower than they were before the recent
changes.

Before calling the SMP infrastructure, check if only a local TLB flush
is needed to restore the lost performance in this common case. This
requires checking mm_cpumask() one more time, but unless this mask is
updated very frequently, this should not impact performance negatively.
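
In concrete terms, the fast path amounts to something like the following
sketch (illustrative only, not the literal hunk from this patch):

	/* Sketch: if no CPU other than the local one has this mm loaded,
	 * flush locally and skip the SMP machinery entirely. */
	if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) >= nr_cpu_ids) {
		local_irq_disable();
		flush_tlb_func_local(info);
		local_irq_enable();
	} else {
		flush_tlb_multi(mm_cpumask(mm), info);
	}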

Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Sasha Levin 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: Juergen Gross 
Cc: Paolo Bonzini 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Boris Ostrovsky 
Cc: linux-hyp...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: k...@vger.kernel.org
Cc: xen-de...@lists.xenproject.org
Signed-off-by: Nadav Amit 
---
 arch/x86/hyperv/mmu.c | 10 +++---
 arch/x86/include/asm/paravirt.h   |  6 ++--
 arch/x86/include/asm/paravirt_types.h |  4 +--
 arch/x86/include/asm/tlbflush.h   |  8 ++---
 arch/x86/include/asm/trace/hyperv.h   |  2 +-
 arch/x86/kernel/kvm.c | 11 +--
 arch/x86/kernel/paravirt.c|  2 +-
 arch/x86/mm/tlb.c | 47 ++-
 arch/x86/xen/mmu_pv.c | 11 +++
 include/trace/events/xen.h|  2 +-
 10 files changed, 62 insertions(+), 41 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index e65d7fe6489f..8740d8b21db3 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -50,8 +50,8 @@ static inline int fill_gva_list(u64 gva_list[], int offset,
return gva_n - offset;
 }
 
-static void hyperv_flush_tlb_others(const struct cpumask *cpus,
-   const struct flush_tlb_info *info)
+static void hyperv_flush_tlb_multi(const struct cpumask *cpus,
+  const struct flush_tlb_info *info)
 {
int cpu, vcpu, gva_n, max_gvas;
struct hv_tlb_flush **flush_pcpu;
@@ -59,7 +59,7 @@ static void hyperv_flush_tlb_others(const struct cpumask 
*cpus,
u64 status = U64_MAX;
unsigned long flags;
 
-   trace_hyperv_mmu_flush_tlb_others(cpus, info);
+   trace_hyperv_mmu_flush_tlb_multi(cpus, info);
 
if (!hv_hypercall_pg)
goto do_native;
@@ -156,7 +156,7 @@ static void hyperv_flush_tlb_others(const struct cpumask 
*cpus,
if (!(status & HV_HYPERCALL_RESULT_MASK))
return;
 do_native:
-   native_flush_tlb_others(cpus, info);
+   native_flush_tlb_multi(cpus, info);
 }
 
 static u64 hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
@@ -231,6 +231,6 @@ void hyperv_setup_mmu_ops(void)
return;
 
pr_info("Using hypercall for remote TLB flush\n");
-   pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
+   pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index dce26f1d13e1..8c6c2394393b 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -62,10 +62,10 @@ static inline void __flush_tlb_one_user(unsigned long addr)
PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
 }
 
-static inline void flush_tlb_others(const struct cpumask *cpumask,
-   const struct flush_tlb_info *info)
+static inline void flush_tlb_multi(const struct cpumask *cpumask,
+  const struct flush_tlb_info *info)
 {
-   PVOP_VCALL2(mmu.flush_tlb_others, cpumask, info);
+   PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
 }
 
 static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb, void 
*table)
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 639b2df445ee..c82969f38845 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -211,8 +211,8 @@ struct pv_mmu_ops {
void (*flush_tlb_user)(void);
void (*flush_tlb_kernel)(void);
void (*flush_tlb_one_user)(unsigned long addr);
-   void (*flush_tlb_others)(const struct cpumask *cpus,
-const struct flush_tlb_info *info);
+   void (*flush_tlb_multi)(const struct cpumask *cpus,
+   const struct flush_tlb_info *info);
 
void (*tlb_remove_table)(struct mmu_gather *tlb, void *table);
 

[PATCH v3 0/9] x86: Concurrent TLB flushes

2019-07-18 Thread Nadav Amit via Virtualization
[ Cover-letter is identical to v2, including benchmark results,
  excluding the change log. ] 

Currently, local and remote TLB flushes are not performed concurrently,
which introduces unnecessary overhead - each INVLPG can take 100s of
cycles. This patch-set allows TLB flushes to be run concurrently: first
request the remote CPUs to initiate the flush, then run it locally, and
finally wait for the remote CPUs to finish their work.

In addition, there are various small optimizations to avoid unwarranted
false-sharing and atomic operations.

The proposed changes should also improve the performance of other
invocations of on_each_cpu(). Hopefully, no one has relied on this
behavior of on_each_cpu() that invoked functions first remotely and only
then locally [Peter says he remembers someone might do so, but without
further information it is hard to know how to address it].

Running sysbench on dax/ext4 w/emulated-pmem, write-cache disabled on
2-socket, 48-logical-cores (24+SMT) Haswell-X, 5 repetitions:

 sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
  --file-io-mode=mmap --threads=X --file-fsync-mode=fdatasync run

  Th.   tip-jun28 avg (stdev)   +patch-set avg (stdev)   change
  ---   ---------------------   ----------------------   ------
   1    1267765 (14146)         1299253 (5715)           +2.4%
   2    1734644 (11936)         1799225 (19577)          +3.7%
   4    2821268 (41184)         2919132 (40149)          +3.4%
   8    4171652 (31243)         4376925 (65416)          +4.9%
  16    5590729 (24160)         5829866 (8127)           +4.2%
  24    6250212 (24481)         6522303 (28044)          +4.3%
  32    3994314 (26606)         4077543 (10685)          +2.0%
  48    4345177 (28091)         4417821 (41337)          +1.6%

(Note that on configurations with up to 24 threads numactl was used to
set all threads on socket 1, which explains the drop in performance when
going to 32 threads).

Running the same benchmark with security mitigations disabled (PTI,
Spectre, MDS):

  Th.   tip-jun28 avg (stdev)   +patch-set avg (stdev)   change
  ---   ---------------------   ----------------------   ------
   1    1598896 (5174)          1607903 (4091)           +0.5%
   2    2109472 (17827)         2224726 (4372)           +5.4%
   4    3448587 (11952)         3668551 (30219)          +6.3%
   8    5425778 (29641)         5606266 (33519)          +3.3%
  16    6931232 (34677)         7054052 (27873)          +1.7%
  24    7612473 (23482)         7783138 (13871)          +2.2%
  32    4296274 (18029)         4283279 (32323)          -0.3%
  48    4770029 (35541)         4764760 (13575)          -0.1%

Presumably, PTI requires two invalidations of each mapping, which allows
concurrency to provide higher benefits when PTI is on. At the same
time, when mitigations are on, other overheads reduce the potential
speedup.

I tried to reduce the size of the code of the main patch, which required
restructuring of the series.

v2 -> v3:
* Open-code the remote/local-flush decision code [Andy]
* Fix hyper-v, Xen implementations [Andrew]
* Fix redundant TLB flushes.

v1 -> v2:
* Removing the patches that Thomas took [tglx]
* Adding hyper-v, Xen compile-tested implementations [Dave]
* Removing UV [Andy]
* Adding lazy optimization, removing inline keyword [Dave]
* Restructuring patch-set

RFCv2 -> v1:
* Fix comment on flush_tlb_multi [Juergen]
* Removing async invalidation optimizations [Andy]
* Adding KVM support [Paolo]

Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Boris Ostrovsky 
Cc: Dave Hansen 
Cc: Haiyang Zhang 
Cc: Ingo Molnar 
Cc: Josh Poimboeuf 
Cc: Juergen Gross 
Cc: "K. Y. Srinivasan" 
Cc: Paolo Bonzini 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Sasha Levin 
Cc: Stephen Hemminger 
Cc: Thomas Gleixner 
Cc: k...@vger.kernel.org
Cc: linux-hyp...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: x...@kernel.org
Cc: xen-de...@lists.xenproject.org

Nadav Amit (9):
  smp: Run functions concurrently in smp_call_function_many()
  x86/mm/tlb: Remove reason as argument for flush_tlb_func_local()
  x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
  x86/mm/tlb: Flush remote and local TLBs concurrently
  x86/mm/tlb: Privatize cpu_tlbstate
  x86/mm/tlb: Do not make is_lazy dirty for no reason
  cpumask: Mark functions as pure
  x86/mm/tlb: Remove UV special case
  x86/mm/tlb: Remove unnecessary uses of the inline keyword

 arch/x86/hyperv/mmu.c |  10 +-
 arch/x86/include/asm/paravirt.h   |   6 +-
 arch/x86/include/asm/paravirt_types.h |   4 +-
 arch/x86/include/asm/tlbflush.h   |  47 -
 arch/x86/include/asm/trace/hyperv.h   |   2 +-
 arch/x86/kernel/kvm.c |  11 ++-
 arch/x86/kernel/paravirt.c|   2 +-
 arch/x86/mm/init.c|   2 +-
 arch/x86/mm/tlb.c | 133 --
 arch/x86/xen/mmu_pv.c |  11 +--
 include/linux/cpumask.h

Re: [PATCH v2] x86/paravirt: Drop {read,write}_cr8() hooks

2019-07-15 Thread Nadav Amit via Virtualization
> On Jul 15, 2019, at 4:30 PM, Andrew Cooper  wrote:
> 
> On 15/07/2019 19:17, Nadav Amit wrote:
>>> On Jul 15, 2019, at 8:16 AM, Andrew Cooper  
>>> wrote:
>>> 
>>> There is a lot of infrastructure for functionality which is used
>>> exclusively in __{save,restore}_processor_state() on the suspend/resume
>>> path.
>>> 
>>> cr8 is an alias of APIC_TASKPRI, and APIC_TASKPRI is saved/restored by
>>> lapic_{suspend,resume}().  Saving and restoring cr8 independently of the
>>> rest of the Local APIC state isn't a clever thing to be doing.
>>> 
>>> Delete the suspend/resume cr8 handling, which shrinks the size of struct
>>> saved_context, and allows for the removal of both PVOPS.
>> I think removing the interface for CR8 writes is also good to avoid
>> potential correctness issues, as the SDM says (10.8.6.1 "Interaction of Task
>> Priorities between CR8 and APIC”):
>> 
>> "Operating software should implement either direct APIC TPR updates or CR8
>> style TPR updates but not mix them. Software can use a serializing
>> instruction (for example, CPUID) to serialize updates between MOV CR8 and
>> stores to the APIC.”
>> 
>> And native_write_cr8() did not even issue a serializing instruction.
> 
> Given its location, the one write_cr8() is bounded by two serialising
> operations, so is safe in practice.

That’s what the “potential” in "potential correctness issues” means :)

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH v2] x86/paravirt: Drop {read,write}_cr8() hooks

2019-07-15 Thread Nadav Amit via Virtualization
> On Jul 15, 2019, at 8:16 AM, Andrew Cooper  wrote:
> 
> There is a lot of infrastructure for functionality which is used
> exclusively in __{save,restore}_processor_state() on the suspend/resume
> path.
> 
> cr8 is an alias of APIC_TASKPRI, and APIC_TASKPRI is saved/restored by
> lapic_{suspend,resume}().  Saving and restoring cr8 independently of the
> rest of the Local APIC state isn't a clever thing to be doing.
> 
> Delete the suspend/resume cr8 handling, which shrinks the size of struct
> saved_context, and allows for the removal of both PVOPS.

I think removing the interface for CR8 writes is also good to avoid
potential correctness issues, as the SDM says (10.8.6.1 "Interaction of Task
Priorities between CR8 and APIC”):

"Operating software should implement either direct APIC TPR updates or CR8
style TPR updates but not mix them. Software can use a serializing
instruction (for example, CPUID) to serialize updates between MOV CR8 and
stores to the APIC.”

And native_write_cr8() did not even issue a serializing instruction.
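
For illustration only, a serialized variant would have to look roughly like
this (hypothetical code; the removed hook was a plain MOV to CR8):

static inline void write_cr8_serialized(unsigned long val)
{
	asm volatile("mov %0, %%cr8" : : "r" (val) : "memory");
	sync_core();	/* CPUID (or equivalent) serializes the update */
}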

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [Xen-devel] [PATCH v2 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-07-03 Thread Nadav Amit via Virtualization
> On Jul 3, 2019, at 10:43 AM, Andrew Cooper  wrote:
> 
> On 03/07/2019 18:02, Nadav Amit wrote:
>>> On Jul 3, 2019, at 7:04 AM, Juergen Gross  wrote:
>>> 
>>> On 03.07.19 01:51, Nadav Amit wrote:
 To improve TLB shootdown performance, flush the remote and local TLBs
 concurrently. Introduce flush_tlb_multi() that does so. Introduce
 paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
 and hyper-v are only compile-tested).
 While the updated smp infrastructure is capable of running a function on
 a single local core, it is not optimized for this case. The multiple
 function calls and the indirect branch introduce some overhead, and
 might make local TLB flushes slower than they were before the recent
 changes.
 Before calling the SMP infrastructure, check if only a local TLB flush
 is needed to restore the lost performance in this common case. This
requires checking mm_cpumask() one more time, but unless this mask is
updated very frequently, this should not impact performance negatively.
 Cc: "K. Y. Srinivasan" 
 Cc: Haiyang Zhang 
 Cc: Stephen Hemminger 
 Cc: Sasha Levin 
 Cc: Thomas Gleixner 
 Cc: Ingo Molnar 
 Cc: Borislav Petkov 
 Cc: x...@kernel.org
 Cc: Juergen Gross 
 Cc: Paolo Bonzini 
 Cc: Dave Hansen 
 Cc: Andy Lutomirski 
 Cc: Peter Zijlstra 
 Cc: Boris Ostrovsky 
 Cc: linux-hyp...@vger.kernel.org
 Cc: linux-ker...@vger.kernel.org
 Cc: virtualization@lists.linux-foundation.org
 Cc: k...@vger.kernel.org
 Cc: xen-de...@lists.xenproject.org
 Signed-off-by: Nadav Amit 
 ---
 arch/x86/hyperv/mmu.c | 13 +++---
 arch/x86/include/asm/paravirt.h   |  6 +--
 arch/x86/include/asm/paravirt_types.h |  4 +-
 arch/x86/include/asm/tlbflush.h   |  9 ++--
 arch/x86/include/asm/trace/hyperv.h   |  2 +-
 arch/x86/kernel/kvm.c | 11 +++--
 arch/x86/kernel/paravirt.c|  2 +-
 arch/x86/mm/tlb.c | 65 ---
 arch/x86/xen/mmu_pv.c | 20 ++---
 include/trace/events/xen.h|  2 +-
 10 files changed, 91 insertions(+), 43 deletions(-)
>>> ...
>>> 
 diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
 index beb44e22afdf..19e481e6e904 100644
 --- a/arch/x86/xen/mmu_pv.c
 +++ b/arch/x86/xen/mmu_pv.c
 @@ -1355,8 +1355,8 @@ static void xen_flush_tlb_one_user(unsigned long 
 addr)
preempt_enable();
 }
 -static void xen_flush_tlb_others(const struct cpumask *cpus,
 -   const struct flush_tlb_info *info)
 +static void xen_flush_tlb_multi(const struct cpumask *cpus,
 +  const struct flush_tlb_info *info)
 {
struct {
struct mmuext_op op;
 @@ -1366,7 +1366,7 @@ static void xen_flush_tlb_others(const struct 
 cpumask *cpus,
const size_t mc_entry_size = sizeof(args->op) +
sizeof(args->mask[0]) * BITS_TO_LONGS(num_possible_cpus());
 -  trace_xen_mmu_flush_tlb_others(cpus, info->mm, info->start, info->end);
 +  trace_xen_mmu_flush_tlb_multi(cpus, info->mm, info->start, info->end);
if (cpumask_empty(cpus))
return; /* nothing to do */
 @@ -1375,9 +1375,17 @@ static void xen_flush_tlb_others(const struct 
 cpumask *cpus,
args = mcs.args;
args->op.arg2.vcpumask = to_cpumask(args->mask);
 -  /* Remove us, and any offline CPUS. */
 +  /* Flush locally if needed and remove us */
 +  if (cpumask_test_cpu(smp_processor_id(), to_cpumask(args->mask))) {
 +  local_irq_disable();
 +  flush_tlb_func_local(info);
>>> I think this isn't the correct function for PV guests.
>>> 
>>> In fact it should be much easier: just don't clear the own cpu from the
>>> mask, that's all what's needed. The hypervisor is just fine having the
>>> current cpu in the mask and it will do the right thing.
>> Thanks. I will do so in v3. I don’t think Hyper-V people would want to do
>> the same, unfortunately, since it would induce VM-exit on TLB flushes.
> 
> Why do you believe the vmexit matters?  You're talking one anyway for
> the IPI.
> 
> Intel only have virtualised self-IPI, and while AMD do have working
> non-self IPIs, you still take a vmexit anyway if any destination vcpu
> isn't currently running in non-root mode (IIRC).
> 
> At that point, you might as well have the hypervisor do all the hard
> work via a multi-cpu shootdown/flush hypercall, rather than trying to
> arrange it locally.

I forgot that xen_flush_tlb_multi() should actually only be called when
there are some remote CPUs (as I optimized the case in which there is only a
single local CPU that needs to be flushed), so you are right.

___
Virtualization mailing list

Re: [PATCH v2 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-07-03 Thread Nadav Amit via Virtualization
> On Jul 3, 2019, at 7:04 AM, Juergen Gross  wrote:
> 
> On 03.07.19 01:51, Nadav Amit wrote:
>> To improve TLB shootdown performance, flush the remote and local TLBs
>> concurrently. Introduce flush_tlb_multi() that does so. Introduce
>> paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
>> and hyper-v are only compile-tested).
>> While the updated smp infrastructure is capable of running a function on
>> a single local core, it is not optimized for this case. The multiple
>> function calls and the indirect branch introduce some overhead, and
>> might make local TLB flushes slower than they were before the recent
>> changes.
>> Before calling the SMP infrastructure, check if only a local TLB flush
>> is needed to restore the lost performance in this common case. This
>> requires checking mm_cpumask() one more time, but unless this mask is
>> updated very frequently, this should not impact performance negatively.
>> Cc: "K. Y. Srinivasan" 
>> Cc: Haiyang Zhang 
>> Cc: Stephen Hemminger 
>> Cc: Sasha Levin 
>> Cc: Thomas Gleixner 
>> Cc: Ingo Molnar 
>> Cc: Borislav Petkov 
>> Cc: x...@kernel.org
>> Cc: Juergen Gross 
>> Cc: Paolo Bonzini 
>> Cc: Dave Hansen 
>> Cc: Andy Lutomirski 
>> Cc: Peter Zijlstra 
>> Cc: Boris Ostrovsky 
>> Cc: linux-hyp...@vger.kernel.org
>> Cc: linux-ker...@vger.kernel.org
>> Cc: virtualization@lists.linux-foundation.org
>> Cc: k...@vger.kernel.org
>> Cc: xen-de...@lists.xenproject.org
>> Signed-off-by: Nadav Amit 
>> ---
>>  arch/x86/hyperv/mmu.c | 13 +++---
>>  arch/x86/include/asm/paravirt.h   |  6 +--
>>  arch/x86/include/asm/paravirt_types.h |  4 +-
>>  arch/x86/include/asm/tlbflush.h   |  9 ++--
>>  arch/x86/include/asm/trace/hyperv.h   |  2 +-
>>  arch/x86/kernel/kvm.c | 11 +++--
>>  arch/x86/kernel/paravirt.c|  2 +-
>>  arch/x86/mm/tlb.c | 65 ---
>>  arch/x86/xen/mmu_pv.c | 20 ++---
>>  include/trace/events/xen.h|  2 +-
>>  10 files changed, 91 insertions(+), 43 deletions(-)
> 
> ...
> 
>> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
>> index beb44e22afdf..19e481e6e904 100644
>> --- a/arch/x86/xen/mmu_pv.c
>> +++ b/arch/x86/xen/mmu_pv.c
>> @@ -1355,8 +1355,8 @@ static void xen_flush_tlb_one_user(unsigned long addr)
>>  preempt_enable();
>>  }
>>  -static void xen_flush_tlb_others(const struct cpumask *cpus,
>> - const struct flush_tlb_info *info)
>> +static void xen_flush_tlb_multi(const struct cpumask *cpus,
>> +const struct flush_tlb_info *info)
>>  {
>>  struct {
>>  struct mmuext_op op;
>> @@ -1366,7 +1366,7 @@ static void xen_flush_tlb_others(const struct cpumask 
>> *cpus,
>>  const size_t mc_entry_size = sizeof(args->op) +
>>  sizeof(args->mask[0]) * BITS_TO_LONGS(num_possible_cpus());
>>  -   trace_xen_mmu_flush_tlb_others(cpus, info->mm, info->start, info->end);
>> +trace_xen_mmu_flush_tlb_multi(cpus, info->mm, info->start, info->end);
>>  if (cpumask_empty(cpus))
>>  return; /* nothing to do */
>> @@ -1375,9 +1375,17 @@ static void xen_flush_tlb_others(const struct cpumask 
>> *cpus,
>>  args = mcs.args;
>>  args->op.arg2.vcpumask = to_cpumask(args->mask);
>>  -   /* Remove us, and any offline CPUS. */
>> +/* Flush locally if needed and remove us */
>> +if (cpumask_test_cpu(smp_processor_id(), to_cpumask(args->mask))) {
>> +local_irq_disable();
>> +flush_tlb_func_local(info);
> 
> I think this isn't the correct function for PV guests.
> 
> In fact it should be much easier: just don't clear the own cpu from the
> mask, that's all what's needed. The hypervisor is just fine having the
> current cpu in the mask and it will do the right thing.

Thanks. I will do so in v3. I don’t think Hyper-V people would want to do
the same, unfortunately, since it would induce a VM-exit on TLB flushes. But
if they do, I won’t need to expose flush_tlb_func_local().
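
A rough sketch of what Juergen’s suggestion means for the Xen path
(illustrative; not the actual v3 hunk):

	/* Sketch: keep the current CPU in the vcpumask and let the
	 * hypervisor flush it along with the remote vCPUs; only offline
	 * CPUs are dropped. */
	cpumask_and(to_cpumask(args->mask), cpus, cpu_online_mask);

	if (info->end == TLB_FLUSH_ALL) {
		args->op.cmd = MMUEXT_TLB_FLUSH_MULTI;
	} else {
		args->op.cmd = MMUEXT_INVLPG_MULTI;
		args->op.arg1.linear_addr = info->start;
	}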

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v2 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-07-03 Thread Nadav Amit via Virtualization
To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. Introduce
paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
and hyper-v are only compile-tested).

While the updated smp infrastructure is capable of running a function on
a single local core, it is not optimized for this case. The multiple
function calls and the indirect branch introduce some overhead, and
might make local TLB flushes slower than they were before the recent
changes.

Before calling the SMP infrastructure, check if only a local TLB flush
is needed to restore the lost performance in this common case. This
requires checking mm_cpumask() one more time, but unless this mask is
updated very frequently, this should not impact performance negatively.

Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Sasha Levin 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: Juergen Gross 
Cc: Paolo Bonzini 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Boris Ostrovsky 
Cc: linux-hyp...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: k...@vger.kernel.org
Cc: xen-de...@lists.xenproject.org
Signed-off-by: Nadav Amit 
---
 arch/x86/hyperv/mmu.c | 13 +++---
 arch/x86/include/asm/paravirt.h   |  6 +--
 arch/x86/include/asm/paravirt_types.h |  4 +-
 arch/x86/include/asm/tlbflush.h   |  9 ++--
 arch/x86/include/asm/trace/hyperv.h   |  2 +-
 arch/x86/kernel/kvm.c | 11 +++--
 arch/x86/kernel/paravirt.c|  2 +-
 arch/x86/mm/tlb.c | 65 ---
 arch/x86/xen/mmu_pv.c | 20 ++---
 include/trace/events/xen.h|  2 +-
 10 files changed, 91 insertions(+), 43 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index e65d7fe6489f..1177f863e4cd 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -50,8 +50,8 @@ static inline int fill_gva_list(u64 gva_list[], int offset,
return gva_n - offset;
 }
 
-static void hyperv_flush_tlb_others(const struct cpumask *cpus,
-   const struct flush_tlb_info *info)
+static void hyperv_flush_tlb_multi(const struct cpumask *cpus,
+  const struct flush_tlb_info *info)
 {
int cpu, vcpu, gva_n, max_gvas;
struct hv_tlb_flush **flush_pcpu;
@@ -59,7 +59,7 @@ static void hyperv_flush_tlb_others(const struct cpumask 
*cpus,
u64 status = U64_MAX;
unsigned long flags;
 
-   trace_hyperv_mmu_flush_tlb_others(cpus, info);
+   trace_hyperv_mmu_flush_tlb_multi(cpus, info);
 
if (!hv_hypercall_pg)
goto do_native;
@@ -69,6 +69,9 @@ static void hyperv_flush_tlb_others(const struct cpumask 
*cpus,
 
local_irq_save(flags);
 
+   if (cpumask_test_cpu(smp_processor_id(), cpus))
+   flush_tlb_func_local(info);
+
flush_pcpu = (struct hv_tlb_flush **)
 this_cpu_ptr(hyperv_pcpu_input_arg);
 
@@ -156,7 +159,7 @@ static void hyperv_flush_tlb_others(const struct cpumask 
*cpus,
if (!(status & HV_HYPERCALL_RESULT_MASK))
return;
 do_native:
-   native_flush_tlb_others(cpus, info);
+   native_flush_tlb_multi(cpus, info);
 }
 
 static u64 hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
@@ -231,6 +234,6 @@ void hyperv_setup_mmu_ops(void)
return;
 
pr_info("Using hypercall for remote TLB flush\n");
-   pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
+   pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index c25c38a05c1c..316959e89258 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -62,10 +62,10 @@ static inline void __flush_tlb_one_user(unsigned long addr)
PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
 }
 
-static inline void flush_tlb_others(const struct cpumask *cpumask,
-   const struct flush_tlb_info *info)
+static inline void flush_tlb_multi(const struct cpumask *cpumask,
+  const struct flush_tlb_info *info)
 {
-   PVOP_VCALL2(mmu.flush_tlb_others, cpumask, info);
+   PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
 }
 
 static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb, void 
*table)
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 946f8f1f1efc..54f4c718b5b0 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -211,8 +211,8 @@ struct pv_mmu_ops {
void (*flush_tlb_user)(void);
void (*flush_tlb_kernel)(void);
void (*flush_tlb_one_user)(unsigned long addr);
-   

[PATCH v2 0/9] x86: Concurrent TLB flushes

2019-07-03 Thread Nadav Amit via Virtualization
Currently, local and remote TLB flushes are not performed concurrently,
which introduces unnecessary overhead - each INVLPG can take 100s of
cycles. This patch-set allows TLB flushes to be run concurrently: first
request the remote CPUs to initiate the flush, then run it locally, and
finally wait for the remote CPUs to finish their work.

In addition, there are various small optimizations to avoid unwarranted
false-sharing and atomic operations.

The proposed changes should also improve the performance of other
invocations of on_each_cpu(). Hopefully, no one has relied on this
behavior of on_each_cpu() that invoked functions first remotely and only
then locally [Peter says he remembers someone might do so, but without
further information it is hard to know how to address it].

Running sysbench on dax/ext4 w/emulated-pmem, write-cache disabled on
2-socket, 48-logical-cores (24+SMT) Haswell-X, 5 repetitions:

 sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
  --file-io-mode=mmap --threads=X --file-fsync-mode=fdatasync run

  Th.   tip-jun28 avg (stdev)   +patch-set avg (stdev)   change
  ---   ---------------------   ----------------------   ------
   1    1267765 (14146)         1299253 (5715)           +2.4%
   2    1734644 (11936)         1799225 (19577)          +3.7%
   4    2821268 (41184)         2919132 (40149)          +3.4%
   8    4171652 (31243)         4376925 (65416)          +4.9%
  16    5590729 (24160)         5829866 (8127)           +4.2%
  24    6250212 (24481)         6522303 (28044)          +4.3%
  32    3994314 (26606)         4077543 (10685)          +2.0%
  48    4345177 (28091)         4417821 (41337)          +1.6%

(Note that on configurations with up to 24 threads numactl was used to
set all threads on socket 1, which explains the drop in performance when
going to 32 threads).

Running the same benchmark with security mitigations disabled (PTI,
Spectre, MDS):

  Th.   tip-jun28 avg (stdev)   +patch-set avg (stdev)   change
  ---   ---------------------   ----------------------   ------
   1    1598896 (5174)          1607903 (4091)           +0.5%
   2    2109472 (17827)         2224726 (4372)           +5.4%
   4    3448587 (11952)         3668551 (30219)          +6.3%
   8    5425778 (29641)         5606266 (33519)          +3.3%
  16    6931232 (34677)         7054052 (27873)          +1.7%
  24    7612473 (23482)         7783138 (13871)          +2.2%
  32    4296274 (18029)         4283279 (32323)          -0.3%
  48    4770029 (35541)         4764760 (13575)          -0.1%

Presumably, PTI requires two invalidations of each mapping, which allows
concurrency to provide higher benefits when PTI is on. At the same
time, when mitigations are on, other overheads reduce the potential
speedup.

I tried to reduce the size of the code of the main patch, which required
restructuring of the series.

v1 -> v2:
* Removing the patches that Thomas took [tglx]
* Adding hyper-v, Xen compile-tested implementations [Dave]
* Removing UV [Andy]
* Adding lazy optimization, removing inline keyword [Dave]
* Restructuring patch-set

RFCv2 -> v1:
* Fix comment on flush_tlb_multi [Juergen]
* Removing async invalidation optimizations [Andy]
* Adding KVM support [Paolo]

Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Boris Ostrovsky 
Cc: Dave Hansen 
Cc: Haiyang Zhang 
Cc: Ingo Molnar 
Cc: Josh Poimboeuf 
Cc: Juergen Gross 
Cc: "K. Y. Srinivasan" 
Cc: Paolo Bonzini 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Sasha Levin 
Cc: Stephen Hemminger 
Cc: Thomas Gleixner 
Cc: k...@vger.kernel.org
Cc: linux-hyp...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: x...@kernel.org
Cc: xen-de...@lists.xenproject.org

Nadav Amit (9):
  smp: Run functions concurrently in smp_call_function_many()
  x86/mm/tlb: Remove reason as argument for flush_tlb_func_local()
  x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
  x86/mm/tlb: Flush remote and local TLBs concurrently
  x86/mm/tlb: Privatize cpu_tlbstate
  x86/mm/tlb: Do not make is_lazy dirty for no reason
  cpumask: Mark functions as pure
  x86/mm/tlb: Remove UV special case
  x86/mm/tlb: Remove unnecessary uses of the inline keyword

 arch/x86/hyperv/mmu.c |  13 ++-
 arch/x86/include/asm/paravirt.h   |   6 +-
 arch/x86/include/asm/paravirt_types.h |   4 +-
 arch/x86/include/asm/tlbflush.h   |  48 +
 arch/x86/include/asm/trace/hyperv.h   |   2 +-
 arch/x86/kernel/kvm.c |  11 +-
 arch/x86/kernel/paravirt.c|   2 +-
 arch/x86/mm/init.c|   2 +-
 arch/x86/mm/tlb.c | 147 --
 arch/x86/xen/mmu_pv.c |  20 ++--
 include/linux/cpumask.h   |   6 +-
 include/linux/smp.h   |  27 +++--
 include/trace/events/xen.h|   2 +-
 kernel/smp.c  | 133 +++
 14 files changed, 245 insertions(+), 178 deletions(-)

Re: [PATCH 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-06-25 Thread Nadav Amit via Virtualization
> On Jun 25, 2019, at 8:36 PM, Andy Lutomirski  wrote:
> 
> On Wed, Jun 12, 2019 at 11:49 PM Nadav Amit  wrote:
>> To improve TLB shootdown performance, flush the remote and local TLBs
>> concurrently. Introduce flush_tlb_multi() that does so. The current
>> flush_tlb_others() interface is kept, since paravirtual interfaces need
>> to be adapted first before it can be removed. This is left for future
>> work. In such PV environments, TLB flushes are not performed, at this
>> time, concurrently.
> 
> Would it be straightforward to have a default PV flush_tlb_multi()
> that uses flush_tlb_others() under the hood?

I prefer not to have a default PV implementation that should anyhow go away.

I can create unoptimized untested versions for Xen and Hyper-V, if you want.
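
For reference, such a default could presumably look like this (a sketch of
the suggestion, not code from the series; it assumes flush_tlb_others()
implementations ignore the local CPU, as the native one does):

static void paravirt_flush_tlb_multi(const struct cpumask *cpumask,
				     const struct flush_tlb_info *info)
{
	/* Sketch: flush locally if needed, then let the existing
	 * flush_tlb_others() hook handle the remote CPUs. */
	if (cpumask_test_cpu(smp_processor_id(), cpumask)) {
		unsigned long flags;

		local_irq_save(flags);
		flush_tlb_func_local(info);
		local_irq_restore(flags);
	}

	flush_tlb_others(cpumask, info);
}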

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-06-25 Thread Nadav Amit via Virtualization
> On Jun 25, 2019, at 8:00 PM, Dave Hansen  wrote:
> 
> On 6/25/19 7:35 PM, Nadav Amit wrote:
 const struct flush_tlb_info *f = info;
 +  enum tlb_flush_reason reason;
 +
 +  reason = (f->mm == NULL) ? TLB_LOCAL_SHOOTDOWN : TLB_LOCAL_MM_SHOOTDOWN;
>>> 
>>> Should we just add the "reason" to flush_tlb_info?  It's OK-ish to imply
>>> it like this, but seems like it would be nicer and easier to track down
>>> the origins of these things if we did this at the caller.
>> 
>> I prefer not to. I want later to inline flush_tlb_info into the same
>> cacheline that holds call_function_data. Increasing the size of
>> flush_tlb_info for no good reason will not help…
> 
> Well, flush_tlb_info is at 6/8ths of a cacheline at the moment.
> call_function_data is 3/8ths.  To me, that means we have some slack in
> the size.

I do not understand your math.. :(

6 + 3 > 8 so putting both flush_tlb_info and call_function_data does not
leave us any slack (we can save one qword, so we can actually put them
at the same cacheline).
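
For reference, the arithmetic being discussed (sizes expressed as eighths of
a 64-byte cacheline, as above):

	/* flush_tlb_info      ~ 6/8 of a 64-byte line
	 * call_function_data  ~ 3/8 of a 64-byte line
	 * 6/8 + 3/8 = 9/8 > one line, so they do not fit together as-is;
	 * saving one qword (1/8 of a line) brings the total down to
	 * exactly one 64-byte cacheline. */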

You can see my current implementation here:

https://lore.kernel.org/lkml/20190531063645.4697-4-na...@vmware.com/T/#m0ab5fe0799ba9ff0d41197f1095679fe26aebd57
https://lore.kernel.org/lkml/20190531063645.4697-4-na...@vmware.com/T/#m7b35a93dffd23fbb7ca813c795a0777d4cdcb51b

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-06-25 Thread Nadav Amit via Virtualization
> On Jun 25, 2019, at 2:29 PM, Dave Hansen  wrote:
> 
> On 6/12/19 11:48 PM, Nadav Amit wrote:
>> To improve TLB shootdown performance, flush the remote and local TLBs
>> concurrently. Introduce flush_tlb_multi() that does so. The current
>> flush_tlb_others() interface is kept, since paravirtual interfaces need
>> to be adapted first before it can be removed. This is left for future
>> work. In such PV environments, TLB flushes are not performed, at this
>> time, concurrently.
>> 
>> Add a static key to tell whether this new interface is supported.
>> 
>> Cc: "K. Y. Srinivasan" 
>> Cc: Haiyang Zhang 
>> Cc: Stephen Hemminger 
>> Cc: Sasha Levin 
>> Cc: Thomas Gleixner 
>> Cc: Ingo Molnar 
>> Cc: Borislav Petkov 
>> Cc: x...@kernel.org
>> Cc: Juergen Gross 
>> Cc: Paolo Bonzini 
>> Cc: Dave Hansen 
>> Cc: Andy Lutomirski 
>> Cc: Peter Zijlstra 
>> Cc: Boris Ostrovsky 
>> Cc: linux-hyp...@vger.kernel.org
>> Cc: linux-ker...@vger.kernel.org
>> Cc: virtualization@lists.linux-foundation.org
>> Cc: k...@vger.kernel.org
>> Cc: xen-de...@lists.xenproject.org
>> Signed-off-by: Nadav Amit 
>> ---
>> arch/x86/hyperv/mmu.c |  2 +
>> arch/x86/include/asm/paravirt.h   |  8 +++
>> arch/x86/include/asm/paravirt_types.h |  6 +++
>> arch/x86/include/asm/tlbflush.h   |  6 +++
>> arch/x86/kernel/kvm.c |  1 +
>> arch/x86/kernel/paravirt.c|  3 ++
>> arch/x86/mm/tlb.c | 71 ++-
>> arch/x86/xen/mmu_pv.c |  2 +
>> 8 files changed, 87 insertions(+), 12 deletions(-)
>> 
>> diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
>> index e65d7fe6489f..ca28b400c87c 100644
>> --- a/arch/x86/hyperv/mmu.c
>> +++ b/arch/x86/hyperv/mmu.c
>> @@ -233,4 +233,6 @@ void hyperv_setup_mmu_ops(void)
>>  pr_info("Using hypercall for remote TLB flush\n");
>>  pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
>>  pv_ops.mmu.tlb_remove_table = tlb_remove_table;
>> +
>> +static_key_disable(&flush_tlb_multi_enabled.key);
>> }
>> diff --git a/arch/x86/include/asm/paravirt.h 
>> b/arch/x86/include/asm/paravirt.h
>> index c25c38a05c1c..192be7254457 100644
>> --- a/arch/x86/include/asm/paravirt.h
>> +++ b/arch/x86/include/asm/paravirt.h
>> @@ -47,6 +47,8 @@ static inline void slow_down_io(void)
>> #endif
>> }
>> 
>> +DECLARE_STATIC_KEY_TRUE(flush_tlb_multi_enabled);
>> +
>> static inline void __flush_tlb(void)
>> {
>>  PVOP_VCALL0(mmu.flush_tlb_user);
>> @@ -62,6 +64,12 @@ static inline void __flush_tlb_one_user(unsigned long 
>> addr)
>>  PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
>> }
>> 
>> +static inline void flush_tlb_multi(const struct cpumask *cpumask,
>> +   const struct flush_tlb_info *info)
>> +{
>> +PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
>> +}
>> +
>> static inline void flush_tlb_others(const struct cpumask *cpumask,
>>  const struct flush_tlb_info *info)
>> {
>> diff --git a/arch/x86/include/asm/paravirt_types.h 
>> b/arch/x86/include/asm/paravirt_types.h
>> index 946f8f1f1efc..b93b3d90729a 100644
>> --- a/arch/x86/include/asm/paravirt_types.h
>> +++ b/arch/x86/include/asm/paravirt_types.h
>> @@ -211,6 +211,12 @@ struct pv_mmu_ops {
>>  void (*flush_tlb_user)(void);
>>  void (*flush_tlb_kernel)(void);
>>  void (*flush_tlb_one_user)(unsigned long addr);
>> +/*
>> + * flush_tlb_multi() is the preferred interface, which is capable to
>> + * flush both local and remote CPUs.
>> + */
>> +void (*flush_tlb_multi)(const struct cpumask *cpus,
>> +const struct flush_tlb_info *info);
>>  void (*flush_tlb_others)(const struct cpumask *cpus,
>>   const struct flush_tlb_info *info);
>> 
>> diff --git a/arch/x86/include/asm/tlbflush.h 
>> b/arch/x86/include/asm/tlbflush.h
>> index dee375831962..79272938cf79 100644
>> --- a/arch/x86/include/asm/tlbflush.h
>> +++ b/arch/x86/include/asm/tlbflush.h
>> @@ -569,6 +569,9 @@ static inline void flush_tlb_page(struct vm_area_struct 
>> *vma, unsigned long a)
>>  flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
>> }
>> 
>> +void native_flush_tlb_multi(const struct cpumask *cpumask,
>> + const struct flush_tlb_info *info);
>> +
>> void native_flush_tlb_others(const struct cpumask *cpumask,
>>   const struct flush_tlb_info *info);
>> 
>> @@ -593,6 +596,9 @@ static inline void arch_tlbbatch_add_mm(struct 
>> arch_tlbflush_unmap_batch *batch,
>> extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
>> 
>> #ifndef CONFIG_PARAVIRT
>> +#define flush_tlb_multi(mask, info) \
>> +native_flush_tlb_multi(mask, info)
>> +
>> #define flush_tlb_others(mask, info) \
>>  native_flush_tlb_others(mask, info)
>> 
>> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
>> index 5169b8cc35bb..00d81e898717 100644
>> --- 

[PATCH 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-06-13 Thread Nadav Amit via Virtualization
To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. The current
flush_tlb_others() interface is kept, since paravirtual interfaces need
to be adapted first before it can be removed. This is left for future
work. In such PV environments, TLB flushes are not performed, at this
time, concurrently.

Add a static key to tell whether this new interface is supported.

Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Sasha Levin 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: Juergen Gross 
Cc: Paolo Bonzini 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Boris Ostrovsky 
Cc: linux-hyp...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: k...@vger.kernel.org
Cc: xen-de...@lists.xenproject.org
Signed-off-by: Nadav Amit 
---
 arch/x86/hyperv/mmu.c |  2 +
 arch/x86/include/asm/paravirt.h   |  8 +++
 arch/x86/include/asm/paravirt_types.h |  6 +++
 arch/x86/include/asm/tlbflush.h   |  6 +++
 arch/x86/kernel/kvm.c |  1 +
 arch/x86/kernel/paravirt.c|  3 ++
 arch/x86/mm/tlb.c | 71 ++-
 arch/x86/xen/mmu_pv.c |  2 +
 8 files changed, 87 insertions(+), 12 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index e65d7fe6489f..ca28b400c87c 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -233,4 +233,6 @@ void hyperv_setup_mmu_ops(void)
pr_info("Using hypercall for remote TLB flush\n");
pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
+
+   static_key_disable(&flush_tlb_multi_enabled.key);
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index c25c38a05c1c..192be7254457 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -47,6 +47,8 @@ static inline void slow_down_io(void)
 #endif
 }
 
+DECLARE_STATIC_KEY_TRUE(flush_tlb_multi_enabled);
+
 static inline void __flush_tlb(void)
 {
PVOP_VCALL0(mmu.flush_tlb_user);
@@ -62,6 +64,12 @@ static inline void __flush_tlb_one_user(unsigned long addr)
PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
 }
 
+static inline void flush_tlb_multi(const struct cpumask *cpumask,
+  const struct flush_tlb_info *info)
+{
+   PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
+}
+
 static inline void flush_tlb_others(const struct cpumask *cpumask,
const struct flush_tlb_info *info)
 {
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 946f8f1f1efc..b93b3d90729a 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -211,6 +211,12 @@ struct pv_mmu_ops {
void (*flush_tlb_user)(void);
void (*flush_tlb_kernel)(void);
void (*flush_tlb_one_user)(unsigned long addr);
+   /*
+* flush_tlb_multi() is the preferred interface, which is capable to
+* flush both local and remote CPUs.
+*/
+   void (*flush_tlb_multi)(const struct cpumask *cpus,
+   const struct flush_tlb_info *info);
void (*flush_tlb_others)(const struct cpumask *cpus,
 const struct flush_tlb_info *info);
 
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index dee375831962..79272938cf79 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -569,6 +569,9 @@ static inline void flush_tlb_page(struct vm_area_struct 
*vma, unsigned long a)
flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
 
+void native_flush_tlb_multi(const struct cpumask *cpumask,
+const struct flush_tlb_info *info);
+
 void native_flush_tlb_others(const struct cpumask *cpumask,
 const struct flush_tlb_info *info);
 
@@ -593,6 +596,9 @@ static inline void arch_tlbbatch_add_mm(struct 
arch_tlbflush_unmap_batch *batch,
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 
 #ifndef CONFIG_PARAVIRT
+#define flush_tlb_multi(mask, info)\
+   native_flush_tlb_multi(mask, info)
+
 #define flush_tlb_others(mask, info)   \
native_flush_tlb_others(mask, info)
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 5169b8cc35bb..00d81e898717 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -630,6 +630,7 @@ static void __init kvm_guest_init(void)
kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
+   static_key_disable(&flush_tlb_multi_enabled.key);
}
 
if 

Re: [RFC PATCH v2 04/12] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-05-31 Thread Nadav Amit via Virtualization
> On May 31, 2019, at 4:48 AM, Juergen Gross  wrote:
> 
> On 31/05/2019 08:36, Nadav Amit wrote:
>> 
>> --- a/arch/x86/include/asm/paravirt_types.h
>> +++ b/arch/x86/include/asm/paravirt_types.h
>> @@ -211,6 +211,12 @@ struct pv_mmu_ops {
>>  void (*flush_tlb_user)(void);
>>  void (*flush_tlb_kernel)(void);
>>  void (*flush_tlb_one_user)(unsigned long addr);
>> +/*
>> + * flush_tlb_multi() is the preferred interface. When it is used,
>> + * flush_tlb_others() should return false.
> 
> Didn't you want to remove/change this comment?

Yes! Sorry for that. Fixed now.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[RFC PATCH v2 04/12] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-05-31 Thread Nadav Amit via Virtualization
To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. The current
flush_tlb_others() interface is kept, since paravirtual interfaces need
to be adapted first before it can be removed. This is left for future
work. In such PV environments, TLB flushes are not performed, at this
time, concurrently.

Add a static key to tell whether this new interface is supported.

Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Sasha Levin 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: Juergen Gross 
Cc: Paolo Bonzini 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Boris Ostrovsky 
Cc: linux-hyp...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: k...@vger.kernel.org
Cc: xen-de...@lists.xenproject.org
Signed-off-by: Nadav Amit 
---
 arch/x86/hyperv/mmu.c |  2 +
 arch/x86/include/asm/paravirt.h   |  8 +++
 arch/x86/include/asm/paravirt_types.h |  6 ++
 arch/x86/include/asm/tlbflush.h   |  6 ++
 arch/x86/kernel/kvm.c |  1 +
 arch/x86/kernel/paravirt.c|  3 +
 arch/x86/mm/tlb.c | 80 +++
 arch/x86/xen/mmu_pv.c |  2 +
 8 files changed, 96 insertions(+), 12 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index e65d7fe6489f..ca28b400c87c 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -233,4 +233,6 @@ void hyperv_setup_mmu_ops(void)
pr_info("Using hypercall for remote TLB flush\n");
pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
+
+   static_key_disable(&flush_tlb_multi_enabled.key);
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index c25c38a05c1c..192be7254457 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -47,6 +47,8 @@ static inline void slow_down_io(void)
 #endif
 }
 
+DECLARE_STATIC_KEY_TRUE(flush_tlb_multi_enabled);
+
 static inline void __flush_tlb(void)
 {
PVOP_VCALL0(mmu.flush_tlb_user);
@@ -62,6 +64,12 @@ static inline void __flush_tlb_one_user(unsigned long addr)
PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
 }
 
+static inline void flush_tlb_multi(const struct cpumask *cpumask,
+  const struct flush_tlb_info *info)
+{
+   PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
+}
+
 static inline void flush_tlb_others(const struct cpumask *cpumask,
const struct flush_tlb_info *info)
 {
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 946f8f1f1efc..3a156e63c57d 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -211,6 +211,12 @@ struct pv_mmu_ops {
void (*flush_tlb_user)(void);
void (*flush_tlb_kernel)(void);
void (*flush_tlb_one_user)(unsigned long addr);
+   /*
+* flush_tlb_multi() is the preferred interface. When it is used,
+* flush_tlb_others() should return false.
+*/
+   void (*flush_tlb_multi)(const struct cpumask *cpus,
+   const struct flush_tlb_info *info);
void (*flush_tlb_others)(const struct cpumask *cpus,
 const struct flush_tlb_info *info);
 
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index dee375831962..79272938cf79 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -569,6 +569,9 @@ static inline void flush_tlb_page(struct vm_area_struct 
*vma, unsigned long a)
flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
 
+void native_flush_tlb_multi(const struct cpumask *cpumask,
+const struct flush_tlb_info *info);
+
 void native_flush_tlb_others(const struct cpumask *cpumask,
 const struct flush_tlb_info *info);
 
@@ -593,6 +596,9 @@ static inline void arch_tlbbatch_add_mm(struct 
arch_tlbflush_unmap_batch *batch,
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 
 #ifndef CONFIG_PARAVIRT
+#define flush_tlb_multi(mask, info)\
+   native_flush_tlb_multi(mask, info)
+
 #define flush_tlb_others(mask, info)   \
native_flush_tlb_others(mask, info)
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 3f0cc828cc36..c1c2b88ea3f1 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -643,6 +643,7 @@ static void __init kvm_guest_init(void)
kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
+   static_key_disable(&flush_tlb_multi_enabled.key);
}
 
if 

Re: [RFC PATCH 5/6] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-05-27 Thread Nadav Amit via Virtualization
> On May 27, 2019, at 2:47 AM, Peter Zijlstra  wrote:
> 
> On Sat, May 25, 2019 at 10:54:50AM +0200, Juergen Gross wrote:
>> On 25/05/2019 10:22, Nadav Amit wrote:
> 
>>> diff --git a/arch/x86/include/asm/paravirt_types.h 
>>> b/arch/x86/include/asm/paravirt_types.h
>>> index 946f8f1f1efc..3a156e63c57d 100644
>>> --- a/arch/x86/include/asm/paravirt_types.h
>>> +++ b/arch/x86/include/asm/paravirt_types.h
>>> @@ -211,6 +211,12 @@ struct pv_mmu_ops {
>>> void (*flush_tlb_user)(void);
>>> void (*flush_tlb_kernel)(void);
>>> void (*flush_tlb_one_user)(unsigned long addr);
>>> +   /*
>>> +* flush_tlb_multi() is the preferred interface. When it is used,
>>> +* flush_tlb_others() should return false.
>> 
>> This comment does not make sense. flush_tlb_others() return type is
>> void.
> 
> I suspect that is an artifact from before the static_key; an attempt to
> make the pv interface less awkward.

Yes, remainders that should have been removed - I will remove them for the
next version.

> Something like the below would work for KVM I suspect, the others
> (Hyper-V and Xen are more 'interesting').
> 
> ---
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -580,7 +580,7 @@ static void __init kvm_apf_trap_init(voi
> 
> static DEFINE_PER_CPU(cpumask_var_t, __pv_tlb_mask);
> 
> -static void kvm_flush_tlb_others(const struct cpumask *cpumask,
> +static void kvm_flush_tlb_multi(const struct cpumask *cpumask,
>   const struct flush_tlb_info *info)
> {
>   u8 state;
> @@ -594,6 +594,9 @@ static void kvm_flush_tlb_others(const s
>* queue flush_on_enter for pre-empted vCPUs
>*/
>   for_each_cpu(cpu, flushmask) {
> + if (cpu == smp_processor_id())
> + continue;
> +
>   src = &per_cpu(steal_time, cpu);
>   state = READ_ONCE(src->preempted);
>   if ((state & KVM_VCPU_PREEMPTED)) {
> @@ -603,7 +606,7 @@ static void kvm_flush_tlb_others(const s
>   }
>   }
> 
> - native_flush_tlb_others(flushmask, info);
> + native_flush_tlb_multi(flushmask, info);
> }
> 
> static void __init kvm_guest_init(void)
> @@ -628,9 +631,8 @@ static void __init kvm_guest_init(void)
>   if (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
>   !kvm_para_has_hint(KVM_HINTS_REALTIME) &&
>   kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
> - pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
> + pv_ops.mmu.flush_tlb_multi = kvm_flush_tlb_multi;
>   pv_ops.mmu.tlb_remove_table = tlb_remove_table;
> - static_key_disable(&flush_tlb_multi_enabled.key);
>   }
> 
>   if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))

That’s what I have as well ;-).

As you mentioned (in another email), specifically hyper-v code seems
convoluted to me. In general, I prefer not to touch KVM/Xen/hyper-v, but since
you twist my arm, I will send compile-tested versions for Xen and hyper-v.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [RFC PATCH 5/6] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-05-25 Thread Nadav Amit via Virtualization
> On May 25, 2019, at 1:22 AM, Nadav Amit  wrote:
> 
> To improve TLB shootdown performance, flush the remote and local TLBs
> concurrently. Introduce flush_tlb_multi() that does so. The current
> flush_tlb_others() interface is kept, since paravirtual interfaces need
> to be adapted first before it can be removed. This is left for future
> work. In such PV environments, TLB flushes are not performed, at this
> time, concurrently.
> 
> +void native_flush_tlb_multi(const struct cpumask *cpumask,
> + const struct flush_tlb_info *info)
> {
> + /*
> +  * native_flush_tlb_multi() can handle a single CPU, but it is
> +  * suboptimal if the local TLB should be flushed, and therefore should
> +  * not be used in such case. Check that it is not used in such case,
> +  * and use this assumption for tracing and accounting of remote TLB
> +  * flushes.
> +  */
> + VM_WARN_ON(!cpumask_any_but(cpumask, smp_processor_id()));

This warning might fire off incorrectly and will be removed.
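
Presumably because cpumask_any_but() returns a CPU number (and >= nr_cpu_ids
when no other CPU is set), the intended assertion would be something along
the lines of:

	/* Sketch: warn if the mask holds no CPU besides the local one, i.e.
	 * the caller should have taken the local-only fast path instead. */
	VM_WARN_ON(cpumask_any_but(cpumask, smp_processor_id()) >= nr_cpu_ids);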

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[RFC PATCH 5/6] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-05-25 Thread Nadav Amit via Virtualization
To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. The current
flush_tlb_others() interface is kept, since paravirtual interfaces need
to be adapted first before it can be removed. This is left for future
work. In such PV environments, TLB flushes are not performed, at this
time, concurrently.

Add a static key to tell whether this new interface is supported.

Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Sasha Levin 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: Juergen Gross 
Cc: Paolo Bonzini 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Boris Ostrovsky 
Cc: linux-hyp...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: k...@vger.kernel.org
Cc: xen-de...@lists.xenproject.org
Signed-off-by: Nadav Amit 
---
 arch/x86/hyperv/mmu.c |  2 +
 arch/x86/include/asm/paravirt.h   |  8 +++
 arch/x86/include/asm/paravirt_types.h |  6 ++
 arch/x86/include/asm/tlbflush.h   |  6 ++
 arch/x86/kernel/kvm.c |  1 +
 arch/x86/kernel/paravirt.c|  3 +
 arch/x86/mm/tlb.c | 80 +++
 arch/x86/xen/mmu_pv.c |  2 +
 8 files changed, 96 insertions(+), 12 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index e65d7fe6489f..ca28b400c87c 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -233,4 +233,6 @@ void hyperv_setup_mmu_ops(void)
pr_info("Using hypercall for remote TLB flush\n");
pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
+
+   static_key_disable(&flush_tlb_multi_enabled.key);
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index c25c38a05c1c..192be7254457 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -47,6 +47,8 @@ static inline void slow_down_io(void)
 #endif
 }
 
+DECLARE_STATIC_KEY_TRUE(flush_tlb_multi_enabled);
+
 static inline void __flush_tlb(void)
 {
PVOP_VCALL0(mmu.flush_tlb_user);
@@ -62,6 +64,12 @@ static inline void __flush_tlb_one_user(unsigned long addr)
PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
 }
 
+static inline void flush_tlb_multi(const struct cpumask *cpumask,
+  const struct flush_tlb_info *info)
+{
+   PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
+}
+
 static inline void flush_tlb_others(const struct cpumask *cpumask,
const struct flush_tlb_info *info)
 {
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 946f8f1f1efc..3a156e63c57d 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -211,6 +211,12 @@ struct pv_mmu_ops {
void (*flush_tlb_user)(void);
void (*flush_tlb_kernel)(void);
void (*flush_tlb_one_user)(unsigned long addr);
+   /*
+* flush_tlb_multi() is the preferred interface. When it is used,
+* flush_tlb_others() should return false.
+*/
+   void (*flush_tlb_multi)(const struct cpumask *cpus,
+   const struct flush_tlb_info *info);
void (*flush_tlb_others)(const struct cpumask *cpus,
 const struct flush_tlb_info *info);
 
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index dee375831962..79272938cf79 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -569,6 +569,9 @@ static inline void flush_tlb_page(struct vm_area_struct 
*vma, unsigned long a)
flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
 
+void native_flush_tlb_multi(const struct cpumask *cpumask,
+const struct flush_tlb_info *info);
+
 void native_flush_tlb_others(const struct cpumask *cpumask,
 const struct flush_tlb_info *info);
 
@@ -593,6 +596,9 @@ static inline void arch_tlbbatch_add_mm(struct 
arch_tlbflush_unmap_batch *batch,
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 
 #ifndef CONFIG_PARAVIRT
+#define flush_tlb_multi(mask, info)\
+   native_flush_tlb_multi(mask, info)
+
 #define flush_tlb_others(mask, info)   \
native_flush_tlb_others(mask, info)
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 3f0cc828cc36..c1c2b88ea3f1 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -643,6 +643,7 @@ static void __init kvm_guest_init(void)
kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
+   static_key_disable(&flush_tlb_multi_enabled.key);
}
 
if 

Re: [PATCH v4 0/4] vmw_balloon: Compaction and shrinker support

2019-05-03 Thread Nadav Amit via Virtualization
> On Apr 25, 2019, at 4:54 AM, Nadav Amit  wrote:
> 
> VMware balloon enhancements: adding support for memory compaction,
> memory shrinker (to prevent OOM) and splitting of refused pages to
> prevent recurring inflations.
> 
> Patches 1-2: Support for compaction
> Patch 3: Support for memory shrinker - disabled by default
> Patch 4: Split refused pages to improve performance
> 
> v3->v4:
> * "get around to" comment [Michael]
> * Put list_add under page lock [Michael]
> 
> v2->v3:
> * Fixing wrong argument type (int->size_t) [Michael]
> * Fixing a comment (it) [Michael]
> * Reinstating the BUG_ON() when page is locked [Michael] 
> 
> v1->v2:
> * Return number of pages in list enqueue/dequeue interfaces [Michael]
> * Removed first two patches which were already merged
> 
> Nadav Amit (4):
>  mm/balloon_compaction: List interfaces
>  vmw_balloon: Compaction support
>  vmw_balloon: Add memory shrinker
>  vmw_balloon: Split refused pages
> 
> drivers/misc/Kconfig   |   1 +
> drivers/misc/vmw_balloon.c | 489 ++---
> include/linux/balloon_compaction.h |   4 +
> mm/balloon_compaction.c| 144 ++---
> 4 files changed, 553 insertions(+), 85 deletions(-)
> 
> -- 
> 2.19.1

Ping.


[PATCH v4 4/4] vmw_balloon: Split refused pages

2019-04-25 Thread Nadav Amit via Virtualization
The hypervisor might refuse to inflate pages. While the balloon driver
handles this scenario correctly, a refusal to inflate a 2MB page might
cause the same page to be allocated again later just for its inflation
to be refused again. This wastes energy and time.

To avoid this situation, split the 2MB page to 4KB pages, and then try
to inflate each one individually. Most of the 4KB pages out of the 2MB
should be inflated successfully, and the balloon is likely to prevent
the scenario of repeated refused inflation.
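
The core of the approach distills to a plain split_page() call. A minimal
sketch (the names are illustrative, not the driver's) that splits one
refused higher-order page into order-0 pages and stashes them for reuse:

#include <linux/list.h>
#include <linux/mm.h>

/* Split one refused huge page (order 9 for 2MB on x86-64) into 4KB pages. */
static void sketch_split_refused(struct page *page, unsigned int order,
                                 struct list_head *prealloc)
{
        unsigned int i;

        split_page(page, order);
        for (i = 0; i < (1 << order); i++)
                list_add(&page[i].lru, prealloc);
}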

Reviewed-by: Xavier Deguillard 
Signed-off-by: Nadav Amit 
---
 drivers/misc/vmw_balloon.c | 63 +++---
 1 file changed, 52 insertions(+), 11 deletions(-)

diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c
index 4b5e939ff4c8..043eed845246 100644
--- a/drivers/misc/vmw_balloon.c
+++ b/drivers/misc/vmw_balloon.c
@@ -239,6 +239,7 @@ static DEFINE_STATIC_KEY_FALSE(balloon_stat_enabled);
 struct vmballoon_ctl {
struct list_head pages;
struct list_head refused_pages;
+   struct list_head prealloc_pages;
unsigned int n_refused_pages;
unsigned int n_pages;
enum vmballoon_page_size_type page_size;
@@ -668,15 +669,25 @@ static int vmballoon_alloc_page_list(struct vmballoon *b,
unsigned int i;
 
for (i = 0; i < req_n_pages; i++) {
-   if (ctl->page_size == VMW_BALLOON_2M_PAGE)
-   page = alloc_pages(__GFP_HIGHMEM|__GFP_NOWARN|
+   /*
+* First check if we happen to have pages that were allocated
+* before. This happens when a 2MB page was rejected during
+* inflation by the hypervisor and then split into 4KB pages.
+*/
+   if (!list_empty(&ctl->prealloc_pages)) {
+   page = list_first_entry(&ctl->prealloc_pages,
+   struct page, lru);
+   list_del(&page->lru);
+   } else {
+   if (ctl->page_size == VMW_BALLOON_2M_PAGE)
+   page = alloc_pages(__GFP_HIGHMEM|__GFP_NOWARN|
__GFP_NOMEMALLOC, VMW_BALLOON_2M_ORDER);
-   else
-   page = balloon_page_alloc();
+   else
+   page = balloon_page_alloc();
 
-   /* Update statistics */
-   vmballoon_stats_page_inc(b, VMW_BALLOON_PAGE_STAT_ALLOC,
-ctl->page_size);
+   vmballoon_stats_page_inc(b, VMW_BALLOON_PAGE_STAT_ALLOC,
+ctl->page_size);
+   }
 
if (page) {
vmballoon_mark_page_offline(page, ctl->page_size);
@@ -922,7 +933,8 @@ static void vmballoon_release_page_list(struct list_head 
*page_list,
__free_pages(page, vmballoon_page_order(page_size));
}
 
-   *n_pages = 0;
+   if (n_pages)
+   *n_pages = 0;
 }
 
 
@@ -1054,6 +1066,32 @@ static void vmballoon_dequeue_page_list(struct vmballoon 
*b,
*n_pages = i;
 }
 
+/**
+ * vmballoon_split_refused_pages() - Split the 2MB refused pages to 4k.
+ *
+ * If inflation of 2MB pages was denied by the hypervisor, it is likely to be
+ * due to one or few 4KB pages. These 2MB pages may keep being allocated and
+ * then being refused. To prevent this case, this function splits the refused
+ * pages into 4KB pages and adds them into @prealloc_pages list.
+ *
+ * @ctl: pointer for the %struct vmballoon_ctl, which defines the operation.
+ */
+static void vmballoon_split_refused_pages(struct vmballoon_ctl *ctl)
+{
+   struct page *page, *tmp;
+   unsigned int i, order;
+
+   order = vmballoon_page_order(ctl->page_size);
+
+   list_for_each_entry_safe(page, tmp, &ctl->refused_pages, lru) {
+   list_del(&page->lru);
+   split_page(page, order);
+   for (i = 0; i < (1 << order); i++)
+   list_add(&page[i].lru, &ctl->prealloc_pages);
+   }
+   ctl->n_refused_pages = 0;
+}
+
 /**
  * vmballoon_inflate() - Inflate the balloon towards its target size.
  *
@@ -1065,6 +1103,7 @@ static void vmballoon_inflate(struct vmballoon *b)
struct vmballoon_ctl ctl = {
.pages = LIST_HEAD_INIT(ctl.pages),
.refused_pages = LIST_HEAD_INIT(ctl.refused_pages),
+   .prealloc_pages = LIST_HEAD_INIT(ctl.prealloc_pages),
.page_size = b->max_page_size,
.op = VMW_BALLOON_INFLATE
};
@@ -1112,10 +1151,10 @@ static void vmballoon_inflate(struct vmballoon *b)
break;
 
/*
-* Ignore errors from locking as we now switch to 4k
-* pages and we might get different errors.
+* Split the refused pages to 4k. 

[PATCH v4 2/4] vmw_balloon: Compaction support

2019-04-25 Thread Nadav Amit via Virtualization
Add support for compaction to the VMware balloon. Unlike the virtio
balloon, we also support huge pages, which do not go through
compaction, so we keep these pages in vmballoon and handle their list
separately. We use the same lock to protect both lists, as this lock is
not supposed to be contended.

Doing so also eliminates the need for the page_size lists. We update the
accounting as needed so that inflation, deflation and migration are
reflected in vmstat.

Since VMware balloon now provides statistics for inflation, deflation
and migration in vmstat, select MEMORY_BALLOON in Kconfig.
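
For reference, a hedged sketch of the wiring a driver needs for balloon
compaction: initialize a balloon_dev_info and install a migration callback.
The callback body below is a simplified assumption (the real vmballoon
callback also unlocks the old page and locks the new one with the
hypervisor), and the balloon mount/inode setup required for migration is
omitted; the sketch_* names are illustrative:

#include <linux/balloon_compaction.h>
#include <linux/migrate.h>

static struct balloon_dev_info sketch_dev_info;

static int sketch_migratepage(struct balloon_dev_info *b_dev_info,
                              struct page *newpage, struct page *page,
                              enum migrate_mode mode)
{
        unsigned long flags;

        /* Hypervisor-specific handling of @page and @newpage goes here. */
        spin_lock_irqsave(&b_dev_info->pages_lock, flags);
        balloon_page_insert(b_dev_info, newpage);  /* start tracking newpage */
        balloon_page_delete(page);                 /* stop tracking old page */
        spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
        return MIGRATEPAGE_SUCCESS;
}

static void sketch_balloon_init(void)
{
        balloon_devinfo_init(&sketch_dev_info);
        sketch_dev_info.migratepage = sketch_migratepage;
}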

Reviewed-by: Xavier Deguillard 
Signed-off-by: Nadav Amit 
---
 drivers/misc/Kconfig   |   1 +
 drivers/misc/vmw_balloon.c | 301 -
 2 files changed, 264 insertions(+), 38 deletions(-)

diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index 42ab8ec92a04..427cf10579b4 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -420,6 +420,7 @@ config SPEAR13XX_PCIE_GADGET
 config VMWARE_BALLOON
tristate "VMware Balloon Driver"
depends on VMWARE_VMCI && X86 && HYPERVISOR_GUEST
+   select MEMORY_BALLOON
help
  This is VMware physical memory management driver which acts
  like a "balloon" that can be inflated to reclaim physical pages
diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c
index ad807d5a3141..2136f6ad97d3 100644
--- a/drivers/misc/vmw_balloon.c
+++ b/drivers/misc/vmw_balloon.c
@@ -28,6 +28,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -38,25 +40,11 @@ MODULE_ALIAS("dmi:*:svnVMware*:*");
 MODULE_ALIAS("vmware_vmmemctl");
 MODULE_LICENSE("GPL");
 
-/*
- * Use __GFP_HIGHMEM to allow pages from HIGHMEM zone. We don't allow wait
- * (__GFP_RECLAIM) for huge page allocations. Use __GFP_NOWARN, to suppress 
page
- * allocation failure warnings. Disallow access to emergency low-memory pools.
- */
-#define VMW_HUGE_PAGE_ALLOC_FLAGS  (__GFP_HIGHMEM|__GFP_NOWARN|\
-__GFP_NOMEMALLOC)
-
-/*
- * Use __GFP_HIGHMEM to allow pages from HIGHMEM zone. We allow lightweight
- * reclamation (__GFP_NORETRY). Use __GFP_NOWARN, to suppress page allocation
- * failure warnings. Disallow access to emergency low-memory pools.
- */
-#define VMW_PAGE_ALLOC_FLAGS   (__GFP_HIGHMEM|__GFP_NOWARN|\
-__GFP_NOMEMALLOC|__GFP_NORETRY)
-
-/* Maximum number of refused pages we accumulate during inflation cycle */
 #define VMW_BALLOON_MAX_REFUSED16
 
+/* Magic number for the balloon mount-point */
+#define BALLOON_VMW_MAGIC  0x0ba11007
+
 /*
  * Hypervisor communication port definitions.
  */
@@ -247,11 +235,6 @@ struct vmballoon_ctl {
enum vmballoon_op op;
 };
 
-struct vmballoon_page_size {
-   /* list of reserved physical pages */
-   struct list_head pages;
-};
-
 /**
  * struct vmballoon_batch_entry - a batch entry for lock or unlock.
  *
@@ -266,8 +249,6 @@ struct vmballoon_batch_entry {
 } __packed;
 
 struct vmballoon {
-   struct vmballoon_page_size page_sizes[VMW_BALLOON_NUM_PAGE_SIZES];
-
/**
 * @max_page_size: maximum supported page size for ballooning.
 *
@@ -348,8 +329,20 @@ struct vmballoon {
struct dentry *dbg_entry;
 #endif
 
+   /**
+* @b_dev_info: balloon device information descriptor.
+*/
+   struct balloon_dev_info b_dev_info;
+
struct delayed_work dwork;
 
+   /**
+* @huge_pages - list of the inflated 2MB pages.
+*
+* Protected by @b_dev_info.pages_lock .
+*/
+   struct list_head huge_pages;
+
/**
 * @vmci_doorbell.
 *
@@ -643,10 +636,10 @@ static int vmballoon_alloc_page_list(struct vmballoon *b,
 
for (i = 0; i < req_n_pages; i++) {
if (ctl->page_size == VMW_BALLOON_2M_PAGE)
-   page = alloc_pages(VMW_HUGE_PAGE_ALLOC_FLAGS,
-  VMW_BALLOON_2M_ORDER);
+   page = alloc_pages(__GFP_HIGHMEM|__GFP_NOWARN|
+   __GFP_NOMEMALLOC, VMW_BALLOON_2M_ORDER);
else
-   page = alloc_page(VMW_PAGE_ALLOC_FLAGS);
+   page = balloon_page_alloc();
 
/* Update statistics */
vmballoon_stats_page_inc(b, VMW_BALLOON_PAGE_STAT_ALLOC,
@@ -961,9 +954,22 @@ static void vmballoon_enqueue_page_list(struct vmballoon 
*b,
unsigned int *n_pages,
enum vmballoon_page_size_type page_size)
 {
-   struct vmballoon_page_size *page_size_info = &b->page_sizes[page_size];
+   unsigned long flags;
+
+   if (page_size == VMW_BALLOON_4K_PAGE) {
+   balloon_page_list_enqueue(&b->b_dev_info, pages);
+   } else {
+  

[PATCH v4 3/4] vmw_balloon: Add memory shrinker

2019-04-25 Thread Nadav Amit via Virtualization
Add a shrinker to the VMware balloon to prevent out-of-memory events.
We reuse the deflate logic for this matter. Deadlocks should not happen,
as no memory allocation is performed while the locks of the
communication (batch/page) and page-list are taken. In the unlikely
event that the configuration semaphore is taken for write, we bail
out and fail gracefully (causing processes to be killed).

Once the shrinker is called, inflation is postponed for a few seconds.
The timeout is updated without any lock, but this should not cause any
races, as it is written and read atomically.

This feature is disabled by default, since it might cause performance
degradation.
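
The shrinker side follows the usual count/scan pattern. A simplified,
hedged sketch of the registration, reusing the driver's own balloon global
and vmballoon_deflate() as used in the patch, and assuming the
single-argument register_shrinker() of this era; the real scan callback
additionally takes conf_sem with down_read_trylock() and updates the
shrink timeout:

#include <linux/shrinker.h>

static unsigned long sketch_shrinker_count(struct shrinker *shrinker,
                                           struct shrink_control *sc)
{
        /* Report roughly how many frames deflation could free. */
        return atomic64_read(&balloon.size);
}

static unsigned long sketch_shrinker_scan(struct shrinker *shrinker,
                                          struct shrink_control *sc)
{
        /* Deflate up to sc->nr_to_scan frames; return how many were freed. */
        return vmballoon_deflate(&balloon, sc->nr_to_scan, true);
}

static int sketch_shrinker_register(struct vmballoon *b)
{
        b->shrinker.count_objects = sketch_shrinker_count;
        b->shrinker.scan_objects = sketch_shrinker_scan;
        b->shrinker.seeks = DEFAULT_SEEKS;
        return register_shrinker(&b->shrinker);
}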

Reviewed-by: Xavier Deguillard 
Signed-off-by: Nadav Amit 
---
 drivers/misc/vmw_balloon.c | 133 -
 1 file changed, 131 insertions(+), 2 deletions(-)

diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c
index 2136f6ad97d3..4b5e939ff4c8 100644
--- a/drivers/misc/vmw_balloon.c
+++ b/drivers/misc/vmw_balloon.c
@@ -40,6 +40,15 @@ MODULE_ALIAS("dmi:*:svnVMware*:*");
 MODULE_ALIAS("vmware_vmmemctl");
 MODULE_LICENSE("GPL");
 
+static bool __read_mostly vmwballoon_shrinker_enable;
+module_param(vmwballoon_shrinker_enable, bool, 0444);
+MODULE_PARM_DESC(vmwballoon_shrinker_enable,
+   "Enable non-cooperative out-of-memory protection. Disabled by default 
as it may degrade performance.");
+
+/* Delay in seconds after shrink before inflation. */
+#define VMBALLOON_SHRINK_DELAY (5)
+
+/* Maximum number of refused pages we accumulate during inflation cycle */
 #define VMW_BALLOON_MAX_REFUSED16
 
 /* Magic number for the balloon mount-point */
@@ -217,12 +226,13 @@ enum vmballoon_stat_general {
VMW_BALLOON_STAT_TIMER,
VMW_BALLOON_STAT_DOORBELL,
VMW_BALLOON_STAT_RESET,
-   VMW_BALLOON_STAT_LAST = VMW_BALLOON_STAT_RESET
+   VMW_BALLOON_STAT_SHRINK,
+   VMW_BALLOON_STAT_SHRINK_FREE,
+   VMW_BALLOON_STAT_LAST = VMW_BALLOON_STAT_SHRINK_FREE
 };
 
 #define VMW_BALLOON_STAT_NUM   (VMW_BALLOON_STAT_LAST + 1)
 
-
 static DEFINE_STATIC_KEY_TRUE(vmw_balloon_batching);
 static DEFINE_STATIC_KEY_FALSE(balloon_stat_enabled);
 
@@ -321,6 +331,15 @@ struct vmballoon {
 */
struct page *page;
 
+   /**
+* @shrink_timeout: timeout until the next inflation.
+*
+* After a shrink event, indicates the time in jiffies after which
+* inflation is allowed again. Can be written concurrently with reads,
+* so must use READ_ONCE/WRITE_ONCE when accessing.
+*/
+   unsigned long shrink_timeout;
+
/* statistics */
struct vmballoon_stats *stats;
 
@@ -361,6 +380,20 @@ struct vmballoon {
 * Lock ordering: @conf_sem -> @comm_lock .
 */
spinlock_t comm_lock;
+
+   /**
+* @shrinker: shrinker interface that is used to avoid over-inflation.
+*/
+   struct shrinker shrinker;
+
+   /**
+* @shrinker_registered: whether the shrinker was registered.
+*
+* The shrinker interface does not handle gracefully the removal of
+* shrinker that was not registered before. This indication allows to
+* simplify the unregistration process.
+*/
+   bool shrinker_registered;
 };
 
 static struct vmballoon balloon;
@@ -935,6 +968,10 @@ static int64_t vmballoon_change(struct vmballoon *b)
size - target < vmballoon_page_in_frames(VMW_BALLOON_2M_PAGE))
return 0;
 
+   /* If an out-of-memory recently occurred, inflation is disallowed. */
+   if (target > size && time_before(jiffies, READ_ONCE(b->shrink_timeout)))
+   return 0;
+
return target - size;
 }
 
@@ -1430,6 +1467,90 @@ static void vmballoon_work(struct work_struct *work)
 
 }
 
+/**
+ * vmballoon_shrinker_scan() - deflate the balloon due to memory pressure.
+ * @shrinker: pointer to the balloon shrinker.
+ * @sc: page reclaim information.
+ *
+ * Returns: number of pages that were freed during deflation.
+ */
+static unsigned long vmballoon_shrinker_scan(struct shrinker *shrinker,
+struct shrink_control *sc)
+{
+   struct vmballoon *b = &balloon;
+   unsigned long deflated_frames;
+
+   pr_debug("%s - size: %llu", __func__, atomic64_read(&b->size));
+
+   vmballoon_stats_gen_inc(b, VMW_BALLOON_STAT_SHRINK);
+
+   /*
+* If the lock is also contended for read, we cannot easily reclaim and
+* we bail out.
+*/
+   if (!down_read_trylock(&b->conf_sem))
+   return 0;
+
+   deflated_frames = vmballoon_deflate(b, sc->nr_to_scan, true);
+
+   vmballoon_stats_gen_add(b, VMW_BALLOON_STAT_SHRINK_FREE,
+   deflated_frames);
+
+   /*
+* Delay future inflation for some time to mitigate the situations in
+* which balloon continuously grows and shrinks. Use 

[PATCH v4 1/4] mm/balloon_compaction: List interfaces

2019-04-25 Thread Nadav Amit via Virtualization
Introduce interfaces for enqueueing and dequeueing a list of balloon
pages. These interfaces reduce the overhead of storing and restoring
IRQs by batching the operations. In addition, they do not panic if the
list of pages is empty.
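
A hedged usage sketch of the new batched interfaces from a driver's
inflate path; hypervisor communication and error handling are omitted and
the helper name is illustrative:

#include <linux/balloon_compaction.h>

static size_t sketch_inflate_batch(struct balloon_dev_info *b_dev_info,
                                   size_t nr)
{
        LIST_HEAD(pages);
        struct page *page;
        size_t i;

        for (i = 0; i < nr; i++) {
                page = balloon_page_alloc();
                if (!page)
                        break;
                /* balloon_page_alloc() does not put the page on any list. */
                list_add(&page->lru, &pages);
        }

        /* ...tell the hypervisor about the whole batch here... */

        /* One pages_lock round trip for the batch instead of one per page. */
        return balloon_page_list_enqueue(b_dev_info, &pages);
}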

Cc: Jason Wang 
Cc: linux...@kvack.org
Cc: virtualization@lists.linux-foundation.org
Acked-by: Michael S. Tsirkin 
Reviewed-by: Xavier Deguillard 
Signed-off-by: Nadav Amit 
---
 include/linux/balloon_compaction.h |   4 +
 mm/balloon_compaction.c| 144 +
 2 files changed, 110 insertions(+), 38 deletions(-)

diff --git a/include/linux/balloon_compaction.h 
b/include/linux/balloon_compaction.h
index f111c780ef1d..430b6047cef7 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -64,6 +64,10 @@ extern struct page *balloon_page_alloc(void);
 extern void balloon_page_enqueue(struct balloon_dev_info *b_dev_info,
 struct page *page);
 extern struct page *balloon_page_dequeue(struct balloon_dev_info *b_dev_info);
+extern size_t balloon_page_list_enqueue(struct balloon_dev_info *b_dev_info,
+ struct list_head *pages);
+extern size_t balloon_page_list_dequeue(struct balloon_dev_info *b_dev_info,
+struct list_head *pages, size_t 
n_req_pages);
 
 static inline void balloon_devinfo_init(struct balloon_dev_info *balloon)
 {
diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index ef858d547e2d..b7bd72612c5a 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -10,6 +10,105 @@
 #include 
 #include 
 
+static void balloon_page_enqueue_one(struct balloon_dev_info *b_dev_info,
+struct page *page)
+{
+   /*
+* Block others from accessing the 'page' when we get around to
+* establishing additional references. We should be the only one
+* holding a reference to the 'page' at this point. If we are not, then
+* memory corruption is possible and we should stop execution.
+*/
+   BUG_ON(!trylock_page(page));
+   list_del(&page->lru);
+   balloon_page_insert(b_dev_info, page);
+   unlock_page(page);
+   __count_vm_event(BALLOON_INFLATE);
+}
+
+/**
+ * balloon_page_list_enqueue() - inserts a list of pages into the balloon page
+ *  list.
+ * @b_dev_info: balloon device descriptor where we will insert a new page to
+ * @pages: pages to enqueue - allocated using balloon_page_alloc.
+ *
+ * Driver must call it to properly enqueue balloon pages before definitively
+ * removing them from the guest system.
+ *
+ * Return: number of pages that were enqueued.
+ */
+size_t balloon_page_list_enqueue(struct balloon_dev_info *b_dev_info,
+struct list_head *pages)
+{
+   struct page *page, *tmp;
+   unsigned long flags;
+   size_t n_pages = 0;
+
+   spin_lock_irqsave(&b_dev_info->pages_lock, flags);
+   list_for_each_entry_safe(page, tmp, pages, lru) {
+   balloon_page_enqueue_one(b_dev_info, page);
+   n_pages++;
+   }
+   spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
+   return n_pages;
+}
+EXPORT_SYMBOL_GPL(balloon_page_list_enqueue);
+
+/**
+ * balloon_page_list_dequeue() - removes pages from balloon's page list and
+ *  returns a list of the pages.
+ * @b_dev_info: balloon device descriptor where we will grab a page from.
+ * @pages: pointer to the list of pages that would be returned to the caller.
+ * @n_req_pages: number of requested pages.
+ *
+ * Driver must call this function to properly de-allocate previously enlisted
+ * balloon pages before definitively releasing them back to the guest system.
+ * This function tries to remove @n_req_pages from the ballooned pages and
+ * return them to the caller in the @pages list.
+ *
+ * Note that this function may fail to dequeue some pages even if the balloon
+ * isn't empty, since the page list can be temporarily empty due to the
+ * compaction of isolated pages.
+ *
+ * Return: number of pages that were added to the @pages list.
+ */
+size_t balloon_page_list_dequeue(struct balloon_dev_info *b_dev_info,
+struct list_head *pages, size_t n_req_pages)
+{
+   struct page *page, *tmp;
+   unsigned long flags;
+   size_t n_pages = 0;
+
+   spin_lock_irqsave(&b_dev_info->pages_lock, flags);
+   list_for_each_entry_safe(page, tmp, &b_dev_info->pages, lru) {
+   if (n_pages == n_req_pages)
+   break;
+
+   /*
+* Block others from accessing the 'page' while we get around to
+* establishing additional references and preparing the 'page'
+* to be released by the balloon driver.
+*/
+   if (!trylock_page(page))
+   continue;
+
+   if (IS_ENABLED(CONFIG_BALLOON_COMPACTION) &&
+   

[PATCH v4 0/4] vmw_balloon: Compaction and shrinker support

2019-04-25 Thread Nadav Amit via Virtualization
VMware balloon enhancements: adding support for memory compaction,
memory shrinker (to prevent OOM) and splitting of refused pages to
prevent recurring inflations.

Patches 1-2: Support for compaction
Patch 3: Support for memory shrinker - disabled by default
Patch 4: Split refused pages to improve performance

v3->v4:
* "get around to" comment [Michael]
* Put list_add under page lock [Michael]

v2->v3:
* Fixing wrong argument type (int->size_t) [Michael]
* Fixing a comment (it) [Michael]
* Reinstating the BUG_ON() when page is locked [Michael] 

v1->v2:
* Return number of pages in list enqueue/dequeue interfaces [Michael]
* Removed first two patches which were already merged

Nadav Amit (4):
  mm/balloon_compaction: List interfaces
  vmw_balloon: Compaction support
  vmw_balloon: Add memory shrinker
  vmw_balloon: Split refused pages

 drivers/misc/Kconfig   |   1 +
 drivers/misc/vmw_balloon.c | 489 ++---
 include/linux/balloon_compaction.h |   4 +
 mm/balloon_compaction.c| 144 ++---
 4 files changed, 553 insertions(+), 85 deletions(-)

-- 
2.19.1



Re: [PATCH v3 1/4] mm/balloon_compaction: list interfaces

2019-04-25 Thread Nadav Amit via Virtualization
> On Apr 24, 2019, at 6:49 AM, Michael S. Tsirkin  wrote:
> 
> On Tue, Apr 23, 2019 at 04:45:28PM -0700, Nadav Amit wrote:
>> Introduce interfaces for ballooning enqueueing and dequeueing of a list
>> of pages. These interfaces reduce the overhead of storing and restoring
>> IRQs by batching the operations. In addition they do not panic if the
>> list of pages is empty.
>> 
>> Cc: "Michael S. Tsirkin" 
>> Cc: Jason Wang 
>> Cc: linux...@kvack.org
>> Cc: virtualization@lists.linux-foundation.org
>> Reviewed-by: Xavier Deguillard 
>> Signed-off-by: Nadav Amit 
> 
> 
> Looks good overall. Two minor comments below.
> 
> 
>> ---
>> include/linux/balloon_compaction.h |   4 +
>> mm/balloon_compaction.c| 144 +
>> 2 files changed, 110 insertions(+), 38 deletions(-)
>> 
>> diff --git a/include/linux/balloon_compaction.h 
>> b/include/linux/balloon_compaction.h
>> index f111c780ef1d..430b6047cef7 100644
>> --- a/include/linux/balloon_compaction.h
>> +++ b/include/linux/balloon_compaction.h
>> @@ -64,6 +64,10 @@ extern struct page *balloon_page_alloc(void);
>> extern void balloon_page_enqueue(struct balloon_dev_info *b_dev_info,
>>   struct page *page);
>> extern struct page *balloon_page_dequeue(struct balloon_dev_info 
>> *b_dev_info);
>> +extern size_t balloon_page_list_enqueue(struct balloon_dev_info *b_dev_info,
>> +  struct list_head *pages);
>> +extern size_t balloon_page_list_dequeue(struct balloon_dev_info *b_dev_info,
>> + struct list_head *pages, size_t 
>> n_req_pages);
>> 
>> static inline void balloon_devinfo_init(struct balloon_dev_info *balloon)
>> {
>> diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
>> index ef858d547e2d..a2995002edc2 100644
>> --- a/mm/balloon_compaction.c
>> +++ b/mm/balloon_compaction.c
>> @@ -10,6 +10,105 @@
>> #include 
>> #include 
>> 
>> +static void balloon_page_enqueue_one(struct balloon_dev_info *b_dev_info,
>> + struct page *page)
>> +{
>> +/*
>> + * Block others from accessing the 'page' when we get around to
>> + * establishing additional references. We should be the only one
>> + * holding a reference to the 'page' at this point. If we are not, then
>> + * memory corruption is possible and we should stop execution.
>> + */
>> +BUG_ON(!trylock_page(page));
>> +list_del(&page->lru);
>> +balloon_page_insert(b_dev_info, page);
>> +unlock_page(page);
>> +__count_vm_event(BALLOON_INFLATE);
>> +}
>> +
>> +/**
>> + * balloon_page_list_enqueue() - inserts a list of pages into the balloon 
>> page
>> + *   list.
>> + * @b_dev_info: balloon device descriptor where we will insert a new page to
>> + * @pages: pages to enqueue - allocated using balloon_page_alloc.
>> + *
>> + * Driver must call it to properly enqueue a balloon pages before 
>> definitively
>> + * removing it from the guest system.
>> + *
>> + * Return: number of pages that were enqueued.
>> + */
>> +size_t balloon_page_list_enqueue(struct balloon_dev_info *b_dev_info,
>> + struct list_head *pages)
>> +{
>> +struct page *page, *tmp;
>> +unsigned long flags;
>> +size_t n_pages = 0;
>> +
>> +spin_lock_irqsave(&b_dev_info->pages_lock, flags);
>> +list_for_each_entry_safe(page, tmp, pages, lru) {
>> +balloon_page_enqueue_one(b_dev_info, page);
>> +n_pages++;
>> +}
>> +spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
>> +return n_pages;
>> +}
>> +EXPORT_SYMBOL_GPL(balloon_page_list_enqueue);
>> +
>> +/**
>> + * balloon_page_list_dequeue() - removes pages from balloon's page list and
>> + *   returns a list of the pages.
>> + * @b_dev_info: balloon device decriptor where we will grab a page from.
>> + * @pages: pointer to the list of pages that would be returned to the 
>> caller.
>> + * @n_req_pages: number of requested pages.
>> + *
>> + * Driver must call this function to properly de-allocate a previous 
>> enlisted
>> + * balloon pages before definetively releasing it back to the guest system.
>> + * This function tries to remove @n_req_pages from the ballooned pages and
>> + * return them to the caller in the @pages list.
>> + *
>> + * Note that this function may fail to dequeue some pages temporarily empty 
>> due
>> + * to compaction isolated pages.
>> + *
>> + * Return: number of pages that were added to the @pages list.
>> + */
>> +size_t balloon_page_list_dequeue(struct balloon_dev_info *b_dev_info,
>> + struct list_head *pages, size_t n_req_pages)
>> +{
>> +struct page *page, *tmp;
>> +unsigned long flags;
>> +size_t n_pages = 0;
>> +
>> +spin_lock_irqsave(&b_dev_info->pages_lock, flags);
>> +list_for_each_entry_safe(page, tmp, &b_dev_info->pages, lru) {
>> +if (n_pages == n_req_pages)
>> +

[PATCH v3 1/4] mm/balloon_compaction: list interfaces

2019-04-24 Thread Nadav Amit via Virtualization
Introduce interfaces for enqueueing and dequeueing a list of balloon
pages. These interfaces reduce the overhead of storing and restoring
IRQs by batching the operations. In addition, they do not panic if the
list of pages is empty.

Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: linux...@kvack.org
Cc: virtualization@lists.linux-foundation.org
Reviewed-by: Xavier Deguillard 
Signed-off-by: Nadav Amit 
---
 include/linux/balloon_compaction.h |   4 +
 mm/balloon_compaction.c| 144 +
 2 files changed, 110 insertions(+), 38 deletions(-)

diff --git a/include/linux/balloon_compaction.h 
b/include/linux/balloon_compaction.h
index f111c780ef1d..430b6047cef7 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -64,6 +64,10 @@ extern struct page *balloon_page_alloc(void);
 extern void balloon_page_enqueue(struct balloon_dev_info *b_dev_info,
 struct page *page);
 extern struct page *balloon_page_dequeue(struct balloon_dev_info *b_dev_info);
+extern size_t balloon_page_list_enqueue(struct balloon_dev_info *b_dev_info,
+ struct list_head *pages);
+extern size_t balloon_page_list_dequeue(struct balloon_dev_info *b_dev_info,
+struct list_head *pages, size_t 
n_req_pages);
 
 static inline void balloon_devinfo_init(struct balloon_dev_info *balloon)
 {
diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index ef858d547e2d..a2995002edc2 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -10,6 +10,105 @@
 #include 
 #include 
 
+static void balloon_page_enqueue_one(struct balloon_dev_info *b_dev_info,
+struct page *page)
+{
+   /*
+* Block others from accessing the 'page' when we get around to
+* establishing additional references. We should be the only one
+* holding a reference to the 'page' at this point. If we are not, then
+* memory corruption is possible and we should stop execution.
+*/
+   BUG_ON(!trylock_page(page));
+   list_del(&page->lru);
+   balloon_page_insert(b_dev_info, page);
+   unlock_page(page);
+   __count_vm_event(BALLOON_INFLATE);
+}
+
+/**
+ * balloon_page_list_enqueue() - inserts a list of pages into the balloon page
+ *  list.
+ * @b_dev_info: balloon device descriptor where we will insert a new page to
+ * @pages: pages to enqueue - allocated using balloon_page_alloc.
+ *
+ * Driver must call it to properly enqueue balloon pages before definitively
+ * removing them from the guest system.
+ *
+ * Return: number of pages that were enqueued.
+ */
+size_t balloon_page_list_enqueue(struct balloon_dev_info *b_dev_info,
+struct list_head *pages)
+{
+   struct page *page, *tmp;
+   unsigned long flags;
+   size_t n_pages = 0;
+
+   spin_lock_irqsave(&b_dev_info->pages_lock, flags);
+   list_for_each_entry_safe(page, tmp, pages, lru) {
+   balloon_page_enqueue_one(b_dev_info, page);
+   n_pages++;
+   }
+   spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
+   return n_pages;
+}
+EXPORT_SYMBOL_GPL(balloon_page_list_enqueue);
+
+/**
+ * balloon_page_list_dequeue() - removes pages from balloon's page list and
+ *  returns a list of the pages.
+ * @b_dev_info: balloon device descriptor where we will grab a page from.
+ * @pages: pointer to the list of pages that would be returned to the caller.
+ * @n_req_pages: number of requested pages.
+ *
+ * Driver must call this function to properly de-allocate previously enlisted
+ * balloon pages before definitively releasing them back to the guest system.
+ * This function tries to remove @n_req_pages from the ballooned pages and
+ * return them to the caller in the @pages list.
+ *
+ * Note that this function may fail to dequeue some pages even if the balloon
+ * isn't empty, since the page list can be temporarily empty due to the
+ * compaction of isolated pages.
+ *
+ * Return: number of pages that were added to the @pages list.
+ */
+size_t balloon_page_list_dequeue(struct balloon_dev_info *b_dev_info,
+struct list_head *pages, size_t n_req_pages)
+{
+   struct page *page, *tmp;
+   unsigned long flags;
+   size_t n_pages = 0;
+
+   spin_lock_irqsave(&b_dev_info->pages_lock, flags);
+   list_for_each_entry_safe(page, tmp, &b_dev_info->pages, lru) {
+   if (n_pages == n_req_pages)
+   break;
+
+   /*
+* Block others from accessing the 'page' while we get around
+* establishing additional references and preparing the 'page'
+* to be released by the balloon driver.
+*/
+   if (!trylock_page(page))
+   continue;
+
+   if (IS_ENABLED(CONFIG_BALLOON_COMPACTION) &&
+   

[PATCH v3 2/4] vmw_balloon: compaction support

2019-04-24 Thread Nadav Amit via Virtualization
Add support for compaction to the VMware balloon. Unlike the virtio
balloon, we also support huge pages, which do not go through
compaction, so we keep these pages in vmballoon and handle their list
separately. We use the same lock to protect both lists, as this lock is
not supposed to be contended.

Doing so also eliminates the need for the page_size lists. We update the
accounting as needed so that inflation, deflation and migration are
reflected in vmstat.

Since VMware balloon now provides statistics for inflation, deflation
and migration in vmstat, select MEMORY_BALLOON in Kconfig.

Reviewed-by: Xavier Deguillard 
Signed-off-by: Nadav Amit 
---
 drivers/misc/Kconfig   |   1 +
 drivers/misc/vmw_balloon.c | 301 -
 2 files changed, 264 insertions(+), 38 deletions(-)

diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index 42ab8ec92a04..427cf10579b4 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -420,6 +420,7 @@ config SPEAR13XX_PCIE_GADGET
 config VMWARE_BALLOON
tristate "VMware Balloon Driver"
depends on VMWARE_VMCI && X86 && HYPERVISOR_GUEST
+   select MEMORY_BALLOON
help
  This is VMware physical memory management driver which acts
  like a "balloon" that can be inflated to reclaim physical pages
diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c
index ad807d5a3141..2136f6ad97d3 100644
--- a/drivers/misc/vmw_balloon.c
+++ b/drivers/misc/vmw_balloon.c
@@ -28,6 +28,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -38,25 +40,11 @@ MODULE_ALIAS("dmi:*:svnVMware*:*");
 MODULE_ALIAS("vmware_vmmemctl");
 MODULE_LICENSE("GPL");
 
-/*
- * Use __GFP_HIGHMEM to allow pages from HIGHMEM zone. We don't allow wait
- * (__GFP_RECLAIM) for huge page allocations. Use __GFP_NOWARN, to suppress 
page
- * allocation failure warnings. Disallow access to emergency low-memory pools.
- */
-#define VMW_HUGE_PAGE_ALLOC_FLAGS  (__GFP_HIGHMEM|__GFP_NOWARN|\
-__GFP_NOMEMALLOC)
-
-/*
- * Use __GFP_HIGHMEM to allow pages from HIGHMEM zone. We allow lightweight
- * reclamation (__GFP_NORETRY). Use __GFP_NOWARN, to suppress page allocation
- * failure warnings. Disallow access to emergency low-memory pools.
- */
-#define VMW_PAGE_ALLOC_FLAGS   (__GFP_HIGHMEM|__GFP_NOWARN|\
-__GFP_NOMEMALLOC|__GFP_NORETRY)
-
-/* Maximum number of refused pages we accumulate during inflation cycle */
 #define VMW_BALLOON_MAX_REFUSED16
 
+/* Magic number for the balloon mount-point */
+#define BALLOON_VMW_MAGIC  0x0ba11007
+
 /*
  * Hypervisor communication port definitions.
  */
@@ -247,11 +235,6 @@ struct vmballoon_ctl {
enum vmballoon_op op;
 };
 
-struct vmballoon_page_size {
-   /* list of reserved physical pages */
-   struct list_head pages;
-};
-
 /**
  * struct vmballoon_batch_entry - a batch entry for lock or unlock.
  *
@@ -266,8 +249,6 @@ struct vmballoon_batch_entry {
 } __packed;
 
 struct vmballoon {
-   struct vmballoon_page_size page_sizes[VMW_BALLOON_NUM_PAGE_SIZES];
-
/**
 * @max_page_size: maximum supported page size for ballooning.
 *
@@ -348,8 +329,20 @@ struct vmballoon {
struct dentry *dbg_entry;
 #endif
 
+   /**
+* @b_dev_info: balloon device information descriptor.
+*/
+   struct balloon_dev_info b_dev_info;
+
struct delayed_work dwork;
 
+   /**
+* @huge_pages - list of the inflated 2MB pages.
+*
+* Protected by @b_dev_info.pages_lock .
+*/
+   struct list_head huge_pages;
+
/**
 * @vmci_doorbell.
 *
@@ -643,10 +636,10 @@ static int vmballoon_alloc_page_list(struct vmballoon *b,
 
for (i = 0; i < req_n_pages; i++) {
if (ctl->page_size == VMW_BALLOON_2M_PAGE)
-   page = alloc_pages(VMW_HUGE_PAGE_ALLOC_FLAGS,
-  VMW_BALLOON_2M_ORDER);
+   page = alloc_pages(__GFP_HIGHMEM|__GFP_NOWARN|
+   __GFP_NOMEMALLOC, VMW_BALLOON_2M_ORDER);
else
-   page = alloc_page(VMW_PAGE_ALLOC_FLAGS);
+   page = balloon_page_alloc();
 
/* Update statistics */
vmballoon_stats_page_inc(b, VMW_BALLOON_PAGE_STAT_ALLOC,
@@ -961,9 +954,22 @@ static void vmballoon_enqueue_page_list(struct vmballoon 
*b,
unsigned int *n_pages,
enum vmballoon_page_size_type page_size)
 {
-   struct vmballoon_page_size *page_size_info = &b->page_sizes[page_size];
+   unsigned long flags;
+
+   if (page_size == VMW_BALLOON_4K_PAGE) {
+   balloon_page_list_enqueue(&b->b_dev_info, pages);
+   } else {
+  

[PATCH v3 3/4] vmw_balloon: add memory shrinker

2019-04-24 Thread Nadav Amit via Virtualization
Add a shrinker to the VMware balloon to prevent out-of-memory events.
We reuse the deflate logic for this matter. Deadlocks should not happen,
as no memory allocation is performed while the locks of the
communication (batch/page) and page-list are taken. In the unlikely
event that the configuration semaphore is taken for write, we bail
out and fail gracefully (causing processes to be killed).

Once the shrinker is called, inflation is postponed for a few seconds.
The timeout is updated without any lock, but this should not cause any
races, as it is written and read atomically.

This feature is disabled by default, since it might cause performance
degradation.

Reviewed-by: Xavier Deguillard 
Signed-off-by: Nadav Amit 
---
 drivers/misc/vmw_balloon.c | 133 -
 1 file changed, 131 insertions(+), 2 deletions(-)

diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c
index 2136f6ad97d3..4b5e939ff4c8 100644
--- a/drivers/misc/vmw_balloon.c
+++ b/drivers/misc/vmw_balloon.c
@@ -40,6 +40,15 @@ MODULE_ALIAS("dmi:*:svnVMware*:*");
 MODULE_ALIAS("vmware_vmmemctl");
 MODULE_LICENSE("GPL");
 
+static bool __read_mostly vmwballoon_shrinker_enable;
+module_param(vmwballoon_shrinker_enable, bool, 0444);
+MODULE_PARM_DESC(vmwballoon_shrinker_enable,
+   "Enable non-cooperative out-of-memory protection. Disabled by default 
as it may degrade performance.");
+
+/* Delay in seconds after shrink before inflation. */
+#define VMBALLOON_SHRINK_DELAY (5)
+
+/* Maximum number of refused pages we accumulate during inflation cycle */
 #define VMW_BALLOON_MAX_REFUSED16
 
 /* Magic number for the balloon mount-point */
@@ -217,12 +226,13 @@ enum vmballoon_stat_general {
VMW_BALLOON_STAT_TIMER,
VMW_BALLOON_STAT_DOORBELL,
VMW_BALLOON_STAT_RESET,
-   VMW_BALLOON_STAT_LAST = VMW_BALLOON_STAT_RESET
+   VMW_BALLOON_STAT_SHRINK,
+   VMW_BALLOON_STAT_SHRINK_FREE,
+   VMW_BALLOON_STAT_LAST = VMW_BALLOON_STAT_SHRINK_FREE
 };
 
 #define VMW_BALLOON_STAT_NUM   (VMW_BALLOON_STAT_LAST + 1)
 
-
 static DEFINE_STATIC_KEY_TRUE(vmw_balloon_batching);
 static DEFINE_STATIC_KEY_FALSE(balloon_stat_enabled);
 
@@ -321,6 +331,15 @@ struct vmballoon {
 */
struct page *page;
 
+   /**
+* @shrink_timeout: timeout until the next inflation.
+*
+* After a shrink event, indicates the time in jiffies after which
+* inflation is allowed again. Can be written concurrently with reads,
+* so must use READ_ONCE/WRITE_ONCE when accessing.
+*/
+   unsigned long shrink_timeout;
+
/* statistics */
struct vmballoon_stats *stats;
 
@@ -361,6 +380,20 @@ struct vmballoon {
 * Lock ordering: @conf_sem -> @comm_lock .
 */
spinlock_t comm_lock;
+
+   /**
+* @shrinker: shrinker interface that is used to avoid over-inflation.
+*/
+   struct shrinker shrinker;
+
+   /**
+* @shrinker_registered: whether the shrinker was registered.
+*
+* The shrinker interface does not handle gracefully the removal of
+* shrinker that was not registered before. This indication allows to
+* simplify the unregistration process.
+*/
+   bool shrinker_registered;
 };
 
 static struct vmballoon balloon;
@@ -935,6 +968,10 @@ static int64_t vmballoon_change(struct vmballoon *b)
size - target < vmballoon_page_in_frames(VMW_BALLOON_2M_PAGE))
return 0;
 
+   /* If an out-of-memory recently occurred, inflation is disallowed. */
+   if (target > size && time_before(jiffies, READ_ONCE(b->shrink_timeout)))
+   return 0;
+
return target - size;
 }
 
@@ -1430,6 +1467,90 @@ static void vmballoon_work(struct work_struct *work)
 
 }
 
+/**
+ * vmballoon_shrinker_scan() - deflate the balloon due to memory pressure.
+ * @shrinker: pointer to the balloon shrinker.
+ * @sc: page reclaim information.
+ *
+ * Returns: number of pages that were freed during deflation.
+ */
+static unsigned long vmballoon_shrinker_scan(struct shrinker *shrinker,
+struct shrink_control *sc)
+{
+   struct vmballoon *b = &balloon;
+   unsigned long deflated_frames;
+
+   pr_debug("%s - size: %llu", __func__, atomic64_read(&b->size));
+
+   vmballoon_stats_gen_inc(b, VMW_BALLOON_STAT_SHRINK);
+
+   /*
+* If the lock is also contended for read, we cannot easily reclaim and
+* we bail out.
+*/
+   if (!down_read_trylock(&b->conf_sem))
+   return 0;
+
+   deflated_frames = vmballoon_deflate(b, sc->nr_to_scan, true);
+
+   vmballoon_stats_gen_add(b, VMW_BALLOON_STAT_SHRINK_FREE,
+   deflated_frames);
+
+   /*
+* Delay future inflation for some time to mitigate the situations in
+* which balloon continuously grows and shrinks. Use 

[PATCH v3 4/4] vmw_balloon: split refused pages

2019-04-24 Thread Nadav Amit via Virtualization
The hypervisor might refuse to inflate pages. While the balloon driver
handles this scenario correctly, a refusal to inflate a 2MB page might
cause the same page to be allocated again later just for its inflation
to be refused again. This wastes energy and time.

To avoid this situation, split the 2MB page to 4KB pages, and then try
to inflate each one individually. Most of the 4KB pages out of the 2MB
should be inflated successfully, and the balloon is likely to prevent
the scenario of repeated refused inflation.

Reviewed-by: Xavier Deguillard 
Signed-off-by: Nadav Amit 
---
 drivers/misc/vmw_balloon.c | 63 +++---
 1 file changed, 52 insertions(+), 11 deletions(-)

diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c
index 4b5e939ff4c8..043eed845246 100644
--- a/drivers/misc/vmw_balloon.c
+++ b/drivers/misc/vmw_balloon.c
@@ -239,6 +239,7 @@ static DEFINE_STATIC_KEY_FALSE(balloon_stat_enabled);
 struct vmballoon_ctl {
struct list_head pages;
struct list_head refused_pages;
+   struct list_head prealloc_pages;
unsigned int n_refused_pages;
unsigned int n_pages;
enum vmballoon_page_size_type page_size;
@@ -668,15 +669,25 @@ static int vmballoon_alloc_page_list(struct vmballoon *b,
unsigned int i;
 
for (i = 0; i < req_n_pages; i++) {
-   if (ctl->page_size == VMW_BALLOON_2M_PAGE)
-   page = alloc_pages(__GFP_HIGHMEM|__GFP_NOWARN|
+   /*
+* First check if we happen to have pages that were allocated
+* before. This happens when a 2MB page was rejected during
+* inflation by the hypervisor and then split into 4KB pages.
+*/
+   if (!list_empty(&ctl->prealloc_pages)) {
+   page = list_first_entry(&ctl->prealloc_pages,
+   struct page, lru);
+   list_del(&page->lru);
+   } else {
+   if (ctl->page_size == VMW_BALLOON_2M_PAGE)
+   page = alloc_pages(__GFP_HIGHMEM|__GFP_NOWARN|
__GFP_NOMEMALLOC, VMW_BALLOON_2M_ORDER);
-   else
-   page = balloon_page_alloc();
+   else
+   page = balloon_page_alloc();
 
-   /* Update statistics */
-   vmballoon_stats_page_inc(b, VMW_BALLOON_PAGE_STAT_ALLOC,
-ctl->page_size);
+   vmballoon_stats_page_inc(b, VMW_BALLOON_PAGE_STAT_ALLOC,
+ctl->page_size);
+   }
 
if (page) {
vmballoon_mark_page_offline(page, ctl->page_size);
@@ -922,7 +933,8 @@ static void vmballoon_release_page_list(struct list_head 
*page_list,
__free_pages(page, vmballoon_page_order(page_size));
}
 
-   *n_pages = 0;
+   if (n_pages)
+   *n_pages = 0;
 }
 
 
@@ -1054,6 +1066,32 @@ static void vmballoon_dequeue_page_list(struct vmballoon 
*b,
*n_pages = i;
 }
 
+/**
+ * vmballoon_split_refused_pages() - Split the 2MB refused pages to 4k.
+ *
+ * If inflation of 2MB pages was denied by the hypervisor, it is likely to be
+ * due to one or few 4KB pages. These 2MB pages may keep being allocated and
+ * then being refused. To prevent this case, this function splits the refused
+ * pages into 4KB pages and adds them into @prealloc_pages list.
+ *
+ * @ctl: pointer for the %struct vmballoon_ctl, which defines the operation.
+ */
+static void vmballoon_split_refused_pages(struct vmballoon_ctl *ctl)
+{
+   struct page *page, *tmp;
+   unsigned int i, order;
+
+   order = vmballoon_page_order(ctl->page_size);
+
+   list_for_each_entry_safe(page, tmp, &ctl->refused_pages, lru) {
+   list_del(&page->lru);
+   split_page(page, order);
+   for (i = 0; i < (1 << order); i++)
+   list_add(&page[i].lru, &ctl->prealloc_pages);
+   }
+   ctl->n_refused_pages = 0;
+}
+
 /**
  * vmballoon_inflate() - Inflate the balloon towards its target size.
  *
@@ -1065,6 +1103,7 @@ static void vmballoon_inflate(struct vmballoon *b)
struct vmballoon_ctl ctl = {
.pages = LIST_HEAD_INIT(ctl.pages),
.refused_pages = LIST_HEAD_INIT(ctl.refused_pages),
+   .prealloc_pages = LIST_HEAD_INIT(ctl.prealloc_pages),
.page_size = b->max_page_size,
.op = VMW_BALLOON_INFLATE
};
@@ -1112,10 +1151,10 @@ static void vmballoon_inflate(struct vmballoon *b)
break;
 
/*
-* Ignore errors from locking as we now switch to 4k
-* pages and we might get different errors.
+* Split the refused pages to 4k. 

[PATCH v3 0/4] vmw_balloon: compaction and shrinker support

2019-04-24 Thread Nadav Amit via Virtualization
VMware balloon enhancements: adding support for memory compaction,
memory shrinker (to prevent OOM) and splitting of refused pages to
prevent recurring inflations.

Patches 1-2: Support for compaction
Patch 3: Support for memory shrinker - disabled by default
Patch 4: Split refused pages to improve performance

v2->v3:
* Fixing wrong argument type (int->size_t) [Michael]
* Fixing a comment (it) [Michael]
* Reinstating the BUG_ON() when page is locked [Michael] 

v1->v2:
* Return number of pages in list enqueue/dequeue interfaces [Michael]
* Removed first two patches which were already merged

Nadav Amit (4):
  mm/balloon_compaction: list interfaces
  vmw_balloon: compaction support
  vmw_balloon: add memory shrinker
  vmw_balloon: split refused pages

 drivers/misc/Kconfig   |   1 +
 drivers/misc/vmw_balloon.c | 489 ++---
 include/linux/balloon_compaction.h |   4 +
 mm/balloon_compaction.c| 144 ++---
 4 files changed, 553 insertions(+), 85 deletions(-)

-- 
2.19.1
