Re: [PATCH 2/2] x86/paravirt: Make the struct paravirt_patch_site packed

2023-06-09 Thread Nadav Amit via Virtualization


> On Jun 9, 2023, at 2:45 AM, Hou Wenlong  wrote:
> 
> Similar to struct alt_instr, make the struct paravirt_patch_site packed
> and get rid of all the .align directives. This could save 2 bytes for
> one entry on X86_64.

Thanks for taking care of it.
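For reference, this is roughly the layout in question (a sketch, not a quote of the patch; the s32 field assumes the relative-reference change from earlier in the series):

/*
 * Sketch: a 4-byte relative reference plus two u8 fields. The natural
 * 4-byte alignment pads the entry to 8 bytes; __packed drops the padding,
 * so each entry is 6 bytes - the 2 bytes saved mentioned above.
 */
struct paravirt_patch_site {
        s32 instr_offset;       /* original instruction, as a relative offset */
        u8  type;               /* type of this instruction */
        u8  len;                /* length of original instruction */
} __packed;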



Re: [PATCH RFC 13/43] x86/paravirt: Use relative reference for original instruction

2023-06-05 Thread Nadav Amit via Virtualization



> On Jun 1, 2023, at 2:29 AM, Juergen Gross  wrote:
> 
> On 28.04.23 11:50, Hou Wenlong wrote:
>> Similar to the alternative patching, use relative reference for original
>> instruction rather than absolute one, which saves 8 bytes for one entry
>> on x86_64.  And it could generate R_X86_64_PC32 relocation instead of
>> R_X86_64_64 relocation, which also reduces relocation metadata on
>> relocatable builds. And the alignment could be hard coded to be 4 now.
>> Signed-off-by: Hou Wenlong 
>> Cc: Thomas Garnier 
>> Cc: Lai Jiangshan 
>> Cc: Kees Cook 
> 
> Reviewed-by: Juergen Gross 
> 
> I think this patch should be taken even without the series.

It looks good to me, I am just not sure why the alignment is needed
at all.

Why not make the struct __packed (like struct alt_instr) and get rid
of all the .align directives? Am I missing something?
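For reference, the win from the relative reference is roughly the following (a sketch modeled on how struct alt_instr is handled, not the actual hunk): the entry stores a 32-bit displacement from its own address, so the patching code recovers the original instruction address with simple pointer arithmetic, and the toolchain only needs an R_X86_64_PC32 relocation for it.

/* Sketch, assuming a field named instr_offset as in struct alt_instr: */
static inline u8 *paravirt_patch_instr(struct paravirt_patch_site *p)
{
        /* PC-relative: the field's own address plus the stored displacement */
        return (u8 *)&p->instr_offset + p->instr_offset;
}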



Re: RFC [PATCH v4 2/7] Enable balloon drivers to report inflated memory

2022-10-10 Thread Nadav Amit
On Oct 10, 2022, at 12:24 AM, Alexander Atanasov wrote:

> Hello,
> 
> On 10.10.22 9:18, Nadav Amit wrote:
>> On Oct 7, 2022, at 3:58 AM, Alexander Atanasov 
>>  wrote:
>>> So with all balloon drivers the guest gets a large amount of RAM on boot,
>>> then the balloon is inflated. But these places have already been initialized
>>> and they think the system has the boot-time amount of totalram, which is not
>>> true the moment they start to operate. The result is that too much space gets
>>> used and it degrades userspace performance.
>>> Example - fs/eventpoll.c: static int __init eventpoll_init(void) - 4% of RAM
>>> for eventpoll - when you inflate half of the RAM it becomes 8% of the RAM -
>>> do you really need 8% of your RAM to be used for eventpoll?
>>> 
>>> To solve this you need to register and, when notified, update cache sizes,
>>> limits and whatever else is calculated from the amount of memory.
>> Hmm.. Not sure about all of that. Most balloon drivers are manually managed,
>> and call adjust_managed_page_count(), and as a result might want to redo
>> all the calculations that are based on totalram_pages().
> 
> Yes, I would say that it looks mixed - manual for large changes and
> automatic for small changes. VMware and Hyper-V have automatic and manual
> /not sure exactly what you can change on a running VM but I guess you can/ -
> virtio is only manual. I do not know about dlpar / xen.
> 
> The scenario is like this: start a VM with 4GB of RAM, reduce it to 2GB with
> the balloon - the VM can be upgraded later.
> 
> All we are talking about relates to memory hotplug/unplug /where unplug is
> close to nonexistent, hence the balloons are used/.
> 
> All values should be recalculated on memory hotplug too, so you can use the 
> newly available RAM.
> 
> RAM is the most valuable resource of all, so I consider using it optimally to
> be of great importance.
> 
>> Side-note: That's not the case for the VMware balloon. I actually considered
>> calling adjust_managed_page_count() just to conform with other balloon
>> drivers. But since we use totalram_pages() to communicate the total RAM to
>> the hypervisor, this would create an endless (and wrong) feedback loop. I am
>> not claiming it is impossible for the VMware balloon driver to call
>> adjust_managed_page_count(), but the chances are that it would create more
>> harm than good.
> 
> Virtio does both - depending on the deflate-on-OOM option. I have already
> suggested unifying all drivers to inflate the used memory, as it seems more
> logical to me since nobody expects totalram_pages() to change, but the
> current state is that both ways are accepted and changing it can break
> existing users.
> See  discussion here 
> https://lore.kernel.org/lkml/20220809095358.2203355-1-alexander.atana...@virtuozzo.com/.

Thanks for the reminder. I wish you could somehow summarize all of that into the
cover-letter and/or the commit messages for these patches.

> 
> 
>> Back to the matter at hand. It seems that you wish that the notifiers would
>> be called following any changes that would be reflected in totalram_pages().
>> So, doesn't it make more sense to call it from adjust_managed_page_count() ?
> 
> It will hurt performance - all drivers work page by page, i.e. they update
> by +1/-1, and they do so under locks which, as you already noted, can lead to
> bad things. The notifier will accumulate the change and let its users know how
> much changed, so they can decide if they have to recalculate - it can even do
> so asynchronously in order to not disturb the drivers.

So updating the counters by 1 is ok (using atomic operation, which is not
free)? And the reason it is (relatively) cheap is because nobody actually
looks at the value (i.e., nobody actually acts on the value)?

If nobody considers the value, then doesn’t it make sense just to update it
less frequently, and then call the notifiers?
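To make the question concrete, this is the kind of batching I have in mind (purely illustrative - the helper, the counter, the threshold and BALLOON_CHANGED are made-up names, not an existing API):

/* Illustrative sketch only. */
static atomic_long_t balloon_pending_pages;     /* delta not yet published */

static void balloon_account_pages(long nr_pages, long threshold)
{
        long pending = atomic_long_add_return(nr_pages, &balloon_pending_pages);

        if (pending > -threshold && pending < threshold)
                return;                 /* common case: no notification */

        /* publish the batched delta and let the notifiers recalculate */
        atomic_long_sub(pending, &balloon_pending_pages);
        atomic_long_add(pending, &mem_balloon_inflated_pages);
        balloon_notify(BALLOON_CHANGED);
}

The drivers would keep their per-page loops exactly as they are; only the published counter and the notifier invocation are rate-limited.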

>>> The difference is here:
>>> 
>>> mm/zswap.c: return totalram_pages() * zswap_max_pool_percent / 100 <
>>> mm/zswap.c: return totalram_pages() * zswap_accept_thr_percent / 100
>>> uses percents and you can recalculate easy with
>>> 
>>> +static inline unsigned long totalram_pages_current(void)
>>> +{
>>> +   unsigned long inflated = 0;
>>> +#ifdef CONFIG_MEMORY_BALLOON
>>> +   extern atomic_long_t mem_balloon_inflated_free_kb;
>>> +   inflated = atomic_long_read(&mem_balloon_inflated_free_kb);
>>> +   inflated >>= (PAGE_SHIFT - 10);
>>> +#endif
>>> +   return (unsigned long)atomic_long_read(&_totalram_pages) - inflated;
>>> +}
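If a helper like that existed, the percent-based sites could be switched over almost mechanically; for instance, the zswap check quoted above would become something like this (illustrative only, not a real patch):

/* Hypothetical change to the zswap limit check quoted above: */
static bool zswap_is_full(void)
{
        return totalram_pages_current() * zswap_max_pool_percent / 100 <
               DIV_ROUND_UP(zswap_pool_total_size, PAGE_SIZE);
}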

So w

Re: RFC [PATCH v4 2/7] Enable balloon drivers to report inflated memory

2022-10-10 Thread Nadav Amit
On Oct 7, 2022, at 3:58 AM, Alexander Atanasov wrote:

> On 7.10.22 0:07, Nadav Amit wrote:
>>>> 
>>>> I was looking through the series and I did not see actual users of the
>>>> notifier. Usually, it is not great to build an API without users.
>>> 
>>> 
>>> You are right. I hope to get some feedback/interest from potential users
>>> that I mentioned in the cover letter. I will probably split the notifier
>>> into a separate series. To make it useful it will require more changes.
>>> See below for more about them.
>> Fair, but this is something that is more suitable for RFC. Otherwise, more
>> likely than not - your patches would go in as is.
> 
> Yes, I will remove the notifier and resend both as RFC. I think that every
> patch is an RFC, and the RFC tag is used for more general changes that could
> affect unexpected areas, change functionality, change design and in general
> can have a bigger impact. In this case it adds functionality that
> is missing and it could hardly affect anything else.
> In essence it provides information that you cannot get without it.
> But I will take your advice and push everything through RFC from now on.

Just keep the version numbers as you had before. That's fine and better to
prevent confusion.

>>>> [snip]
>>>>> +
>>>>> +static int balloon_notify(unsigned long val)
>>>>> +{
>>>>> + return srcu_notifier_call_chain(_chain, val, NULL);
>>>> Since you know the inflated_kb value here, why not use it as an argument
>>>> to the callback? I think casting to (void *) and back is best. But you can
>>>> also provide pointer to the value. Doesn’t it sound better than having
>>>> potentially different notifiers reading different values?
>>> 
>>> My current idea is to have a struct with the current and previous value,
>>> maybe the change in percent. The actual value does not matter to anyone
>>> but the size of the change does. When a user gets notified it can act upon
>>> the change - if it is small it can ignore it, if it is above some
>>> threshold it can act - if it makes sense for some receiver it can
>>> accumulate changes from several notifications. Another option/addition is to
>>> have si_meminfo_current(..) and totalram_pages_current(..) that return
>>> values adjusted with the balloon values.
>>> 
>>> Going further - there are a few places that calculate something based on
>>> available memory that do not have a sysfs/proc interface for setting limits.
>>> Most of them work in percents, so they can be converted to redo the
>>> calculations when they get a notification.
>>> 
>>> The ones that have an interface for configuration but use memory values can
>>> be handled in two ways - convert them to use percents of what is available,
>>> or extend the notifier to notify userspace, which in turn does the
>>> calculations and updates the configuration.
>> I really need to see code to fully understand what you have in mind.
> 
> Sure - you can check some of the users with git grep totalram_pages - it shows
> self-explanatory results of usage like:
> fs/f2fs/node.c:bool f2fs_available_free_memory(struct f2fs_sb_info *sbi, int type) - calculations in percents - one good example
> fs/ceph/super.h:congestion_kb = (16*int_sqrt(totalram_pages())) << (PAGE_SHIFT-10);
> fs/fuse/inode.c:*limit = ((totalram_pages() << PAGE_SHIFT) >> 13) / 392;
> fs/nfs/write.c: nfs_congestion_kb = (16*int_sqrt(totalram_pages())) << (PAGE_SHIFT-10);
> fs/nfsd/nfscache.c: unsigned long low_pages = totalram_pages() - totalhigh_pages()
> mm/oom_kill.c:  oc->totalpages = totalram_pages() + total_swap_pages;
> 
> 
> So with all balloon drivers the guest gets a large amount of RAM on boot,
> then the balloon is inflated. But these places have already been initialized
> and they think the system has the boot-time amount of totalram, which is not
> true the moment they start to operate. The result is that too much space gets
> used and it degrades userspace performance.
> Example - fs/eventpoll.c: static int __init eventpoll_init(void) - 4% of RAM
> for eventpoll - when you inflate half of the RAM it becomes 8% of the RAM -
> do you really need 8% of your RAM to be used for eventpoll?
> 
> To solve this you need to register and, when notified, update cache sizes,
> limits and whatever else is calculated from the amount of memory.

Hmm.. Not sure about all of that. Most balloon drivers are manually managed,
and call adjust_managed_page_count(), and as a result might want to redo
all the calculations that are based on totalram_pages().

Re: [PATCH v4 2/7] Enable balloon drivers to report inflated memory

2022-10-06 Thread Nadav Amit
On Oct 6, 2022, at 12:34 AM, Alexander Atanasov wrote:

> Hello,
> 
> 
> On 5.10.22 20:25, Nadav Amit wrote:
>> On Oct 5, 2022, at 2:01 AM, Alexander Atanasov 
>>  wrote:
>>> Add counters to be updated by the balloon drivers.
>>> Create balloon notifier to propagate changes.
>> I missed the other patches before (including this one). Sorry, but next
>> time, please cc me.
> 
> You are CCed in the cover letter since the version. I will add CC to you
> in the individual patches if you want so.

Thanks.

Just to clarify - I am not attacking you. It’s more of me making an excuse
for not addressing some issues in earlier versions.

>> I was looking through the series and I did not see actual users of the
>> notifier. Usually, it is not great to build an API without users.
> 
> 
> You are right. I hope to get some feedback/interest from potential users that
> I mentioned in the cover letter. I will probably split the notifier
> into a separate series. To make it useful it will require more changes.
> See below for more about them.

Fair, but this is something that is more suitable for RFC. Otherwise, more
likely than not - your patches would go in as is.

>> [snip]
>>> +
>>> +static int balloon_notify(unsigned long val)
>>> +{
>>> +   return srcu_notifier_call_chain(_chain, val, NULL);
>> Since you know the inflated_kb value here, why not use it as an argument
>> to the callback? I think casting to (void *) and back is best. But you can
>> also provide pointer to the value. Doesn’t it sound better than having
>> potentially different notifiers reading different values?
> 
> My current idea is to have a struct with the current and previous value,
> maybe the change in percent. The actual value does not matter to anyone
> but the size of the change does. When a user gets notified it can act upon
> the change - if it is small it can ignore it, if it is above some threshold
> it can act - if it makes sense for some receiver it can accumulate changes
> from several notifications. Another option/addition is to have
> si_meminfo_current(..) and totalram_pages_current(..) that return values
> adjusted with the balloon values.
> 
> Going further - there are a few places that calculate something based on
> available memory that do not have a sysfs/proc interface for setting limits.
> Most of them work in percents, so they can be converted to redo the
> calculations when they get a notification.
> 
> The ones that have an interface for configuration but use memory values can be
> handled in two ways - convert them to use percents of what is available, or
> extend the notifier to notify userspace, which in turn does the calculations
> and updates the configuration.

I really need to see code to fully understand what you have in mind.
Division, as you know, is not something that we really want to do very
frequently.

>> Anyhow, without users (actual notifiers) it’s kind of hard to know how
>> reasonable it all is. For instance, is it balloon_notify() supposed to
>> prevent further balloon inflating/deflating until the notifier completes?
> 
> No, we must avoid that at any cost.
> 
>> Accordingly, are callers to balloon_notify() expected to relinquish locks
>> before calling balloon_notify() to prevent deadlocks and high latency?
> 
> My goal is to avoid any possible impact on performance. Drivers are free to
> delay notifications if they get in the way. (I see that I need to move the
> notification after the semaphore in the vmw driver - I missed that - will fix
> it in the next iteration.)
> Deadlocks - that depends on the users, but few to none will likely have to deal
> with common locks.

I will need to see the next version to give better feedback. One more thing
that comes to mind though is whether saving the balloon size in multiple
places (both mem_balloon_inflated_total_kb and each balloon’s accounting) is
the right way. It does not sound very clean.

Two other options are to move *all* the accounting to your new
mem_balloon_inflated_total_kb-like interface or expose some per-balloon
function to get the balloon size (indirect-function-call would likely have
some overhead though).
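To illustrate the second option (all names below are invented, and the indirect-call overhead mentioned above would still apply):

/* Illustrative sketch only. */
struct balloon_size_ops {
        unsigned long (*inflated_kb)(void *data);       /* per-driver callback */
        void *data;
};

static const struct balloon_size_ops *registered_balloon;

unsigned long balloon_inflated_total_kb(void)
{
        const struct balloon_size_ops *ops = READ_ONCE(registered_balloon);

        return ops ? ops->inflated_kb(ops->data) : 0;
}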

Anyhow, I am not crazy about having the same data replicated. Even from a
code-reading standpoint it is not intuitive.


Re: [PATCH v4 2/7] Enable balloon drivers to report inflated memory

2022-10-05 Thread Nadav Amit
On Oct 5, 2022, at 2:01 AM, Alexander Atanasov wrote:

> Add counters to be updated by the balloon drivers.
> Create balloon notifier to propagate changes.

I missed the other patches before (including this one). Sorry, but next
time, please cc me.

I was looking through the series and I did not see actual users of the
notifier. Usually, it is not great to build an API without users.

[snip] 

> +
> +static int balloon_notify(unsigned long val)
> +{
> + return srcu_notifier_call_chain(_chain, val, NULL);

Since you know the inflated_kb value here, why not use it as an argument
to the callback? I think casting to (void *) and back is best. But you can
also provide pointer to the value. Doesn’t it sound better than having
potentially different notifiers reading different values?
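Concretely, something along these lines (a sketch; the chain name, the callback and the event value are placeholders):

/* Caller side: pass the value itself, no extra storage needed. */
static int balloon_notify(unsigned long val, unsigned long inflated_kb)
{
        return srcu_notifier_call_chain(&balloon_chain, val,
                                        (void *)inflated_kb);
}

/* Receiver side: cast it back in the callback. */
static int example_balloon_cb(struct notifier_block *nb,
                              unsigned long action, void *data)
{
        unsigned long inflated_kb = (unsigned long)data;

        pr_debug("balloon now at %lu kB\n", inflated_kb);
        /* recalculate whatever depends on the amount of available memory */
        return NOTIFY_OK;
}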

Anyhow, without users (actual notifiers) it’s kind of hard to know how
reasonable it all is. For instance, is balloon_notify() supposed to
prevent further balloon inflating/deflating until the notifier completes?
Accordingly, are callers to balloon_notify() expected to relinquish locks
before calling balloon_notify() to prevent deadlocks and high latency?


Re: [PATCH 2/3] MAINTAINERS: Change status of some VMware drivers

2022-09-06 Thread Nadav Amit via Virtualization
On Sep 6, 2022, at 10:27 AM, Vishnu Dasa  wrote:

> From: Vishnu Dasa 
> 
> Change the status from 'Maintained' to 'Supported' for VMWARE
> BALLOON DRIVER, VMWARE PVRDMA DRIVER, VMWARE PVSCSI driver,
> VMWARE VMCI DRIVER, VMWARE VMMOUSE SUBDRIVER and VMWARE VMXNET3
> ETHERNET DRIVER.
> 
> This needs to be done to conform to the guidelines in [1].
> Maintainers for these drivers are VMware employees.
> 
> [1] https://docs.kernel.org/process/maintainers.html
> 
> Signed-off-by: Vishnu Dasa 
> ---
> MAINTAINERS | 12 ++--
> 1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index b75eb23a099b..5a634b5d6f6c 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -21812,7 +21812,7 @@ VMWARE BALLOON DRIVER
> M:Nadav Amit 
> R:VMware PV-Drivers Reviewers 
> L:linux-ker...@vger.kernel.org
> -S:   Maintained
> +S:   Supported
> F:drivers/misc/vmw_balloon.c

Acked-by: Nadav Amit 



Re: [PATCH v3 1/4] Make place for common balloon code

2022-08-22 Thread Nadav Amit via Virtualization
On Aug 22, 2022, at 4:37 AM, Alexander Atanasov wrote:

> mm/balloon_compaction.c -> mm/balloon.c
> The file already contains code that is common among balloon
> drivers, so rename it to reflect its contents.
> 
> include/linux/balloon_compaction.h -> include/linux/balloon.h
> Remove it from files which do not actually use it.
> Drop externs from function declarations.
> 
> Signed-off-by: Alexander Atanasov 

Makes so much sense.

Acked-by: Nadav Amit 




Re: Potential deadlock in Linux VMCI driver

2022-08-16 Thread Nadav Amit via Virtualization
On Aug 16, 2022, at 11:23 AM, Nadav Amit  wrote:

> During my development, I enabled some Linux kernel checkers, specifically
> the “sleep in atomic” checker.
> 
> I ran into an unrelated issue that appears to be a result of commit
> 463713eb6164b6 ("VMCI: dma dg: add support for DMA datagrams receive”).
> IIUC, vmci_read_data() calls wait_event(), which is not allowed while IRQs
> are disabled, which they are during IRQ handling.

Just minor correction of myself: IRQ are not disabled, preemption is
disabled.

> 
> [   22.653012] Preemption disabled at:
> [   22.653017] __do_softirq (kernel/softirq.c:504 kernel/softirq.c:548) 


Potential deadlock in Linux VMCI driver

2022-08-16 Thread Nadav Amit via Virtualization
During my development, I enabled some Linux kernel checkers, specifically
the “sleep in atomic” checker.

I ran into an unrelated issue that appears to be a result of commit
463713eb6164b6 ("VMCI: dma dg: add support for DMA datagrams receive”).
IIUC, vmci_read_data() calls wait_event(), which is not allowed while IRQs
are disabled, which they are during IRQ handling.
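In other words, the pattern boils down to something like this (a distilled illustration, not the actual VMCI code):

#include <linux/interrupt.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(example_wq);
static bool example_data_ready;

/* Runs as a tasklet, i.e. in softirq context with preemption disabled. */
static void example_tasklet_fn(struct tasklet_struct *t)
{
        /*
         * wait_event() may sleep, so calling it here trips might_sleep()
         * and, with CONFIG_DEBUG_ATOMIC_SLEEP=y, produces a splat like the
         * one below.
         */
        wait_event(example_wq, example_data_ready);
}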

I think "CONFIG_DEBUG_ATOMIC_SLEEP=y" is the one that triggers the warning
below, which indicates a deadlock is possible.

The splat below (after decoding) was experienced on Linux 5.19. Let me know
if you need me to open a bug in bugzilla or whether this issue is already
known.


[   22.629691] BUG: sleeping function called from invalid context at 
drivers/misc/vmw_vmci/vmci_guest.c:145
[   22.633894] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 775, 
name: cloud-init
[   22.638232] preempt_count: 100, expected: 0
[   22.641887] RCU nest depth: 0, expected: 0
[   22.645461] 1 lock held by cloud-init/775:
[   22.649013] #0: 88810e057200 (>i_mutex_dir_key#6){}-{3:3}, at: 
iterate_dir (fs/readdir.c:46) 
[   22.653012] Preemption disabled at:
[   22.653017] __do_softirq (kernel/softirq.c:504 kernel/softirq.c:548) 
[   22.660264] CPU: 3 PID: 775 Comm: cloud-init Not tainted 5.19.0+ #3
[   22.664004] Hardware name: VMware, Inc. VMware20,1/440BX Desktop Reference 
Platform, BIOS VMW201.00V.20253199.B64.2208081742 08/08/2022
[   22.671600] Call Trace:
[   22.675165]  
[   22.678681] dump_stack_lvl (lib/dump_stack.c:107 (discriminator 4)) 
[   22.682303] dump_stack (lib/dump_stack.c:114) 
[   22.685883] __might_resched.cold (kernel/sched/core.c:9822) 
[   22.689500] __might_sleep (kernel/sched/core.c:9751 (discriminator 14)) 
[   22.692961] vmci_read_data (./include/linux/kernel.h:110 
drivers/misc/vmw_vmci/vmci_guest.c:145) vmw_vmci
[   22.696461] ? vmci_interrupt_bm (drivers/misc/vmw_vmci/vmci_guest.c:121) 
vmw_vmci
[   22.699920] ? __this_cpu_preempt_check (lib/smp_processor_id.c:67) 
[   22.703305] ? wake_up_var (./include/linux/list.h:292 
./include/linux/wait.h:129 kernel/sched/wait_bit.c:125 
kernel/sched/wait_bit.c:193) 
[   22.706526] ? cpuusage_read (kernel/sched/wait_bit.c:192) 
[   22.709682] ? mark_held_locks (kernel/locking/lockdep.c:4234) 
[   22.712779] vmci_dispatch_dgs (drivers/misc/vmw_vmci/vmci_guest.c:332) 
vmw_vmci
[   22.715923] tasklet_action_common.constprop.0 (kernel/softirq.c:799) 
[   22.719008] ? vmci_read_data (drivers/misc/vmw_vmci/vmci_guest.c:308) 
vmw_vmci
[   22.722018] tasklet_action (kernel/softirq.c:819) 
[   22.724865] __do_softirq (kernel/softirq.c:571) 
[   22.727650] __irq_exit_rcu (kernel/softirq.c:445 kernel/softirq.c:650) 
[   22.730348] irq_exit_rcu (kernel/softirq.c:664) 
[   22.732947] common_interrupt (arch/x86/kernel/irq.c:240 (discriminator 14)) 
[   22.735513]  
[   22.737879]  
[   22.740141] asm_common_interrupt (./arch/x86/include/asm/idtentry.h:640) 
[   22.742498] RIP: 0010:stack_trace_consume_entry (kernel/stacktrace.c:83) 
[ 22.744891] Code: be 80 01 00 00 48 c7 c7 40 82 cd 82 48 89 e5 e8 7d 38 53 00 
5d c3 cc cc cc cc cc cc cc cc cc cc cc 0f 1f 44 00 00 55 48 89 e5 <41> 55 49 89 
f5 41 54 53 48 89 fb 48 83 c7 10 e8 23 e0 36 00 48 8d
All code

   0:   be 80 01 00 00  mov$0x180,%esi
   5:   48 c7 c7 40 82 cd 82mov$0x82cd8240,%rdi
   c:   48 89 e5mov%rsp,%rbp
   f:   e8 7d 38 53 00  call   0x533891
  14:   5d  pop%rbp
  15:   c3  ret
  16:   cc  int3   
  17:   cc  int3   
  18:   cc  int3   
  19:   cc  int3   
  1a:   cc  int3   
  1b:   cc  int3   
  1c:   cc  int3   
  1d:   cc  int3   
  1e:   cc  int3   
  1f:   cc  int3   
  20:   cc  int3   
  21:   0f 1f 44 00 00  nopl   0x0(%rax,%rax,1)
  26:   55  push   %rbp
  27:   48 89 e5mov%rsp,%rbp
  2a:*  41 55   push   %r13 <-- trapping instruction
  2c:   49 89 f5mov%rsi,%r13
  2f:   41 54   push   %r12
  31:   53  push   %rbx
  32:   48 89 fbmov%rdi,%rbx
  35:   48 83 c7 10 add$0x10,%rdi
  39:   e8 23 e0 36 00  call   0x36e061
  3e:   48  rex.W
  3f:   8d  .byte 0x8d

Code starting with the faulting instruction
===
   0:   41 55   push   %r13
   2:   49 89 f5mov%rsi,%r13
   5:   41 54   push   %r12
   7:   53  push   %rbx
   8:   48 89 fbmov%rdi,%rbx
   b:   48 83 c7 10 add$0x10,%rdi
   f:   e8 23 e0 36 00  call   0x36e037
  14:   48 

Re: [PATCH v1 2/2] Drivers: virtio: balloon: Report inflated memory

2022-08-15 Thread Nadav Amit via Virtualization
On Aug 15, 2022, at 5:52 AM, Alexander Atanasov wrote:

> 
> Hi,
> 
> On 9.08.22 20:44, Nadav Amit wrote:
>> On Aug 9, 2022, at 2:53 AM, Alexander Atanasov 
>>  wrote:
>> 
>>> Update the value in page_alloc on balloon fill/leak.
>> 
>> Some general comments if this patch goes forward.
>> 
>> Please cc pv-driv...@vmware.com in the future.
> 
> Ok.
> 
>>> Cc: David Hildenbrand 
>>> Cc: Wei Liu 
>>> Cc: Nadav Amit 
>>> 
>>> Signed-off-by: Alexander Atanasov 
>>> ---
>>> drivers/virtio/virtio_balloon.c | 13 +
>>> 1 file changed, 13 insertions(+)
>>> 
>>> First user; the other balloons I will do if it is accepted, to avoid too
>>> many emails.
>>> 
>>> 
>>> diff --git a/drivers/virtio/virtio_balloon.c 
>>> b/drivers/virtio/virtio_balloon.c
>>> index b9737da6c4dd..e2693ffbd48b 100644
>>> --- a/drivers/virtio/virtio_balloon.c
>>> +++ b/drivers/virtio/virtio_balloon.c
>>> @@ -208,6 +208,16 @@ static void set_page_pfns(struct virtio_balloon *vb,
>>>   page_to_balloon_pfn(page) + i);
>>> }
>>> 
>>> +static void update_meminfo(struct virtio_balloon *vb)
>> 
>> Putting aside the less-than-optimal function name, I would like to ask that
> 
> Right, i will think of a better one.
> 
>> any new generic balloon logic would go into balloon_compaction.[hc] as much
> 
> If it is going to be a place for generic logic, maybe it should be
> renamed to balloon_common.[ch]?
> 
>> as possible. I made the effort to reuse this infrastructure (which is now
>> used by both VMware and virtio), and would prefer to share as much code as
>> possible.
>> 
>> For instance, I would appreciate if the update upon inflate would go into
>> balloon_page_list_enqueue() and balloon_page_enqueue(). VMware's 2MB pages
>> logic is not shared, so it would require a change that is specific for
>> VMware code.
> 
> I looked at the code and I do not see how I can reuse it, since
> BALLOON_COMPACTION can be disabled and, as you say, even for VMware it
> would require updates in other places. It looks like if I do so I would
> have to handle the update from each driver for both cases. I think it is
> better to clearly mark the spots where drivers do their internal
> recalculations and report to the core. I see only VMware is using
> balloon_page_list_enqueue, the virtio balloon is using only migration and
> I don't see how to hook it there - I haven't checked the rest of the
> balloons but I guess it would be similar. I agree it is good to have
> a common place for such logic, but it might be better off as separate
> work in the future.

Fine. I would live with whatever you do and if needed change it later.

Thanks for considering.

Regards,
Nadav

Re: [PATCH v1 2/2] Drivers: virtio: balloon: Report inflated memory

2022-08-09 Thread Nadav Amit via Virtualization


On Aug 9, 2022, at 2:53 AM, Alexander Atanasov wrote:

> Update the value in page_alloc on balloon fill/leak.

Some general comments if this patch goes forward.

Please cc pv-driv...@vmware.com in the future.

> 
> Cc: David Hildenbrand 
> Cc: Wei Liu 
> Cc: Nadav Amit 
> 
> Signed-off-by: Alexander Atanasov 
> ---
> drivers/virtio/virtio_balloon.c | 13 +
> 1 file changed, 13 insertions(+)
> 
> First user; the other balloons I will do if it is accepted, to avoid too
> many emails.
> 
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index b9737da6c4dd..e2693ffbd48b 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -208,6 +208,16 @@ static void set_page_pfns(struct virtio_balloon *vb,
> page_to_balloon_pfn(page) + i);
> }
> 
> +static void update_meminfo(struct virtio_balloon *vb)

Putting aside the less-than-optimal function name, I would like to ask that
any new generic balloon logic would go into balloon_compaction.[hc] as much
as possible. I made the effort to reuse this infrastructure (which is now
used by both VMware and virtio), and would prefer to share as much code as
possible.

For instance, I would appreciate if the update upon inflate would go into
balloon_page_list_enqueue() and balloon_page_enqueue(). VMware's 2MB pages
logic is not shared, so it would require a change that is specific for
VMware code.
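For example, the hook could sit directly in the common enqueue path, roughly like this (a sketch against mm/balloon_compaction.c; balloon_account_inflation() is a hypothetical helper, and whether it belongs inside or outside the page lock is a separate question):

void balloon_page_enqueue(struct balloon_dev_info *b_dev_info,
                          struct page *page)
{
        unsigned long flags;

        spin_lock_irqsave(&b_dev_info->pages_lock, flags);
        balloon_page_enqueue_one(b_dev_info, page);
        balloon_account_inflation(1);   /* new: account one inflated page */
        spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
}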



Re: [Pv-drivers] [PATCH 29/36] cpuidle, xenpv: Make more PARAVIRT_XXL noinstr clean

2022-06-13 Thread Nadav Amit via Virtualization
On Jun 13, 2022, at 11:48 AM, Srivatsa S. Bhat  wrote:

> 
> On 6/8/22 4:27 PM, Peter Zijlstra wrote:
>> vmlinux.o: warning: objtool: acpi_idle_enter_s2idle+0xde: call to wbinvd() 
>> leaves .noinstr.text section
>> vmlinux.o: warning: objtool: default_idle+0x4: call to arch_safe_halt() 
>> leaves .noinstr.text section
>> vmlinux.o: warning: objtool: xen_safe_halt+0xa: call to 
>> HYPERVISOR_sched_op.constprop.0() leaves .noinstr.text section
>> 
>> Signed-off-by: Peter Zijlstra (Intel) 
> 
> Reviewed-by: Srivatsa S. Bhat (VMware) 
> 
>> 
>> -static inline void wbinvd(void)
>> +extern noinstr void pv_native_wbinvd(void);
>> +
>> +static __always_inline void wbinvd(void)
>> {
>>  PVOP_ALT_VCALL0(cpu.wbinvd, "wbinvd", ALT_NOT(X86_FEATURE_XENPV));
>> }

I guess it is yet another instance of GCC's inaccurate accounting of
the assembly blocks' weight. It is not a solution for older
GCCs, but presumably PVOP_ALT_CALL() and friends should have
used asm_inline or some new "asm_volatile_inline" variant.
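For reference, the kind of annotation I mean (a sketch, not the real PVOP_* macros; whether older GCCs can benefit is the open question):

/*
 * "asm inline" (asm_inline in the kernel) makes GCC treat the statement
 * as minimum-size when deciding whether to inline the caller, instead of
 * counting the whole paravirt/alternative template text.
 */
#define EXAMPLE_PVOP_ASM(tmpl, outputs, inputs)         \
        asm_inline volatile(tmpl : outputs : inputs)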


Re: [PATCH 2/2] MAINTAINERS: Mark VMware mailing list entries as private

2021-11-10 Thread Nadav Amit


> On Nov 10, 2021, at 9:20 AM, Srivatsa S. Bhat  wrote:
> 
> On Tue, Nov 09, 2021 at 01:57:31PM -0800, Joe Perches wrote:
>> On Tue, 2021-11-09 at 00:58 +0000, Nadav Amit wrote:
>>>> On Nov 8, 2021, at 4:37 PM, Joe Perches  wrote:
>>>> On Mon, 2021-11-08 at 16:22 -0800, Srivatsa S. Bhat wrote:
>>>> 
>>>> So it's an exploder not an actual maintainer and it likely isn't
>>>> publically archived with any normal list mechanism.
>>>> 
>>>> So IMO "private" isn't appropriate.  Neither is "L:"
>>>> Perhaps just mark it as what it is as an "exploder".
>>>> 
>>>> Or maybe these blocks should be similar to:
>>>> 
>>>> M: Name of Lead Developer 
>>>> M: VMware  maintainers -maintain...@vmlinux.com>
>> 
>> Maybe adding entries like
>> 
>> M:   Named maintainer 
>> R:   VMware  reviewers -maintain...@vmware.com>
>> 
>> would be best/simplest.
>> 
> 
> Sure, that sounds good to me. I also considered adding "(email alias)"
> like Juergen suggested, but I think the R: entry is clear enough.
> Please find the updated patch below.
> 
> ---
> 
> From f66faa238facf504cfc66325912ce7af8cbf79ec Mon Sep 17 00:00:00 2001
> From: "Srivatsa S. Bhat (VMware)" 
> Date: Mon, 8 Nov 2021 11:46:57 -0800
> Subject: [PATCH v2 2/2] MAINTAINERS: Mark VMware mailing list entries as email
> aliases
> 
> VMware mailing lists in the MAINTAINERS file are private lists meant
> for VMware-internal review/notification for patches to the respective
> subsystems. Anyone can post to these addresses, but there is no public
> read access like open mailing lists, which makes them more like email
> aliases instead (to reach out to reviewers).
> 
> So update all the VMware mailing list references in the MAINTAINERS
> file to mark them as such, using "R: email-al...@vmware.com".
> 
> Signed-off-by: Srivatsa S. Bhat (VMware) 
> Cc: Zack Rusin 
> Cc: Nadav Amit 
> Cc: Vivek Thampi 
> Cc: Vishal Bhakta 
> Cc: Ronak Doshi 
> Cc: pv-driv...@vmware.com
> Cc: linux-graphics-maintai...@vmware.com
> Cc: dri-de...@lists.freedesktop.org
> Cc: linux-r...@vger.kernel.org
> Cc: linux-s...@vger.kernel.org
> Cc: net...@vger.kernel.org
> Cc: linux-in...@vger.kernel.org
> ---
> MAINTAINERS | 22 +++---
> 1 file changed, 11 insertions(+), 11 deletions(-)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 118cf8170d02..4372d79027e9 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -6134,8 +6134,8 @@ T:  git git://anongit.freedesktop.org/drm/drm-misc
> F:drivers/gpu/drm/vboxvideo/
> 
> DRM DRIVER FOR VMWARE VIRTUAL GPU
> -M:   "VMware Graphics" 
> M:Zack Rusin 
> +R:   VMware Graphics Reviewers 
> L:dri-de...@lists.freedesktop.org
> S:Supported
> T:git git://anongit.freedesktop.org/drm/drm-misc
> @@ -14189,7 +14189,7 @@ F:include/uapi/linux/ppdev.h
> PARAVIRT_OPS INTERFACE
> M:Juergen Gross 
> M:Srivatsa S. Bhat (VMware) 
> -L:   pv-driv...@vmware.com (private)
> +R:   VMware PV-Drivers Reviewers 

This patch that you just sent seems to go on top of the previous patches
(as it removes "L: pv-driv...@vmware.com (private)”).

Since the patches were still not merged, I would presume you should squash
the old 2/2 with this new patch and send v3 of these patches.


Re: [PATCH 2/2] MAINTAINERS: Mark VMware mailing list entries as private

2021-11-08 Thread Nadav Amit



> On Nov 8, 2021, at 4:37 PM, Joe Perches  wrote:
> 
> On Mon, 2021-11-08 at 16:22 -0800, Srivatsa S. Bhat wrote:
> 
> So it's an exploder not an actual maintainer and it likely isn't
> publically archived with any normal list mechanism.
> 
> So IMO "private" isn't appropriate.  Neither is "L:"
> Perhaps just mark it as what it is as an "exploder".
> 
> Or maybe these blocks should be similar to:
> 
> M:Name of Lead Developer 
> M:VMware  maintainers -maintain...@vmlinux.com>
> 
> Maybe something like a comment mechanism should be added to the
> MAINTAINERS file.
> 
> Maybe #
> 
> so this entry could be something like:
> 
> M:VMware  maintainers -maintain...@vmlinux.com> # 
> VMware's ever changing internal maintainers list

Admittedly, I do not care much about how it turns out.

But if it is modified, it should be very clear who the maintainer
is, and not entangle the mailing list with the maintainer.

I am personally not subscribed to the internal pv-drivers mailing
list, which is not just for memory ballooning, and is also listed
as a maintainer for vmmouse, pvscsi, vmxnet3 and others.

As I am the only maintainer of VMware balloon, if someone is
mistaken and sends an email only to the mailing list and not me,
he might be disappointed.



Re: [PATCH 2/2] MAINTAINERS: Mark VMware mailing list entries as private

2021-11-08 Thread Nadav Amit


> On Nov 8, 2021, at 12:30 PM, Srivatsa S. Bhat  wrote:
> 
> From: Srivatsa S. Bhat (VMware) 
> 
> VMware mailing lists in the MAINTAINERS file are private lists meant
> for VMware-internal review/notification for patches to the respective
> subsystems. So, in an earlier discussion [1][2], it was recommended to
> mark them as such. Update all the remaining VMware mailing list
> references to use that format -- "L: list@address (private)”.

Acked-by: Nadav Amit 




Re: [kvm-unit-tests PATCH 0/6] Initial x86_64 UEFI support

2021-08-18 Thread Nadav Amit


> On Aug 18, 2021, at 6:32 PM, Marc Orr  wrote:
> 
> This sounds great to us. We will also experiment with combining the
> two patchsets and report back when we have some experience with this.
> Though, please do also report back if you have an update on this
> before we do.

Just wondering (aka, my standard question): does it work on
bare-metal (putting aside tests that rely on test-devices)?



[PATCH v6 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2021-02-20 Thread Nadav Amit
From: Nadav Amit 

To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. Introduce
paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
and hyper-v are only compile-tested).

While the updated smp infrastructure is capable of running a function on
a single local core, it is not optimized for this case. The multiple
function calls and the indirect branch introduce some overhead, and
might make local TLB flushes slower than they were before the recent
changes.

Before calling the SMP infrastructure, check if only a local TLB flush
is needed to restore the lost performance in this common case. This
requires checking mm_cpumask() one more time, but unless this mask is
updated very frequently, this should not impact performance negatively.

Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Sasha Levin 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: Juergen Gross 
Cc: Paolo Bonzini 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Boris Ostrovsky 
Cc: linux-hyp...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: k...@vger.kernel.org
Cc: xen-de...@lists.xenproject.org
Reviewed-by: Michael Kelley  # Hyper-v parts
Reviewed-by: Juergen Gross  # Xen and paravirt parts
Reviewed-by: Dave Hansen 
Signed-off-by: Nadav Amit 

---
v5->v6:
* Use on_each_cpu_mask() instead of on_each_cpu_cond_mask() [PeterZ]
* Use cond_cpumask when needed instead of cpumask
* Rename remaining instance of native_flush_tlb_others()
---
 arch/x86/hyperv/mmu.c | 10 +++---
 arch/x86/include/asm/paravirt.h   |  6 ++--
 arch/x86/include/asm/paravirt_types.h |  4 +--
 arch/x86/include/asm/tlbflush.h   |  4 +--
 arch/x86/include/asm/trace/hyperv.h   |  2 +-
 arch/x86/kernel/kvm.c | 11 +--
 arch/x86/kernel/paravirt.c|  2 +-
 arch/x86/mm/tlb.c | 46 +--
 arch/x86/xen/mmu_pv.c | 11 +++
 include/trace/events/xen.h|  2 +-
 10 files changed, 57 insertions(+), 41 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index 2c87350c1fb0..681dba8de4f2 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -52,8 +52,8 @@ static inline int fill_gva_list(u64 gva_list[], int offset,
return gva_n - offset;
 }
 
-static void hyperv_flush_tlb_others(const struct cpumask *cpus,
-   const struct flush_tlb_info *info)
+static void hyperv_flush_tlb_multi(const struct cpumask *cpus,
+  const struct flush_tlb_info *info)
 {
int cpu, vcpu, gva_n, max_gvas;
struct hv_tlb_flush **flush_pcpu;
@@ -61,7 +61,7 @@ static void hyperv_flush_tlb_others(const struct cpumask 
*cpus,
u64 status = U64_MAX;
unsigned long flags;
 
-   trace_hyperv_mmu_flush_tlb_others(cpus, info);
+   trace_hyperv_mmu_flush_tlb_multi(cpus, info);
 
if (!hv_hypercall_pg)
goto do_native;
@@ -164,7 +164,7 @@ static void hyperv_flush_tlb_others(const struct cpumask 
*cpus,
if (!(status & HV_HYPERCALL_RESULT_MASK))
return;
 do_native:
-   native_flush_tlb_others(cpus, info);
+   native_flush_tlb_multi(cpus, info);
 }
 
 static u64 hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
@@ -239,6 +239,6 @@ void hyperv_setup_mmu_ops(void)
return;
 
pr_info("Using hypercall for remote TLB flush\n");
-   pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
+   pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 4abf110e2243..45b55e3e0630 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -50,7 +50,7 @@ static inline void slow_down_io(void)
 void native_flush_tlb_local(void);
 void native_flush_tlb_global(void);
 void native_flush_tlb_one_user(unsigned long addr);
-void native_flush_tlb_others(const struct cpumask *cpumask,
+void native_flush_tlb_multi(const struct cpumask *cpumask,
 const struct flush_tlb_info *info);
 
 static inline void __flush_tlb_local(void)
@@ -68,10 +68,10 @@ static inline void __flush_tlb_one_user(unsigned long addr)
PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
 }
 
-static inline void __flush_tlb_others(const struct cpumask *cpumask,
+static inline void __flush_tlb_multi(const struct cpumask *cpumask,
  const struct flush_tlb_info *info)
 {
-   PVOP_VCALL2(mmu.flush_tlb_others, cpumask, info);
+   PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
 }
 
 static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb, void 
*table)
diff --git a/arch/x86/include/asm/p

[PATCH v6 0/9] x86/tlb: Concurrent TLB flushes

2021-02-20 Thread Nadav Amit
From: Nadav Amit 

The series improves TLB shootdown by flushing the local TLB concurrently
with remote TLBs, overlapping the IPI delivery time with the local
flush. Performance numbers can be found in the previous version [1].

v5 was rebased on 5.11 (long time after v4), and had some bugs and
embarrassing build errors. Peter Zijlstra and Christoph Hellwig had some
comments as well. These issues were addressed (excluding one 82-chars
line that I left). Based on their feedback, an additional patch was also
added to reuse on_each_cpu_cond_mask() code and avoid unnecessary calls
by inlining.

KernelCI showed RCU stalls on arm64, which I could not figure out from
the kernel splat. If this issue persists, I would appreciate it someone
can assist in debugging or at least provide the output when running the
kernel with CONFIG_CSD_LOCK_WAIT_DEBUG=Y.

[1] https://lore.kernel.org/lkml/20190823224153.15223-1-na...@vmware.com/

v5 -> v6:
* Address build warnings due to rebase mistakes
* Reuse code from on_each_cpu_cond_mask() and inline [PeterZ]
* Fix some style issues [Hellwig]

v4 -> v5:
* Rebase on 5.11
* Move concurrent smp logic to smp_call_function_many_cond() 
* Remove SGI-UV patch which is not needed anymore

v3 -> v4:
* Merge flush_tlb_func_local and flush_tlb_func_remote() [Peter]
* Prevent preemption on_each_cpu(). It is not needed, but it prevents
  concerns. [Peter/tglx]
* Adding acked-, review-by tags

v2 -> v3:
* Open-code the remote/local-flush decision code [Andy]
* Fix hyper-v, Xen implementations [Andrew]
* Fix redundant TLB flushes.

v1 -> v2:
* Removing the patches that Thomas took [tglx]
* Adding hyper-v, Xen compile-tested implementations [Dave]
* Removing UV [Andy]
* Adding lazy optimization, removing inline keyword [Dave]
* Restructuring patch-set

RFCv2 -> v1:
* Fix comment on flush_tlb_multi [Juergen]
* Removing async invalidation optimizations [Andy]
* Adding KVM support [Paolo]

Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Boris Ostrovsky 
Cc: Dave Hansen 
Cc: Haiyang Zhang 
Cc: Ingo Molnar 
Cc: Josh Poimboeuf 
Cc: Juergen Gross 
Cc: "K. Y. Srinivasan" 
Cc: Paolo Bonzini 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Sasha Levin 
Cc: Stephen Hemminger 
Cc: Thomas Gleixner 
Cc: k...@vger.kernel.org
Cc: linux-hyp...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: x...@kernel.org
Cc: xen-de...@lists.xenproject.org

Nadav Amit (9):
  smp: Run functions concurrently in smp_call_function_many_cond()
  x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote()
  x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
  x86/mm/tlb: Flush remote and local TLBs concurrently
  x86/mm/tlb: Privatize cpu_tlbstate
  x86/mm/tlb: Do not make is_lazy dirty for no reason
  cpumask: Mark functions as pure
  x86/mm/tlb: Remove unnecessary uses of the inline keyword
  smp: inline on_each_cpu_cond() and on_each_cpu()

 arch/x86/hyperv/mmu.c |  10 +-
 arch/x86/include/asm/paravirt.h   |   6 +-
 arch/x86/include/asm/paravirt_types.h |   4 +-
 arch/x86/include/asm/tlbflush.h   |  48 ---
 arch/x86/include/asm/trace/hyperv.h   |   2 +-
 arch/x86/kernel/alternative.c |   2 +-
 arch/x86/kernel/kvm.c |  11 +-
 arch/x86/kernel/paravirt.c|   2 +-
 arch/x86/mm/init.c|   2 +-
 arch/x86/mm/tlb.c | 176 +--
 arch/x86/xen/mmu_pv.c |  11 +-
 include/linux/cpumask.h   |   6 +-
 include/linux/smp.h   |  50 +--
 include/trace/events/xen.h|   2 +-
 kernel/smp.c  | 196 +++---
 15 files changed, 278 insertions(+), 250 deletions(-)

-- 
2.25.1



Re: [PATCH v5 4/8] x86/mm/tlb: Flush remote and local TLBs concurrently

2021-02-16 Thread Nadav Amit
> On Feb 16, 2021, at 4:10 AM, Peter Zijlstra  wrote:
> 
> On Tue, Feb 09, 2021 at 02:16:49PM -0800, Nadav Amit wrote:
>> @@ -816,8 +821,8 @@ STATIC_NOPV void native_flush_tlb_others(const struct 
>> cpumask *cpumask,
>>   * doing a speculative memory access.
>>   */
>>  if (info->freed_tables) {
>> -smp_call_function_many(cpumask, flush_tlb_func,
>> -   (void *)info, 1);
>> +on_each_cpu_cond_mask(NULL, flush_tlb_func, (void *)info, true,
>> +  cpumask);
>>  } else {
>>  /*
>>   * Although we could have used on_each_cpu_cond_mask(),
>> @@ -844,14 +849,15 @@ STATIC_NOPV void native_flush_tlb_others(const struct 
>> cpumask *cpumask,
>>  if (tlb_is_not_lazy(cpu))
>>  __cpumask_set_cpu(cpu, cond_cpumask);
>>  }
>> -smp_call_function_many(cond_cpumask, flush_tlb_func, (void 
>> *)info, 1);
>> +on_each_cpu_cond_mask(NULL, flush_tlb_func, (void *)info, true,
>> +  cpumask);
>>  }
>> }
> 
> Surely on_each_cpu_mask() is more appropriate? There the compiler can do
> the NULL propagation because it's on the same TU.
> 
> --- a/arch/x86/mm/tlb.c
> +++ b/arch/x86/mm/tlb.c
> @@ -821,8 +821,7 @@ STATIC_NOPV void native_flush_tlb_multi(
>* doing a speculative memory access.
>*/
>   if (info->freed_tables) {
> - on_each_cpu_cond_mask(NULL, flush_tlb_func, (void *)info, true,
> -   cpumask);
> + on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
>   } else {
>   /*
>* Although we could have used on_each_cpu_cond_mask(),
> @@ -849,8 +848,7 @@ STATIC_NOPV void native_flush_tlb_multi(
>   if (tlb_is_not_lazy(cpu))
>   __cpumask_set_cpu(cpu, cond_cpumask);
>   }
> - on_each_cpu_cond_mask(NULL, flush_tlb_func, (void *)info, true,
> -   cpumask);
> + on_each_cpu_mask(cpumask, flush_tlb_func, (void *)info, true);
>   }
> }

Indeed, and there is actually an additional bug - I used cpumask in the
second on_each_cpu_cond_mask() instead of cond_cpumask.
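For completeness, with Peter's change applied, the corrected second call site would then be:

        /* in the non-freed_tables branch, after building cond_cpumask: */
        on_each_cpu_mask(cond_cpumask, flush_tlb_func, (void *)info, true);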



[PATCH v5 4/8] x86/mm/tlb: Flush remote and local TLBs concurrently

2021-02-09 Thread Nadav Amit
From: Nadav Amit 

To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. Introduce
paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
and hyper-v are only compile-tested).

While the updated smp infrastructure is capable of running a function on
a single local core, it is not optimized for this case. The multiple
function calls and the indirect branch introduce some overhead, and
might make local TLB flushes slower than they were before the recent
changes.

Before calling the SMP infrastructure, check if only a local TLB flush
is needed to restore the lost performance in this common case. This
requires checking mm_cpumask() one more time, but unless this mask is
updated very frequently, this should not impact performance negatively.

Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Sasha Levin 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: Juergen Gross 
Cc: Paolo Bonzini 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Boris Ostrovsky 
Cc: linux-hyp...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: k...@vger.kernel.org
Cc: xen-de...@lists.xenproject.org
Reviewed-by: Michael Kelley  # Hyper-v parts
Reviewed-by: Juergen Gross  # Xen and paravirt parts
Reviewed-by: Dave Hansen 
Signed-off-by: Nadav Amit 
---
 arch/x86/hyperv/mmu.c | 10 +++---
 arch/x86/include/asm/paravirt.h   |  6 ++--
 arch/x86/include/asm/paravirt_types.h |  4 +--
 arch/x86/include/asm/tlbflush.h   |  4 +--
 arch/x86/include/asm/trace/hyperv.h   |  2 +-
 arch/x86/kernel/kvm.c | 11 --
 arch/x86/mm/tlb.c | 49 +--
 arch/x86/xen/mmu_pv.c | 11 +++---
 include/trace/events/xen.h|  2 +-
 9 files changed, 58 insertions(+), 41 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index 2c87350c1fb0..681dba8de4f2 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -52,8 +52,8 @@ static inline int fill_gva_list(u64 gva_list[], int offset,
return gva_n - offset;
 }
 
-static void hyperv_flush_tlb_others(const struct cpumask *cpus,
-   const struct flush_tlb_info *info)
+static void hyperv_flush_tlb_multi(const struct cpumask *cpus,
+  const struct flush_tlb_info *info)
 {
int cpu, vcpu, gva_n, max_gvas;
struct hv_tlb_flush **flush_pcpu;
@@ -61,7 +61,7 @@ static void hyperv_flush_tlb_others(const struct cpumask 
*cpus,
u64 status = U64_MAX;
unsigned long flags;
 
-   trace_hyperv_mmu_flush_tlb_others(cpus, info);
+   trace_hyperv_mmu_flush_tlb_multi(cpus, info);
 
if (!hv_hypercall_pg)
goto do_native;
@@ -164,7 +164,7 @@ static void hyperv_flush_tlb_others(const struct cpumask 
*cpus,
if (!(status & HV_HYPERCALL_RESULT_MASK))
return;
 do_native:
-   native_flush_tlb_others(cpus, info);
+   native_flush_tlb_multi(cpus, info);
 }
 
 static u64 hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
@@ -239,6 +239,6 @@ void hyperv_setup_mmu_ops(void)
return;
 
pr_info("Using hypercall for remote TLB flush\n");
-   pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
+   pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index f8dce11d2bc1..515e49204c8b 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -50,7 +50,7 @@ static inline void slow_down_io(void)
 void native_flush_tlb_local(void);
 void native_flush_tlb_global(void);
 void native_flush_tlb_one_user(unsigned long addr);
-void native_flush_tlb_others(const struct cpumask *cpumask,
+void native_flush_tlb_multi(const struct cpumask *cpumask,
 const struct flush_tlb_info *info);
 
 static inline void __flush_tlb_local(void)
@@ -68,10 +68,10 @@ static inline void __flush_tlb_one_user(unsigned long addr)
PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
 }
 
-static inline void __flush_tlb_others(const struct cpumask *cpumask,
+static inline void __flush_tlb_multi(const struct cpumask *cpumask,
  const struct flush_tlb_info *info)
 {
-   PVOP_VCALL2(mmu.flush_tlb_others, cpumask, info);
+   PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
 }
 
 static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb, void 
*table)
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index b6b02b7c19cc..541fe7193526 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -201,8 +201,8 @@ struct pv_mmu_ops {
voi

[PATCH v5 0/8] x86/tlb: Concurrent TLB flushes

2021-02-09 Thread Nadav Amit
From: Nadav Amit 

This is a respin of a rebased version of an old series, which I did not
follow up on, as I was preoccupied with personal issues (sorry).

The series improves TLB shootdown by flushing the local TLB concurrently
with remote TLBs, overlapping the IPI delivery time with the local
flush. Performance numbers can be found in the previous version [1].

The patches are essentially the same, but rebasing on the last version
required some changes. I left the reviewed-by tags - if anyone considers
it inappropriate, please let me know (and you have my apology).

[1] https://lore.kernel.org/lkml/20190823224153.15223-1-na...@vmware.com/

v4 -> v5:
* Rebase on 5.11
* Move concurrent smp logic to smp_call_function_many_cond() 
* Remove SGI-UV patch which is not needed anymore

v3 -> v4:
* Merge flush_tlb_func_local and flush_tlb_func_remote() [Peter]
* Prevent preemption on_each_cpu(). It is not needed, but it prevents
  concerns. [Peter/tglx]
* Adding acked-, review-by tags

v2 -> v3:
* Open-code the remote/local-flush decision code [Andy]
* Fix hyper-v, Xen implementations [Andrew]
* Fix redundant TLB flushes.

v1 -> v2:
* Removing the patches that Thomas took [tglx]
* Adding hyper-v, Xen compile-tested implementations [Dave]
* Removing UV [Andy]
* Adding lazy optimization, removing inline keyword [Dave]
* Restructuring patch-set

RFCv2 -> v1:
* Fix comment on flush_tlb_multi [Juergen]
* Removing async invalidation optimizations [Andy]
* Adding KVM support [Paolo]

Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Boris Ostrovsky 
Cc: Dave Hansen 
Cc: Haiyang Zhang 
Cc: Ingo Molnar 
Cc: Josh Poimboeuf 
Cc: Juergen Gross 
Cc: "K. Y. Srinivasan" 
Cc: Paolo Bonzini 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Sasha Levin 
Cc: Stephen Hemminger 
Cc: Thomas Gleixner 
Cc: k...@vger.kernel.org
Cc: linux-hyp...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: x...@kernel.org
Cc: xen-de...@lists.xenproject.org

Nadav Amit (8):
  smp: Run functions concurrently in smp_call_function_many_cond()
  x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote()
  x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
  x86/mm/tlb: Flush remote and local TLBs concurrently
  x86/mm/tlb: Privatize cpu_tlbstate
  x86/mm/tlb: Do not make is_lazy dirty for no reason
  cpumask: Mark functions as pure
  x86/mm/tlb: Remove unnecessary uses of the inline keyword

 arch/x86/hyperv/mmu.c |  10 +-
 arch/x86/include/asm/paravirt.h   |   6 +-
 arch/x86/include/asm/paravirt_types.h |   4 +-
 arch/x86/include/asm/tlbflush.h   |  48 +++
 arch/x86/include/asm/trace/hyperv.h   |   2 +-
 arch/x86/kernel/alternative.c |   2 +-
 arch/x86/kernel/kvm.c |  11 +-
 arch/x86/mm/init.c|   2 +-
 arch/x86/mm/tlb.c | 177 +++---
 arch/x86/xen/mmu_pv.c |  11 +-
 include/linux/cpumask.h   |   6 +-
 include/trace/events/xen.h|   2 +-
 kernel/smp.c  | 148 +++--
 13 files changed, 242 insertions(+), 187 deletions(-)

-- 
2.25.1



Re: [RFC for Linux] virtio_balloon: Add VIRTIO_BALLOON_F_THP_ORDER to handle THP spilt issue

2020-03-31 Thread Nadav Amit via Virtualization
> On Mar 31, 2020, at 7:09 AM, David Hildenbrand  wrote:
> 
> On 31.03.20 16:07, Michael S. Tsirkin wrote:
>> On Tue, Mar 31, 2020 at 04:03:18PM +0200, David Hildenbrand wrote:
>>> On 31.03.20 15:37, Michael S. Tsirkin wrote:
 On Tue, Mar 31, 2020 at 03:32:05PM +0200, David Hildenbrand wrote:
> On 31.03.20 15:24, Michael S. Tsirkin wrote:
>> On Tue, Mar 31, 2020 at 12:35:24PM +0200, David Hildenbrand wrote:
>>> On 26.03.20 10:49, Michael S. Tsirkin wrote:
 On Thu, Mar 26, 2020 at 08:54:04AM +0100, David Hildenbrand wrote:
>> Am 26.03.2020 um 08:21 schrieb Michael S. Tsirkin :
>> 
>> On Thu, Mar 12, 2020 at 09:51:25AM +0100, David Hildenbrand wrote:
 On 12.03.20 09:47, Michael S. Tsirkin wrote:
 On Thu, Mar 12, 2020 at 09:37:32AM +0100, David Hildenbrand wrote:
> 2. You are essentially stealing THPs in the guest. So the fastest
> mapping (THP in guest and host) is gone. The guest won't be able 
> to make
> use of THP where it previously was able to. I can imagine this 
> implies a
> performance degradation for some workloads. This needs a proper
> performance evaluation.
 
 I think the problem is more with the alloc_pages API.
 That gives you exactly the given order, and if there's
 a larger chunk available, it will split it up.
 
 But for balloon - I suspect lots of other users,
 we do not want to stress the system but if a large
 chunk is available anyway, then we could handle
 that more optimally by getting it all in one go.
 
 
 So if we want to address this, IMHO this calls for a new API.
 Along the lines of
 
   struct page *alloc_page_range(gfp_t gfp, unsigned int min_order,
   unsigned int max_order, unsigned int *order)
 
 the idea would then be to return at a number of pages in the given
 range.
 
 What do you think? Want to try implementing that?
>>> 
>>> You can just start with the highest order and decrement the order 
>>> until
>>> your allocation succeeds using alloc_pages(), which would be enough 
>>> for
>>> a first version. At least I don't see the immediate need for a new
>>> kernel API.
>> 
>> OK I remember now.  The problem is with reclaim. Unless reclaim is
>> completely disabled, any of these calls can sleep. After it wakes up,
>> we would like to get the larger order that has become available
>> meanwhile.
> 
> Yes, but that‘s a pure optimization IMHO.
> So I think we should do a trivial implementation first and then see 
> what we gain from a new allocator API. Then we might also be able to 
> justify it using real numbers.
 
 Well how do you propose implement the necessary semantics?
 I think we are both agreed that alloc_page_range is more or
 less what's necessary anyway - so how would you approximate it
 on top of existing APIs?
>>> diff --git a/include/linux/balloon_compaction.h 
>>> b/include/linux/balloon_compaction.h
 
 .
 
 
>>> diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
>>> index 26de020aae7b..067810b32813 100644
>>> --- a/mm/balloon_compaction.c
>>> +++ b/mm/balloon_compaction.c
>>> @@ -112,23 +112,35 @@ size_t balloon_page_list_dequeue(struct 
>>> balloon_dev_info *b_dev_info,
>>> EXPORT_SYMBOL_GPL(balloon_page_list_dequeue);
>>> 
>>> /*
>>> - * balloon_page_alloc - allocates a new page for insertion into the 
>>> balloon
>>> - * page list.
>>> + * balloon_pages_alloc - allocates a new page (of at most the given 
>>> order)
>>> + *  for insertion into the balloon page list.
>>>  *
>>>  * Driver must call this function to properly allocate a new balloon 
>>> page.
>>>  * Driver must call balloon_page_enqueue before definitively removing 
>>> the page
>>>  * from the guest system.
>>>  *
>>> + * Will fall back to smaller orders if allocation fails. The order of 
>>> the
>>> + * allocated page is stored in page->private.
>>> + *
>>>  * Return: struct page for the allocated page or NULL on allocation 
>>> failure.
>>>  */
>>> -struct page *balloon_page_alloc(void)
>>> +struct page *balloon_pages_alloc(int order)
>>> {
>>> -   struct page *page = alloc_page(balloon_mapping_gfp_mask() |
>>> -  __GFP_NOMEMALLOC | __GFP_NORETRY 
>>> |
>>> -  __GFP_NOWARN);
>>> -   return page;

Re: [RFC for Linux] virtio_balloon: Add VIRTIO_BALLOON_F_THP_ORDER to handle THP spilt issue

2020-03-31 Thread Nadav Amit via Virtualization
> On Mar 31, 2020, at 6:32 AM, David Hildenbrand  wrote:
> 
> On 31.03.20 15:24, Michael S. Tsirkin wrote:
>> On Tue, Mar 31, 2020 at 12:35:24PM +0200, David Hildenbrand wrote:
>>> On 26.03.20 10:49, Michael S. Tsirkin wrote:
 On Thu, Mar 26, 2020 at 08:54:04AM +0100, David Hildenbrand wrote:
>> Am 26.03.2020 um 08:21 schrieb Michael S. Tsirkin :
>> 
>> On Thu, Mar 12, 2020 at 09:51:25AM +0100, David Hildenbrand wrote:
 On 12.03.20 09:47, Michael S. Tsirkin wrote:
 On Thu, Mar 12, 2020 at 09:37:32AM +0100, David Hildenbrand wrote:
> 2. You are essentially stealing THPs in the guest. So the fastest
> mapping (THP in guest and host) is gone. The guest won't be able to 
> make
> use of THP where it previously was able to. I can imagine this 
> implies a
> performance degradation for some workloads. This needs a proper
> performance evaluation.
 
 I think the problem is more with the alloc_pages API.
 That gives you exactly the given order, and if there's
 a larger chunk available, it will split it up.
 
 But for balloon - I suspect lots of other users,
 we do not want to stress the system but if a large
 chunk is available anyway, then we could handle
 that more optimally by getting it all in one go.
 
 
 So if we want to address this, IMHO this calls for a new API.
 Along the lines of
 
   struct page *alloc_page_range(gfp_t gfp, unsigned int min_order,
   unsigned int max_order, unsigned int *order)
 
 the idea would then be to return at a number of pages in the given
 range.
 
 What do you think? Want to try implementing that?
>>> 
>>> You can just start with the highest order and decrement the order until
>>> your allocation succeeds using alloc_pages(), which would be enough for
>>> a first version. At least I don't see the immediate need for a new
>>> kernel API.
>> 
>> OK I remember now.  The problem is with reclaim. Unless reclaim is
>> completely disabled, any of these calls can sleep. After it wakes up,
>> we would like to get the larger order that has become available
>> meanwhile.
> 
> Yes, but that‘s a pure optimization IMHO.
> So I think we should do a trivial implementation first and then see what 
> we gain from a new allocator API. Then we might also be able to justify 
> it using real numbers.
 
 Well how do you propose implement the necessary semantics?
 I think we are both agreed that alloc_page_range is more or
 less what's necessary anyway - so how would you approximate it
 on top of existing APIs?
>>> 
>>> Looking at drivers/misc/vmw_balloon.c:vmballoon_inflate(), it first
>>> tries to allocate huge pages using
>>> 
>>> alloc_pages(__GFP_HIGHMEM|__GFP_NOWARN| __GFP_NOMEMALLOC, 
>>>VMW_BALLOON_2M_ORDER)
>>> 
>>> And then falls back to 4k allocations (balloon_page_alloc()) in case
>>> allocation fails.
>>> 
>>> I'm roughly thinking of something like the following, but with an
>>> optimized reporting interface/bigger pfn array so we can report >
>>> 1MB at a time. Also, it might make sense to remember the order that
>>> succeeded across some fill_balloon() calls.
>>> 
>>> Don't even expect it to compile ...
>>> 
>>> 
>>> 
 From 4305f989672ccca4be9293e6d4167e929f3e299b Mon Sep 17 00:00:00 2001
>>> From: David Hildenbrand 
>>> Date: Tue, 31 Mar 2020 12:28:07 +0200
>>> Subject: [PATCH RFC] tmp
>>> 
>>> Signed-off-by: David Hildenbrand 
>>> ---
>>> drivers/virtio/virtio_balloon.c| 38 ++
>>> include/linux/balloon_compaction.h |  7 -
>>> mm/balloon_compaction.c| 43 +++---
>>> 3 files changed, 67 insertions(+), 21 deletions(-)
>>> 
>>> diff --git a/drivers/virtio/virtio_balloon.c 
>>> b/drivers/virtio/virtio_balloon.c
>>> index 8511d258dbb4..0660b1b988f0 100644
>>> --- a/drivers/virtio/virtio_balloon.c
>>> +++ b/drivers/virtio/virtio_balloon.c
>>> @@ -187,7 +187,7 @@ int virtballoon_free_page_report(struct 
>>> page_reporting_dev_info *pr_dev_info,
>>> }
>>> 
>>> static void set_page_pfns(struct virtio_balloon *vb,
>>> - __virtio32 pfns[], struct page *page)
>>> + __virtio32 pfns[], struct page *page, int order)
>>> {
>>> unsigned int i;
>>> 
>>> @@ -197,7 +197,7 @@ static void set_page_pfns(struct virtio_balloon *vb,
>>>  * Set balloon pfns pointing at this page.
>>>  * Note that the first pfn points at start of the page.
>>>  */
>>> -   for (i = 0; i < VIRTIO_BALLOON_PAGES_PER_PAGE; i++)
>>> +   for (i = 0; i < VIRTIO_BALLOON_PAGES_PER_PAGE * (1 << order); i++)
>>> pfns[i] = cpu_to_virtio32(vb->vdev,
>>>   page_to_balloon_pfn(page) + i);
>>> }
>>> @@ 

Re: Balloon pressuring page cache

2020-02-04 Thread Nadav Amit via Virtualization
> On Feb 3, 2020, at 2:50 PM, Nadav Amit  wrote:
> 
>> On Feb 3, 2020, at 8:34 AM, David Hildenbrand  wrote:
>> 
>> On 03.02.20 17:18, Alexander Duyck wrote:
>>> On Mon, 2020-02-03 at 08:11 -0500, Michael S. Tsirkin wrote:
>>>> On Thu, Jan 30, 2020 at 11:59:46AM -0800, Tyler Sanderson wrote:
>>>>> On Thu, Jan 30, 2020 at 7:31 AM Wang, Wei W  wrote:
>>>>> 
>>>>>   On Thursday, January 30, 2020 11:03 PM, David Hildenbrand wrote:
>>>>>> On 29.01.20 20:11, Tyler Sanderson wrote:
>>>>>>> On Wed, Jan 29, 2020 at 2:31 AM David Hildenbrand >>>>>> <mailto:da...@redhat.com>> wrote:
>>>>>>> 
>>>>>>>   On 29.01.20 01:22, Tyler Sanderson via Virtualization wrote:
>>>>>>>> A primary advantage of virtio balloon over other memory reclaim
>>>>>>>> mechanisms is that it can pressure the guest's page cache into
>>>>>>>   shrinking.
>>>>>>>> However, since the balloon driver changed to using the shrinker
>>>>>   API
>>>>>> <https://github.com/torvalds/linux/commit/71994620bb25a8b109388fefa9
>>>>>> e99a28e355255a#diff-fd202acf694d9eba19c8c64da3e480c9> this
>>>>>>>> use case has become a bit more tricky. I'm wondering what the
>>>>>> intended
>>>>>>>> device implementation is.
>>>>>>>> 
>>>>>>>> When inflating the balloon against page cache (i.e. no free
>>>>>   memory
>>>>>>>> remains) vmscan.c will both shrink page cache, but also invoke
>>>>>   the
>>>>>>>> shrinkers -- including the balloon's shrinker. So the balloon
>>>>>   driver
>>>>>>>> allocates memory which requires reclaim, vmscan gets this memory
>>>>>> by
>>>>>>>> shrinking the balloon, and then the driver adds the memory back
>>>>>   to
>>>>>> the
>>>>>>>> balloon. Basically a busy no-op.
>>>>> 
>>>>>   Per my understanding, the balloon allocation won’t invoke shrinker as
>>>>>   __GFP_DIRECT_RECLAIM isn't set, no?
>>>>> 
>>>>> I could be wrong about the mechanism, but the device sees lots of 
>>>>> activity on
>>>>> the deflate queue. The balloon is being shrunk. And this only starts once 
>>>>> all
>>>>> free memory is depleted and we're inflating into page cache.
>>>> 
>>>> So given this looks like a regression, maybe we should revert the
>>>> patch in question 71994620bb25 ("virtio_balloon: replace oom notifier with 
>>>> shrinker")
>>>> Besides, with VIRTIO_BALLOON_F_FREE_PAGE_HINT
>>>> shrinker also ignores VIRTIO_BALLOON_F_MUST_TELL_HOST which isn't nice
>>>> at all.
>>>> 
>>>> So it looks like all this rework introduced more issues than it
>>>> addressed ...
>>>> 
>>>> I also CC Alex Duyck for an opinion on this.
>>>> Alex, what do you use to put pressure on page cache?
>>> 
>>> I would say reverting probably makes sense. I'm not sure there is much
>>> value to having a shrinker running deflation when you are actively trying
>>> to increase the balloon. It would make more sense to wait until you are
>>> actually about to start hitting oom.
>> 
>> I think the shrinker makes sense for free page hinting feature
>> (everything on free_page_list).
>> 
>> So instead of only reverting, I think we should split it up and always
>> register the shrinker for VIRTIO_BALLOON_F_FREE_PAGE_HINT and the OOM
>> notifier (as before) for VIRTIO_BALLOON_F_MUST_TELL_HOST.
>> 
>> (Of course, adapting what is being done in the shrinker and in the OOM
>> notifier)
> 
> David,
> 
> Please keep me posted. I decided to adapt the same solution as the virtio
> balloon for the VMware balloon. If the verdict is that this is damaging and
> the OOM notifier should be used instead, I will submit patches to move to
> OOM notifier as well.

Adding some information for the record (if someone googles this thread):

In the VMware balloon driver, the shrinker is disabled by default since we
encountered a performance degradation in testing. I tried to avoid rapid
inflation/shrinker-deflation cycles by adding a timeout, but apparently it
did not help in avoiding the performance regression.

So there is no such issue in the VMware balloon driver, unless someone
intentionally enables the shrinker through a module parameter.
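
For illustration, gating shrinker registration on a module parameter looks
roughly like the sketch below; the names are hypothetical and do not match
the actual vmw_balloon.c code, and balloon_deflate() stands in for the
driver's own deflate path:

#include <linux/module.h>
#include <linux/shrinker.h>

static bool shrinker_enable;            /* off by default */
module_param(shrinker_enable, bool, 0444);
MODULE_PARM_DESC(shrinker_enable, "Register a shrinker to deflate the balloon under memory pressure");

static unsigned long balloon_pages;     /* pages currently inflated */

static unsigned long balloon_shrink_count(struct shrinker *s,
                                          struct shrink_control *sc)
{
        return balloon_pages ? balloon_pages : SHRINK_EMPTY;
}

static unsigned long balloon_shrink_scan(struct shrinker *s,
                                         struct shrink_control *sc)
{
        /* Hypothetical helper: deflates up to nr_to_scan pages and returns
         * how many were actually released. */
        return balloon_deflate(sc->nr_to_scan);
}

static struct shrinker balloon_shrinker = {
        .count_objects  = balloon_shrink_count,
        .scan_objects   = balloon_shrink_scan,
        .seeks          = DEFAULT_SEEKS,
};

static int balloon_maybe_register_shrinker(void)
{
        if (!shrinker_enable)
                return 0;       /* default: avoid inflate/shrinker-deflate churn */
        return register_shrinker(&balloon_shrinker);
}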


Re: Balloon pressuring page cache

2020-02-03 Thread Nadav Amit via Virtualization
> On Feb 3, 2020, at 8:34 AM, David Hildenbrand  wrote:
> 
> On 03.02.20 17:18, Alexander Duyck wrote:
>> On Mon, 2020-02-03 at 08:11 -0500, Michael S. Tsirkin wrote:
>>> On Thu, Jan 30, 2020 at 11:59:46AM -0800, Tyler Sanderson wrote:
 On Thu, Jan 30, 2020 at 7:31 AM Wang, Wei W  wrote:
 
On Thursday, January 30, 2020 11:03 PM, David Hildenbrand wrote:
> On 29.01.20 20:11, Tyler Sanderson wrote:
>> On Wed, Jan 29, 2020 at 2:31 AM David Hildenbrand > > wrote:
>> 
>>On 29.01.20 01:22, Tyler Sanderson via Virtualization wrote:
>>> A primary advantage of virtio balloon over other memory reclaim
>>> mechanisms is that it can pressure the guest's page cache into
>>shrinking.
>>> However, since the balloon driver changed to using the shrinker
API
>  e99a28e355255a#diff-fd202acf694d9eba19c8c64da3e480c9> this
>>> use case has become a bit more tricky. I'm wondering what the
> intended
>>> device implementation is.
>>> 
>>> When inflating the balloon against page cache (i.e. no free
memory
>>> remains) vmscan.c will both shrink page cache, but also invoke
the
>>> shrinkers -- including the balloon's shrinker. So the balloon
driver
>>> allocates memory which requires reclaim, vmscan gets this memory
> by
>>> shrinking the balloon, and then the driver adds the memory back
to
> the
>>> balloon. Basically a busy no-op.
 
Per my understanding, the balloon allocation won’t invoke shrinker as
__GFP_DIRECT_RECLAIM isn't set, no?
 
 I could be wrong about the mechanism, but the device sees lots of activity 
 on
 the deflate queue. The balloon is being shrunk. And this only starts once 
 all
 free memory is depleted and we're inflating into page cache.
>>> 
>>> So given this looks like a regression, maybe we should revert the
>>> patch in question 71994620bb25 ("virtio_balloon: replace oom notifier with 
>>> shrinker")
>>> Besides, with VIRTIO_BALLOON_F_FREE_PAGE_HINT
>>> shrinker also ignores VIRTIO_BALLOON_F_MUST_TELL_HOST which isn't nice
>>> at all.
>>> 
>>> So it looks like all this rework introduced more issues than it
>>> addressed ...
>>> 
>>> I also CC Alex Duyck for an opinion on this.
>>> Alex, what do you use to put pressure on page cache?
>> 
>> I would say reverting probably makes sense. I'm not sure there is much
>> value to having a shrinker running deflation when you are actively trying
>> to increase the balloon. It would make more sense to wait until you are
>> actually about to start hitting oom.
> 
> I think the shrinker makes sense for free page hinting feature
> (everything on free_page_list).
> 
> So instead of only reverting, I think we should split it up and always
> register the shrinker for VIRTIO_BALLOON_F_FREE_PAGE_HINT and the OOM
> notifier (as before) for VIRTIO_BALLOON_F_MUST_TELL_HOST.
> 
> (Of course, adapting what is being done in the shrinker and in the OOM
> notifier)

David,

Please keep me posted. I decided to adapt the same solution as the virtio
balloon for the VMware balloon. If the verdict is that this is damaging and
the OOM notifier should be used instead, I will submit patches to move to
OOM notifier as well.

Regards,
Nadav
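
For reference, the OOM-notifier approach discussed above amounts to roughly
the sketch below; leak_balloon_pages() is a hypothetical stand-in for the
driver's own deflate routine:

#include <linux/oom.h>
#include <linux/notifier.h>

#define OOM_DEFLATE_PAGES       256

static int balloon_oom_notify(struct notifier_block *nb, unsigned long dummy,
                              void *parm)
{
        unsigned long *freed = parm;

        /* Deflate a bounded number of pages and report how many were freed. */
        *freed += leak_balloon_pages(OOM_DEFLATE_PAGES);
        return NOTIFY_OK;
}

static struct notifier_block balloon_oom_nb = {
        .notifier_call  = balloon_oom_notify,
};

/* At probe:  err = register_oom_notifier(&balloon_oom_nb);  */
/* At remove: unregister_oom_notifier(&balloon_oom_nb);      */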


[PATCH v4 0/9] x86/tlb: Concurrent TLB flushes

2019-08-24 Thread Nadav Amit via Virtualization
[ Similar cover-letter to v3 with updated performance numbers on Skylake.
  Sorry for the time it took since the last version. ]

Currently, local and remote TLB flushes are not performed concurrently,
which introduces unnecessary overhead - each PTE flushing can take 100s
of cycles. This patch-set allows TLB flushes to be run concurrently:
first request the remote CPUs to initiate the flush, then run it
locally, and finally wait for the remote CPUs to finish their work.

In addition, there are various small optimizations to avoid, for
example, unwarranted false-sharing.

The proposed changes should also improve the performance of other
invocations of on_each_cpu(). Hopefully, no one has relied on this
behavior of on_each_cpu() that invoked functions first remotely and only
then locally [Peter says he remembers someone might do so, but without
further information it is hard to know how to address it].
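
To make the ordering change concrete, here is a minimal conceptual sketch of
the old and the new flow; the helper names are hypothetical placeholders and
do not match the real kernel/smp.c symbols:

/* Conceptual sketch only -- helper names are hypothetical. */
static void flush_old_order(const struct cpumask *cpus, void *info)
{
        send_remote_flush_ipis(cpus, info);     /* kick the remote CPUs */
        wait_for_remote_flushes(cpus);          /* ...and wait for them */
        run_local_flush(info);                  /* only then flush locally */
}

static void flush_new_order(const struct cpumask *cpus, void *info)
{
        send_remote_flush_ipis(cpus, info);     /* kick the remote CPUs */
        run_local_flush(info);                  /* overlap with remote work */
        wait_for_remote_flushes(cpus);          /* converge at the end */
}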

Running sysbench on dax/ext4 w/emulated-pmem, write-cache disabled on
2-socket, 56-logical-cores (28+SMT) Skylake, 5 repetitions:

sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
 --file-io-mode=mmap --threads=X --file-fsync-mode=fdatasync run

 Th.   tip-aug22 avg (stdev)   +patch-set avg (stdev)  change
 ---   -   --  --
 1  1152920 (7453)  1169469 (9059)  +1.4%
 2  1545832 (12555) 1584172 (10484) +2.4%
 4  2480703 (12039) 2518641 (12875) +1.5%
 8  3684486 (26007) 3840343 (44144) +4.2%
 16 4981567 (23565) 5125756 (15458) +2.8%
 32 5679542 (10116) 5887826 (6121)  +3.6%
 56 5630944 (17937) 5812514 (26264) +3.2%

(Note that on configurations with up to 28 threads numactl was used to
set all threads on socket 1, which explains the drop in performance when
going to 32 threads).

Running the same benchmark with security mitigations disabled (PTI,
Spectre, MDS):

 Th.   tip-aug22 avg (stdev)   +patch-set avg (stdev)  change
 ---   -   --  --
 1  1444119 (8524)  1469606 (10527) +1.7%
 2  1921540 (24169) 1961899 (14450) +2.1%
 4  3073716 (21786) 3199880 (16774) +4.1%
 8  4700698 (49534) 4802312 (11043) +2.1%
 16 6005180 (6366)  6006656 (31624)0%
 32 6826466 (10496) 6886622 (19110) +0.8%
 56 6832344 (13468) 6885586 (20646) +0.8%

The results are somewhat different than the results that have been obtained on
Haswell-X, which were sent before and the maximum performance improvement is
smaller. However, the performance improvement is significant.

v3 -> v4:
  * Merge flush_tlb_func_local and flush_tlb_func_remote() [Peter]
  * Prevent preemption in on_each_cpu(). It is not needed, but it prevents
concerns. [Peter/tglx]
  * Adding acked-, review-by tags

v2 -> v3:
* Open-code the remote/local-flush decision code [Andy]
* Fix hyper-v, Xen implementations [Andrew]
* Fix redundant TLB flushes.

v1 -> v2:
* Removing the patches that Thomas took [tglx]
* Adding hyper-v, Xen compile-tested implementations [Dave]
* Removing UV [Andy]
* Adding lazy optimization, removing inline keyword [Dave]
* Restructuring patch-set

RFCv2 -> v1:
* Fix comment on flush_tlb_multi [Juergen]
* Removing async invalidation optimizations [Andy]
* Adding KVM support [Paolo]

Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Boris Ostrovsky 
Cc: Dave Hansen 
Cc: Haiyang Zhang 
Cc: Ingo Molnar 
Cc: Josh Poimboeuf 
Cc: Juergen Gross 
Cc: "K. Y. Srinivasan" 
Cc: Paolo Bonzini 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Sasha Levin 
Cc: Stephen Hemminger 
Cc: Thomas Gleixner 
Cc: k...@vger.kernel.org
Cc: linux-hyp...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: x...@kernel.org
Cc: xen-de...@lists.xenproject.org

Nadav Amit (9):
  smp: Run functions concurrently in smp_call_function_many()
  x86/mm/tlb: Unify flush_tlb_func_local() and flush_tlb_func_remote()
  x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
  x86/mm/tlb: Flush remote and local TLBs concurrently
  x86/mm/tlb: Privatize cpu_tlbstate
  x86/mm/tlb: Do not make is_lazy dirty for no reason
  cpumask: Mark functions as pure
  x86/mm/tlb: Remove UV special case
  x86/mm/tlb: Remove unnecessary uses of the inline keyword

 arch/x86/hyperv/mmu.c |  10 +-
 arch/x86/include/asm/paravirt.h   |   6 +-
 arch/x86/include/asm/paravirt_types.h |   4 +-
 arch/x86/include/asm/tlbflush.h   |  52 +++
 arch/x86/include/asm/trace/hyperv.h   |   2 +-
 arch/x86/kernel/kvm.c |  11 +-
 arch/x86/kernel/paravirt.c|   2 +-
 arch/x86/mm/init.c|   2 +-
 arch/x86/mm/tlb.c | 195 ++
 arch/x86/xen/mmu_pv.c |  11 +-
 i

[PATCH v4 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-08-24 Thread Nadav Amit via Virtualization
To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. Introduce
paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
and hyper-v are only compile-tested).

While the updated smp infrastructure is capable of running a function on
a single local core, it is not optimized for this case. The multiple
function calls and the indirect branch introduce some overhead, and
might make local TLB flushes slower than they were before the recent
changes.

Before calling the SMP infrastructure, check if only a local TLB flush
is needed to restore the lost performance in this common case. This
requires checking mm_cpumask() one more time, but unless this mask is
updated very frequently, this should not impact performance negatively.
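
As a rough illustration of this fast path (simplified, not the exact hunk
from this patch):

        if (cpumask_any_but(mm_cpumask(mm), smp_processor_id()) >= nr_cpu_ids) {
                /* This CPU is the only one in mm_cpumask(): skip the SMP
                 * machinery and flush locally. */
                local_irq_disable();
                flush_tlb_func(info);
                local_irq_enable();
        } else {
                /* Concurrent remote + local flush. */
                flush_tlb_multi(mm_cpumask(mm), info);
        }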

Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Sasha Levin 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: Juergen Gross 
Cc: Paolo Bonzini 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Boris Ostrovsky 
Cc: linux-hyp...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: k...@vger.kernel.org
Cc: xen-de...@lists.xenproject.org
Reviewed-by: Michael Kelley  # Hyper-v parts
Reviewed-by: Juergen Gross  # Xen and paravirt parts
Reviewed-by: Dave Hansen 
Signed-off-by: Nadav Amit 
---
 arch/x86/hyperv/mmu.c | 10 +++---
 arch/x86/include/asm/paravirt.h   |  6 ++--
 arch/x86/include/asm/paravirt_types.h |  4 +--
 arch/x86/include/asm/tlbflush.h   |  8 ++---
 arch/x86/include/asm/trace/hyperv.h   |  2 +-
 arch/x86/kernel/kvm.c | 11 +--
 arch/x86/kernel/paravirt.c|  2 +-
 arch/x86/mm/tlb.c | 45 ++-
 arch/x86/xen/mmu_pv.c | 11 +++
 include/trace/events/xen.h|  2 +-
 10 files changed, 61 insertions(+), 40 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index e65d7fe6489f..8740d8b21db3 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -50,8 +50,8 @@ static inline int fill_gva_list(u64 gva_list[], int offset,
return gva_n - offset;
 }
 
-static void hyperv_flush_tlb_others(const struct cpumask *cpus,
-   const struct flush_tlb_info *info)
+static void hyperv_flush_tlb_multi(const struct cpumask *cpus,
+  const struct flush_tlb_info *info)
 {
int cpu, vcpu, gva_n, max_gvas;
struct hv_tlb_flush **flush_pcpu;
@@ -59,7 +59,7 @@ static void hyperv_flush_tlb_others(const struct cpumask 
*cpus,
u64 status = U64_MAX;
unsigned long flags;
 
-   trace_hyperv_mmu_flush_tlb_others(cpus, info);
+   trace_hyperv_mmu_flush_tlb_multi(cpus, info);
 
if (!hv_hypercall_pg)
goto do_native;
@@ -156,7 +156,7 @@ static void hyperv_flush_tlb_others(const struct cpumask 
*cpus,
if (!(status & HV_HYPERCALL_RESULT_MASK))
return;
 do_native:
-   native_flush_tlb_others(cpus, info);
+   native_flush_tlb_multi(cpus, info);
 }
 
 static u64 hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
@@ -231,6 +231,6 @@ void hyperv_setup_mmu_ops(void)
return;
 
pr_info("Using hypercall for remote TLB flush\n");
-   pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
+   pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 69089d46f128..bc4829c9b3f9 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -62,10 +62,10 @@ static inline void __flush_tlb_one_user(unsigned long addr)
PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
 }
 
-static inline void flush_tlb_others(const struct cpumask *cpumask,
-   const struct flush_tlb_info *info)
+static inline void flush_tlb_multi(const struct cpumask *cpumask,
+  const struct flush_tlb_info *info)
 {
-   PVOP_VCALL2(mmu.flush_tlb_others, cpumask, info);
+   PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
 }
 
 static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb, void 
*table)
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 70b654f3ffe5..63fa751344bf 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -206,8 +206,8 @@ struct pv_mmu_ops {
void (*flush_tlb_user)(void);
void (*flush_tlb_kernel)(void);
void (*flush_tlb_one_user)(unsigned long addr);
-   void (*flush_tlb_others)(const struct cpumask *cpus,
-const struct flush_tlb_info *info);
+   void (*flush_tlb_multi)(co

Re: [PATCH] mm/balloon_compaction: suppress allocation warnings

2019-08-21 Thread Nadav Amit via Virtualization
> On Aug 21, 2019, at 12:13 PM, David Hildenbrand  wrote:
> 
> On 21.08.19 18:34, Nadav Amit wrote:
>>> On Aug 21, 2019, at 9:29 AM, David Hildenbrand  wrote:
>>> 
>>> On 21.08.19 18:23, Nadav Amit wrote:
>>>>> On Aug 21, 2019, at 9:05 AM, David Hildenbrand  wrote:
>>>>> 
>>>>> On 20.08.19 11:16, Nadav Amit wrote:
>>>>>> There is no reason to print warnings when balloon page allocation fails,
>>>>>> as they are expected and can be handled gracefully.  Since VMware
>>>>>> balloon now uses balloon-compaction infrastructure, and suppressed these
>>>>>> warnings before, it is also beneficial to suppress these warnings to
>>>>>> keep the same behavior that the balloon had before.
>>>>> 
>>>>> I am not sure if that's a good idea. The allocation warnings are usually
>>>>> the only trace of "the user/admin did something bad because he/she tried
>>>>> to inflate the balloon to an unsafe value". Believe me, I processed a
>>>>> couple of such bugreports related to virtio-balloon and the warning were
>>>>> very helpful for that.
>>>> 
>>>> Ok, so a message is needed, but does it have to be a generic frightening
>>>> warning?
>>>> 
>>>> How about using __GFP_NOWARN, and if allocation do something like:
>>>> 
>>>> pr_warn(“Balloon memory allocation failed”);
>>>> 
>>>> Or even something more informative? This would surely be less intimidating
>>>> for common users.
>>> 
>>> ratelimit would make sense :)
>>> 
>>> And yes, this would certainly be nicer.
>> 
>> Thanks. I will post v2 of the patch.
> 
> As discussed in v2, we already print a warning in virtio-balloon, so I
> am fine with this patch.
> 
> Reviewed-by: David Hildenbrand 

Michael,

If it is possible to get it into 5.3, to avoid a behavioral change for VMware
balloon users, it would be great.

Thanks,
Nadav

Re: [PATCH v2] mm/balloon_compaction: Informative allocation warnings

2019-08-21 Thread Nadav Amit via Virtualization
> On Aug 21, 2019, at 12:12 PM, David Hildenbrand  wrote:
> 
> On 21.08.19 21:10, Nadav Amit wrote:
>>> On Aug 21, 2019, at 12:06 PM, David Hildenbrand  wrote:
>>> 
>>> On 21.08.19 20:59, Nadav Amit wrote:
>>>>> On Aug 21, 2019, at 11:57 AM, David Hildenbrand  wrote:
>>>>> 
>>>>> On 21.08.19 11:41, Nadav Amit wrote:
>>>>>> There is no reason to print generic warnings when balloon memory
>>>>>> allocation fails, as failures are expected and can be handled
>>>>>> gracefully. Since VMware balloon now uses balloon-compaction
>>>>>> infrastructure, and suppressed these warnings before, it is also
>>>>>> beneficial to suppress these warnings to keep the same behavior that the
>>>>>> balloon had before.
>>>>>> 
>>>>>> Since such warnings can still be useful to indicate that the balloon is
>>>>>> over-inflated, print more informative and less frightening warning if
>>>>>> allocation fails instead.
>>>>>> 
>>>>>> Cc: David Hildenbrand 
>>>>>> Cc: Jason Wang 
>>>>>> Signed-off-by: Nadav Amit 
>>>>>> 
>>>>>> ---
>>>>>> 
>>>>>> v1->v2:
>>>>>> * Print informative warnings instead suppressing [David]
>>>>>> ---
>>>>>> mm/balloon_compaction.c | 7 ++-
>>>>>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>>>>> 
>>>>>> diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
>>>>>> index 798275a51887..0c1d1f7689f0 100644
>>>>>> --- a/mm/balloon_compaction.c
>>>>>> +++ b/mm/balloon_compaction.c
>>>>>> @@ -124,7 +124,12 @@ EXPORT_SYMBOL_GPL(balloon_page_list_dequeue);
>>>>>> struct page *balloon_page_alloc(void)
>>>>>> {
>>>>>>  struct page *page = alloc_page(balloon_mapping_gfp_mask() |
>>>>>> -   __GFP_NOMEMALLOC | 
>>>>>> __GFP_NORETRY);
>>>>>> +   __GFP_NOMEMALLOC | __GFP_NORETRY 
>>>>>> |
>>>>>> +   __GFP_NOWARN);
>>>>>> +
>>>>>> +if (!page)
>>>>>> +pr_warn_ratelimited("memory balloon: memory allocation 
>>>>>> failed");
>>>>>> +
>>>>>>  return page;
>>>>>> }
>>>>>> EXPORT_SYMBOL_GPL(balloon_page_alloc);
>>>>> 
>>>>> Not sure if "memory balloon" is the right wording. hmmm.
>>>>> 
>>>>> Acked-by: David Hildenbrand 
>>>> 
>>>> Do you have a better suggestion?
>>> 
>>> Not really - that's why I ack'ed :)
>>> 
>>> However, thinking about it - what about moving the check + print to the
>>> caller and then using dev_warn... or sth. like simple "virtio_balloon:
>>> ..." ? You can then drop the warning for vmware balloon if you feel like
>>> not needing it.
>> 
>> Actually, there is already a warning that is printed by the virtue_balloon
>> in fill_balloon():
>> 
>>struct page *page = balloon_page_alloc();
>> 
>>if (!page) {
>>dev_info_ratelimited(>vdev->dev,
>> "Out of puff! Can't get %u 
>> pages\n",
>> VIRTIO_BALLOON_PAGES_PER_PAGE);
>>/* Sleep for at least 1/5 of a second before retry. */
>>msleep(200);
>>break;
>>}
>> 
>> So are you ok with going back to v1?
> 
> Whoops, I missed that - sorry - usually the warnings scream louder at me :D
> 
> Yes, v1 is fine with me!

Thanks, I missed this one too. This change should prevent making users
concerned for no good reason.



Re: [PATCH v2] mm/balloon_compaction: Informative allocation warnings

2019-08-21 Thread Nadav Amit via Virtualization
> On Aug 21, 2019, at 12:06 PM, David Hildenbrand  wrote:
> 
> On 21.08.19 20:59, Nadav Amit wrote:
>>> On Aug 21, 2019, at 11:57 AM, David Hildenbrand  wrote:
>>> 
>>> On 21.08.19 11:41, Nadav Amit wrote:
>>>> There is no reason to print generic warnings when balloon memory
>>>> allocation fails, as failures are expected and can be handled
>>>> gracefully. Since VMware balloon now uses balloon-compaction
>>>> infrastructure, and suppressed these warnings before, it is also
>>>> beneficial to suppress these warnings to keep the same behavior that the
>>>> balloon had before.
>>>> 
>>>> Since such warnings can still be useful to indicate that the balloon is
>>>> over-inflated, print more informative and less frightening warning if
>>>> allocation fails instead.
>>>> 
>>>> Cc: David Hildenbrand 
>>>> Cc: Jason Wang 
>>>> Signed-off-by: Nadav Amit 
>>>> 
>>>> ---
>>>> 
>>>> v1->v2:
>>>> * Print informative warnings instead suppressing [David]
>>>> ---
>>>> mm/balloon_compaction.c | 7 ++-
>>>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>>> 
>>>> diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
>>>> index 798275a51887..0c1d1f7689f0 100644
>>>> --- a/mm/balloon_compaction.c
>>>> +++ b/mm/balloon_compaction.c
>>>> @@ -124,7 +124,12 @@ EXPORT_SYMBOL_GPL(balloon_page_list_dequeue);
>>>> struct page *balloon_page_alloc(void)
>>>> {
>>>>struct page *page = alloc_page(balloon_mapping_gfp_mask() |
>>>> - __GFP_NOMEMALLOC | __GFP_NORETRY);
>>>> + __GFP_NOMEMALLOC | __GFP_NORETRY |
>>>> + __GFP_NOWARN);
>>>> +
>>>> +  if (!page)
>>>> +  pr_warn_ratelimited("memory balloon: memory allocation failed");
>>>> +
>>>>return page;
>>>> }
>>>> EXPORT_SYMBOL_GPL(balloon_page_alloc);
>>> 
>>> Not sure if "memory balloon" is the right wording. hmmm.
>>> 
>>> Acked-by: David Hildenbrand 
>> 
>> Do you have a better suggestion?
> 
> Not really - that's why I ack'ed :)
> 
> However, thinking about it - what about moving the check + print to the
> caller and then using dev_warn... or sth. like simple "virtio_balloon:
> ..." ? You can then drop the warning for vmware balloon if you feel like
> not needing it.

Actually, there is already a warning that is printed by the virtue_balloon
in fill_balloon():

struct page *page = balloon_page_alloc();

if (!page) {
dev_info_ratelimited(>vdev->dev,
 "Out of puff! Can't get %u 
pages\n",
 VIRTIO_BALLOON_PAGES_PER_PAGE);
/* Sleep for at least 1/5 of a second before retry. */
msleep(200);
break;
}

So are you ok with going back to v1?



Re: [PATCH v2] mm/balloon_compaction: Informative allocation warnings

2019-08-21 Thread Nadav Amit via Virtualization
> On Aug 21, 2019, at 11:57 AM, David Hildenbrand  wrote:
> 
> On 21.08.19 11:41, Nadav Amit wrote:
>> There is no reason to print generic warnings when balloon memory
>> allocation fails, as failures are expected and can be handled
>> gracefully. Since VMware balloon now uses balloon-compaction
>> infrastructure, and suppressed these warnings before, it is also
>> beneficial to suppress these warnings to keep the same behavior that the
>> balloon had before.
>> 
>> Since such warnings can still be useful to indicate that the balloon is
>> over-inflated, print more informative and less frightening warning if
>> allocation fails instead.
>> 
>> Cc: David Hildenbrand 
>> Cc: Jason Wang 
>> Signed-off-by: Nadav Amit 
>> 
>> ---
>> 
>> v1->v2:
>>  * Print informative warnings instead suppressing [David]
>> ---
>> mm/balloon_compaction.c | 7 ++-
>> 1 file changed, 6 insertions(+), 1 deletion(-)
>> 
>> diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
>> index 798275a51887..0c1d1f7689f0 100644
>> --- a/mm/balloon_compaction.c
>> +++ b/mm/balloon_compaction.c
>> @@ -124,7 +124,12 @@ EXPORT_SYMBOL_GPL(balloon_page_list_dequeue);
>> struct page *balloon_page_alloc(void)
>> {
>>  struct page *page = alloc_page(balloon_mapping_gfp_mask() |
>> -   __GFP_NOMEMALLOC | __GFP_NORETRY);
>> +   __GFP_NOMEMALLOC | __GFP_NORETRY |
>> +   __GFP_NOWARN);
>> +
>> +if (!page)
>> +pr_warn_ratelimited("memory balloon: memory allocation failed");
>> +
>>  return page;
>> }
>> EXPORT_SYMBOL_GPL(balloon_page_alloc);
> 
> Not sure if "memory balloon" is the right wording. hmmm.
> 
> Acked-by: David Hildenbrand 

Do you have a better suggestion?



[PATCH v2] mm/balloon_compaction: Informative allocation warnings

2019-08-21 Thread Nadav Amit via Virtualization
There is no reason to print generic warnings when balloon memory
allocation fails, as failures are expected and can be handled
gracefully. Since VMware balloon now uses balloon-compaction
infrastructure, and suppressed these warnings before, it is also
beneficial to suppress these warnings to keep the same behavior that the
balloon had before.

Since such warnings can still be useful to indicate that the balloon is
over-inflated, print a more informative and less frightening warning if
allocation fails instead.

Cc: David Hildenbrand 
Cc: Jason Wang 
Signed-off-by: Nadav Amit 

---

v1->v2:
  * Print informative warnings instead suppressing [David]
---
 mm/balloon_compaction.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index 798275a51887..0c1d1f7689f0 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -124,7 +124,12 @@ EXPORT_SYMBOL_GPL(balloon_page_list_dequeue);
 struct page *balloon_page_alloc(void)
 {
struct page *page = alloc_page(balloon_mapping_gfp_mask() |
-  __GFP_NOMEMALLOC | __GFP_NORETRY);
+  __GFP_NOMEMALLOC | __GFP_NORETRY |
+  __GFP_NOWARN);
+
+   if (!page)
+   pr_warn_ratelimited("memory balloon: memory allocation failed");
+
return page;
 }
 EXPORT_SYMBOL_GPL(balloon_page_alloc);
-- 
2.17.1



Re: [PATCH] mm/balloon_compaction: suppress allocation warnings

2019-08-21 Thread Nadav Amit via Virtualization
> On Aug 21, 2019, at 9:29 AM, David Hildenbrand  wrote:
> 
> On 21.08.19 18:23, Nadav Amit wrote:
>>> On Aug 21, 2019, at 9:05 AM, David Hildenbrand  wrote:
>>> 
>>> On 20.08.19 11:16, Nadav Amit wrote:
>>>> There is no reason to print warnings when balloon page allocation fails,
>>>> as they are expected and can be handled gracefully.  Since VMware
>>>> balloon now uses balloon-compaction infrastructure, and suppressed these
>>>> warnings before, it is also beneficial to suppress these warnings to
>>>> keep the same behavior that the balloon had before.
>>> 
>>> I am not sure if that's a good idea. The allocation warnings are usually
>>> the only trace of "the user/admin did something bad because he/she tried
>>> to inflate the balloon to an unsafe value". Believe me, I processed a
>>> couple of such bugreports related to virtio-balloon and the warning were
>>> very helpful for that.
>> 
>> Ok, so a message is needed, but does it have to be a generic frightening
>> warning?
>> 
>> How about using __GFP_NOWARN, and if allocation do something like:
>> 
>>  pr_warn(“Balloon memory allocation failed”);
>> 
>> Or even something more informative? This would surely be less intimidating
>> for common users.
> 
> ratelimit would make sense :)
> 
> And yes, this would certainly be nicer.

Thanks. I will post v2 of the patch.


Re: [PATCH] mm/balloon_compaction: suppress allocation warnings

2019-08-21 Thread Nadav Amit via Virtualization
> On Aug 21, 2019, at 9:05 AM, David Hildenbrand  wrote:
> 
> On 20.08.19 11:16, Nadav Amit wrote:
>> There is no reason to print warnings when balloon page allocation fails,
>> as they are expected and can be handled gracefully.  Since VMware
>> balloon now uses balloon-compaction infrastructure, and suppressed these
>> warnings before, it is also beneficial to suppress these warnings to
>> keep the same behavior that the balloon had before.
> 
> I am not sure if that's a good idea. The allocation warnings are usually
> the only trace of "the user/admin did something bad because he/she tried
> to inflate the balloon to an unsafe value". Believe me, I processed a
> couple of such bugreports related to virtio-balloon and the warning were
> very helpful for that.

Ok, so a message is needed, but does it have to be a generic frightening
warning?

How about using __GFP_NOWARN, and if allocation fails do something like:

  pr_warn(“Balloon memory allocation failed”);

Or even something more informative? This would surely be less intimidating
for common users.


[PATCH] mm/balloon_compaction: suppress allocation warnings

2019-08-20 Thread Nadav Amit via Virtualization
There is no reason to print warnings when balloon page allocation fails,
as they are expected and can be handled gracefully.  Since VMware
balloon now uses balloon-compaction infrastructure, and suppressed these
warnings before, it is also beneficial to suppress these warnings to
keep the same behavior that the balloon had before.

Cc: Jason Wang 
Signed-off-by: Nadav Amit 
---
 mm/balloon_compaction.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index 798275a51887..26de020aae7b 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -124,7 +124,8 @@ EXPORT_SYMBOL_GPL(balloon_page_list_dequeue);
 struct page *balloon_page_alloc(void)
 {
struct page *page = alloc_page(balloon_mapping_gfp_mask() |
-  __GFP_NOMEMALLOC | __GFP_NORETRY);
+  __GFP_NOMEMALLOC | __GFP_NORETRY |
+  __GFP_NOWARN);
return page;
 }
 EXPORT_SYMBOL_GPL(balloon_page_alloc);
-- 
2.19.1



Re: [PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-07-22 Thread Nadav Amit via Virtualization
> On Jul 22, 2019, at 12:14 PM, Peter Zijlstra  wrote:
> 
> On Thu, Jul 18, 2019 at 05:58:32PM -0700, Nadav Amit wrote:
>> @@ -709,8 +716,9 @@ void native_flush_tlb_others(const struct cpumask 
>> *cpumask,
>>   * doing a speculative memory access.
>>   */
>>  if (info->freed_tables) {
>> -smp_call_function_many(cpumask, flush_tlb_func_remote,
>> -   (void *)info, 1);
>> +__smp_call_function_many(cpumask, flush_tlb_func_remote,
>> + flush_tlb_func_local,
>> + (void *)info, 1);
>>  } else {
>>  /*
>>   * Although we could have used on_each_cpu_cond_mask(),
>> @@ -737,7 +745,8 @@ void native_flush_tlb_others(const struct cpumask 
>> *cpumask,
>>  if (tlb_is_not_lazy(cpu))
>>  __cpumask_set_cpu(cpu, cond_cpumask);
>>  }
>> -smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
>> +__smp_call_function_many(cond_cpumask, flush_tlb_func_remote,
>> + flush_tlb_func_local,
>>   (void *)info, 1);
>>  }
>> }
> 
> Do we really need that _local/_remote distinction? ISTR you had a patch
> that frobbed flush_tlb_info into the csd and that gave space
> constraints, but I'm not seeing that here (probably a wise, get stuff
> merged etc..).
> 
> struct __call_single_data {
>struct llist_node  llist;/* 0 8 */
>smp_call_func_tfunc; /* 8 8 */
>void * info; /*16 8 */
>unsigned int   flags;/*24 4 */
> 
>/* size: 32, cachelines: 1, members: 4 */
>/* padding: 4 */
>/* last cacheline: 32 bytes */
> };
> 
> struct flush_tlb_info {
>struct mm_struct * mm;   /* 0 8 */
>long unsigned int  start;/* 8 8 */
>long unsigned int  end;  /*16 8 */
>u64new_tlb_gen;  /*24 8 */
>unsigned int   stride_shift; /*32 4 */
>bool   freed_tables; /*36 1 */
> 
>/* size: 40, cachelines: 1, members: 6 */
>/* padding: 3 */
>/* last cacheline: 40 bytes */
> };
> 
> IIRC what you did was make void *__call_single_data::info the last
> member and a union until the full cacheline size (64). Given the above
> that would get us 24 bytes for csd, leaving us 40 for that
> flush_tlb_info.
> 
> But then we can still do something like the below, which doesn't change
> things and still gets rid of that dual function crud, simplifying
> smp_call_function_many again.
> 
> Index: linux-2.6/arch/x86/include/asm/tlbflush.h
> ===
> --- linux-2.6.orig/arch/x86/include/asm/tlbflush.h
> +++ linux-2.6/arch/x86/include/asm/tlbflush.h
> @@ -546,8 +546,9 @@ struct flush_tlb_info {
>   unsigned long   start;
>   unsigned long   end;
>   u64 new_tlb_gen;
> - unsigned intstride_shift;
> - boolfreed_tables;
> + unsigned intcpu;
> + unsigned short  stride_shift;
> + unsigned char   freed_tables;
> };
> 
> #define local_flush_tlb() __flush_tlb()
> Index: linux-2.6/arch/x86/mm/tlb.c
> ===
> --- linux-2.6.orig/arch/x86/mm/tlb.c
> +++ linux-2.6/arch/x86/mm/tlb.c
> @@ -659,6 +659,27 @@ static void flush_tlb_func_remote(void *
>   flush_tlb_func_common(f, false, TLB_REMOTE_SHOOTDOWN);
> }
> 
> +static void flush_tlb_func(void *info)
> +{
> + const struct flush_tlb_info *f = info;
> + enum tlb_flush_reason reason = TLB_REMOTE_SHOOTDOWN;
> + bool local = false;
> +
> + if (f->cpu == smp_processor_id()) {
> + local = true;
> + reason = (f->mm == NULL) ? TLB_LOCAL_SHOOTDOWN : 
> TLB_LOCAL_MM_SHOOTDOWN;
> + } else {
> + inc_irq_stat(irq_tlb_count);
> +
> + if (f->mm && f->mm != this_cpu_read(cpu_tlbstate.loaded_mm))
> + return;
> +
> + count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
> + }
> +
> + flush_tlb_func_common(f, local, 

[PATCH v3 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-07-18 Thread Nadav Amit via Virtualization
To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. Introduce
paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
and hyper-v are only compile-tested).

While the updated smp infrastructure is capable of running a function on
a single local core, it is not optimized for this case. The multiple
function calls and the indirect branch introduce some overhead, and
might make local TLB flushes slower than they were before the recent
changes.

Before calling the SMP infrastructure, check if only a local TLB flush
is needed to restore the lost performance in this common case. This
requires checking mm_cpumask() one more time, but unless this mask is
updated very frequently, this should not impact performance negatively.

Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Sasha Levin 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: Juergen Gross 
Cc: Paolo Bonzini 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Boris Ostrovsky 
Cc: linux-hyp...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: k...@vger.kernel.org
Cc: xen-de...@lists.xenproject.org
Signed-off-by: Nadav Amit 
---
 arch/x86/hyperv/mmu.c | 10 +++---
 arch/x86/include/asm/paravirt.h   |  6 ++--
 arch/x86/include/asm/paravirt_types.h |  4 +--
 arch/x86/include/asm/tlbflush.h   |  8 ++---
 arch/x86/include/asm/trace/hyperv.h   |  2 +-
 arch/x86/kernel/kvm.c | 11 +--
 arch/x86/kernel/paravirt.c|  2 +-
 arch/x86/mm/tlb.c | 47 ++-
 arch/x86/xen/mmu_pv.c | 11 +++
 include/trace/events/xen.h|  2 +-
 10 files changed, 62 insertions(+), 41 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index e65d7fe6489f..8740d8b21db3 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -50,8 +50,8 @@ static inline int fill_gva_list(u64 gva_list[], int offset,
return gva_n - offset;
 }
 
-static void hyperv_flush_tlb_others(const struct cpumask *cpus,
-   const struct flush_tlb_info *info)
+static void hyperv_flush_tlb_multi(const struct cpumask *cpus,
+  const struct flush_tlb_info *info)
 {
int cpu, vcpu, gva_n, max_gvas;
struct hv_tlb_flush **flush_pcpu;
@@ -59,7 +59,7 @@ static void hyperv_flush_tlb_others(const struct cpumask 
*cpus,
u64 status = U64_MAX;
unsigned long flags;
 
-   trace_hyperv_mmu_flush_tlb_others(cpus, info);
+   trace_hyperv_mmu_flush_tlb_multi(cpus, info);
 
if (!hv_hypercall_pg)
goto do_native;
@@ -156,7 +156,7 @@ static void hyperv_flush_tlb_others(const struct cpumask 
*cpus,
if (!(status & HV_HYPERCALL_RESULT_MASK))
return;
 do_native:
-   native_flush_tlb_others(cpus, info);
+   native_flush_tlb_multi(cpus, info);
 }
 
 static u64 hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
@@ -231,6 +231,6 @@ void hyperv_setup_mmu_ops(void)
return;
 
pr_info("Using hypercall for remote TLB flush\n");
-   pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
+   pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index dce26f1d13e1..8c6c2394393b 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -62,10 +62,10 @@ static inline void __flush_tlb_one_user(unsigned long addr)
PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
 }
 
-static inline void flush_tlb_others(const struct cpumask *cpumask,
-   const struct flush_tlb_info *info)
+static inline void flush_tlb_multi(const struct cpumask *cpumask,
+  const struct flush_tlb_info *info)
 {
-   PVOP_VCALL2(mmu.flush_tlb_others, cpumask, info);
+   PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
 }
 
 static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb, void 
*table)
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 639b2df445ee..c82969f38845 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -211,8 +211,8 @@ struct pv_mmu_ops {
void (*flush_tlb_user)(void);
void (*flush_tlb_kernel)(void);
void (*flush_tlb_one_user)(unsigned long addr);
-   void (*flush_tlb_others)(const struct cpumask *cpus,
-const struct flush_tlb_info *info);
+   void (*flush_tlb_multi)(const struct cpumask *cpus,
+   const struct flush_tlb_info *info);
 
void (*tlb_remove_table)(struct mmu_

[PATCH v3 0/9] x86: Concurrent TLB flushes

2019-07-18 Thread Nadav Amit via Virtualization
[ Cover-letter is identical to v2, including benchmark results,
  excluding the change log. ] 

Currently, local and remote TLB flushes are not performed concurrently,
which introduces unnecessary overhead - each INVLPG can take 100s of
cycles. This patch-set allows TLB flushes to be run concurrently: first
request the remote CPUs to initiate the flush, then run it locally, and
finally wait for the remote CPUs to finish their work.

In addition, there are various small optimizations to avoid unwarranted
false-sharing and atomic operations.

The proposed changes should also improve the performance of other
invocations of on_each_cpu(). Hopefully, no one has relied on this
behavior of on_each_cpu() that invoked functions first remotely and only
then locally [Peter says he remembers someone might do so, but without
further information it is hard to know how to address it].

Running sysbench on dax/ext4 w/emulated-pmem, write-cache disabled on
2-socket, 48-logical-cores (24+SMT) Haswell-X, 5 repetitions:

 sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
  --file-io-mode=mmap --threads=X --file-fsync-mode=fdatasync run

  Th.   tip-jun28 avg (stdev)   +patch-set avg (stdev)  change
  ---   -   --  --
  1 1267765 (14146) 1299253 (5715)  +2.4%
  2 1734644 (11936) 1799225 (19577) +3.7%
  4 2821268 (41184) 2919132 (40149) +3.4%
  8 4171652 (31243) 4376925 (65416) +4.9%
  165590729 (24160) 5829866 (8127)  +4.2%
  246250212 (24481) 6522303 (28044) +4.3%
  323994314 (26606) 4077543 (10685) +2.0%
  484345177 (28091) 4417821 (41337) +1.6%

(Note that on configurations with up to 24 threads numactl was used to
set all threads on socket 1, which explains the drop in performance when
going to 32 threads).

Running the same benchmark with security mitigations disabled (PTI,
Spectre, MDS):

  Th.   tip-jun28 avg (stdev)   +patch-set avg (stdev)  change
  ---   -   --  --
  1 1598896 (5174)  1607903 (4091)  +0.5%
  2 2109472 (17827) 2224726 (4372)  +5.4%
  4 3448587 (11952) 3668551 (30219) +6.3%
  8 5425778 (29641) 5606266 (33519) +3.3%
  166931232 (34677) 7054052 (27873) +1.7%
  247612473 (23482) 7783138 (13871) +2.2%
  324296274 (18029) 4283279 (32323) -0.3%
  484770029 (35541) 4764760 (13575) -0.1%

Presumably, PTI requires two invalidations of each mapping, which allows
to get higher benefits from concurrency when PTI is on. At the same
time, when mitigations are on, other overheads reduce the potential
speedup.

I tried to reduce the size of the code of the main patch, which required
restructuring of the series.

v2 -> v3:
* Open-code the remote/local-flush decision code [Andy]
* Fix hyper-v, Xen implementations [Andrew]
* Fix redundant TLB flushes.

v1 -> v2:
* Removing the patches that Thomas took [tglx]
* Adding hyper-v, Xen compile-tested implementations [Dave]
* Removing UV [Andy]
* Adding lazy optimization, removing inline keyword [Dave]
* Restructuring patch-set

RFCv2 -> v1:
* Fix comment on flush_tlb_multi [Juergen]
* Removing async invalidation optimizations [Andy]
* Adding KVM support [Paolo]

Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Boris Ostrovsky 
Cc: Dave Hansen 
Cc: Haiyang Zhang 
Cc: Ingo Molnar 
Cc: Josh Poimboeuf 
Cc: Juergen Gross 
Cc: "K. Y. Srinivasan" 
Cc: Paolo Bonzini 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Sasha Levin 
Cc: Stephen Hemminger 
Cc: Thomas Gleixner 
Cc: k...@vger.kernel.org
Cc: linux-hyp...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: x...@kernel.org
Cc: xen-de...@lists.xenproject.org

Nadav Amit (9):
  smp: Run functions concurrently in smp_call_function_many()
  x86/mm/tlb: Remove reason as argument for flush_tlb_func_local()
  x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
  x86/mm/tlb: Flush remote and local TLBs concurrently
  x86/mm/tlb: Privatize cpu_tlbstate
  x86/mm/tlb: Do not make is_lazy dirty for no reason
  cpumask: Mark functions as pure
  x86/mm/tlb: Remove UV special case
  x86/mm/tlb: Remove unnecessary uses of the inline keyword

 arch/x86/hyperv/mmu.c |  10 +-
 arch/x86/include/asm/paravirt.h   |   6 +-
 arch/x86/include/asm/paravirt_types.h |   4 +-
 arch/x86/include/asm/tlbflush.h   |  47 -
 arch/x86/include/asm/trace/hyperv.h   |   2 +-
 arch/x86/kernel/kvm.c |  11 ++-
 arch/x86/kernel/paravirt.c|   2 +-
 arch/x86/mm/init.c|   2 +-
 arch/x86/mm/tlb.c | 133 --
 arch/x86/xen/mmu_pv.c |  11 +--
 incl

Re: [PATCH v2] x86/paravirt: Drop {read,write}_cr8() hooks

2019-07-15 Thread Nadav Amit via Virtualization
> On Jul 15, 2019, at 4:30 PM, Andrew Cooper  wrote:
> 
> On 15/07/2019 19:17, Nadav Amit wrote:
>>> On Jul 15, 2019, at 8:16 AM, Andrew Cooper  
>>> wrote:
>>> 
>>> There is a lot of infrastructure for functionality which is used
>>> exclusively in __{save,restore}_processor_state() on the suspend/resume
>>> path.
>>> 
>>> cr8 is an alias of APIC_TASKPRI, and APIC_TASKPRI is saved/restored by
>>> lapic_{suspend,resume}().  Saving and restoring cr8 independently of the
>>> rest of the Local APIC state isn't a clever thing to be doing.
>>> 
>>> Delete the suspend/resume cr8 handling, which shrinks the size of struct
>>> saved_context, and allows for the removal of both PVOPS.
>> I think removing the interface for CR8 writes is also good to avoid
>> potential correctness issues, as the SDM says (10.8.6.1 "Interaction of Task
>> Priorities between CR8 and APIC”):
>> 
>> "Operating software should implement either direct APIC TPR updates or CR8
>> style TPR updates but not mix them. Software can use a serializing
>> instruction (for example, CPUID) to serialize updates between MOV CR8 and
>> stores to the APIC.”
>> 
>> And native_write_cr8() did not even issue a serializing instruction.
> 
> Given its location, the one write_cr8() is bounded by two serialising
> operations, so is safe in practice.

That’s what the “potential” in "potential correctness issues” means :)


Re: [PATCH v2] x86/paravirt: Drop {read,write}_cr8() hooks

2019-07-15 Thread Nadav Amit via Virtualization
> On Jul 15, 2019, at 8:16 AM, Andrew Cooper  wrote:
> 
> There is a lot of infrastructure for functionality which is used
> exclusively in __{save,restore}_processor_state() on the suspend/resume
> path.
> 
> cr8 is an alias of APIC_TASKPRI, and APIC_TASKPRI is saved/restored by
> lapic_{suspend,resume}().  Saving and restoring cr8 independently of the
> rest of the Local APIC state isn't a clever thing to be doing.
> 
> Delete the suspend/resume cr8 handling, which shrinks the size of struct
> saved_context, and allows for the removal of both PVOPS.

I think removing the interface for CR8 writes is also good to avoid
potential correctness issues, as the SDM says (10.8.6.1 "Interaction of Task
Priorities between CR8 and APIC”):

"Operating software should implement either direct APIC TPR updates or CR8
style TPR updates but not mix them. Software can use a serializing
instruction (for example, CPUID) to serialize updates between MOV CR8 and
stores to the APIC.”

And native_write_cr8() did not even issue a serializing instruction.
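
To make the quoted guidance concrete: if software really did mix the two
update styles, something like the purely illustrative sequence below would
be needed (this is not actual kernel code; apic_write() and cpuid_eax()
come from <asm/apic.h> and <asm/processor.h>):

static inline void mixed_tpr_update(unsigned long cr8_val, u32 tpr_val)
{
        asm volatile("mov %0, %%cr8" : : "r" (cr8_val) : "memory");
        cpuid_eax(0);           /* serializing instruction, per SDM 10.8.6.1 */
        apic_write(APIC_TASKPRI, tpr_val);
}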


Re: [Xen-devel] [PATCH v2 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-07-03 Thread Nadav Amit via Virtualization
> On Jul 3, 2019, at 10:43 AM, Andrew Cooper  wrote:
> 
> On 03/07/2019 18:02, Nadav Amit wrote:
>>> On Jul 3, 2019, at 7:04 AM, Juergen Gross  wrote:
>>> 
>>> On 03.07.19 01:51, Nadav Amit wrote:
>>>> To improve TLB shootdown performance, flush the remote and local TLBs
>>>> concurrently. Introduce flush_tlb_multi() that does so. Introduce
>>>> paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
>>>> and hyper-v are only compile-tested).
>>>> While the updated smp infrastructure is capable of running a function on
>>>> a single local core, it is not optimized for this case. The multiple
>>>> function calls and the indirect branch introduce some overhead, and
>>>> might make local TLB flushes slower than they were before the recent
>>>> changes.
>>>> Before calling the SMP infrastructure, check if only a local TLB flush
>>>> is needed to restore the lost performance in this common case. This
>>>> requires to check mm_cpumask() one more time, but unless this mask is
>>>> updated very frequently, this should impact performance negatively.
>>>> Cc: "K. Y. Srinivasan" 
>>>> Cc: Haiyang Zhang 
>>>> Cc: Stephen Hemminger 
>>>> Cc: Sasha Levin 
>>>> Cc: Thomas Gleixner 
>>>> Cc: Ingo Molnar 
>>>> Cc: Borislav Petkov 
>>>> Cc: x...@kernel.org
>>>> Cc: Juergen Gross 
>>>> Cc: Paolo Bonzini 
>>>> Cc: Dave Hansen 
>>>> Cc: Andy Lutomirski 
>>>> Cc: Peter Zijlstra 
>>>> Cc: Boris Ostrovsky 
>>>> Cc: linux-hyp...@vger.kernel.org
>>>> Cc: linux-ker...@vger.kernel.org
>>>> Cc: virtualization@lists.linux-foundation.org
>>>> Cc: k...@vger.kernel.org
>>>> Cc: xen-de...@lists.xenproject.org
>>>> Signed-off-by: Nadav Amit 
>>>> ---
>>>> arch/x86/hyperv/mmu.c | 13 +++---
>>>> arch/x86/include/asm/paravirt.h   |  6 +--
>>>> arch/x86/include/asm/paravirt_types.h |  4 +-
>>>> arch/x86/include/asm/tlbflush.h   |  9 ++--
>>>> arch/x86/include/asm/trace/hyperv.h   |  2 +-
>>>> arch/x86/kernel/kvm.c | 11 +++--
>>>> arch/x86/kernel/paravirt.c|  2 +-
>>>> arch/x86/mm/tlb.c | 65 ---
>>>> arch/x86/xen/mmu_pv.c | 20 ++---
>>>> include/trace/events/xen.h|  2 +-
>>>> 10 files changed, 91 insertions(+), 43 deletions(-)
>>> ...
>>> 
>>>> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
>>>> index beb44e22afdf..19e481e6e904 100644
>>>> --- a/arch/x86/xen/mmu_pv.c
>>>> +++ b/arch/x86/xen/mmu_pv.c
>>>> @@ -1355,8 +1355,8 @@ static void xen_flush_tlb_one_user(unsigned long 
>>>> addr)
>>>>preempt_enable();
>>>> }
>>>> -static void xen_flush_tlb_others(const struct cpumask *cpus,
>>>> -   const struct flush_tlb_info *info)
>>>> +static void xen_flush_tlb_multi(const struct cpumask *cpus,
>>>> +  const struct flush_tlb_info *info)
>>>> {
>>>>struct {
>>>>struct mmuext_op op;
>>>> @@ -1366,7 +1366,7 @@ static void xen_flush_tlb_others(const struct 
>>>> cpumask *cpus,
>>>>const size_t mc_entry_size = sizeof(args->op) +
>>>>sizeof(args->mask[0]) * BITS_TO_LONGS(num_possible_cpus());
>>>> -  trace_xen_mmu_flush_tlb_others(cpus, info->mm, info->start, info->end);
>>>> +  trace_xen_mmu_flush_tlb_multi(cpus, info->mm, info->start, info->end);
>>>>if (cpumask_empty(cpus))
>>>>return; /* nothing to do */
>>>> @@ -1375,9 +1375,17 @@ static void xen_flush_tlb_others(const struct 
>>>> cpumask *cpus,
>>>>args = mcs.args;
>>>>args->op.arg2.vcpumask = to_cpumask(args->mask);
>>>> -  /* Remove us, and any offline CPUS. */
>>>> +  /* Flush locally if needed and remove us */
>>>> +  if (cpumask_test_cpu(smp_processor_id(), to_cpumask(args->mask))) {
>>>> +  local_irq_disable();
>>>> +  flush_tlb_func_local(info);
>>> I think this isn't the correct function for PV guests.
>>> 
>>> In fact it should be much easier: just don't clear the own cpu 

Re: [PATCH v2 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-07-03 Thread Nadav Amit via Virtualization
> On Jul 3, 2019, at 7:04 AM, Juergen Gross  wrote:
> 
> On 03.07.19 01:51, Nadav Amit wrote:
>> To improve TLB shootdown performance, flush the remote and local TLBs
>> concurrently. Introduce flush_tlb_multi() that does so. Introduce
>> paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
>> and hyper-v are only compile-tested).
>> While the updated smp infrastructure is capable of running a function on
>> a single local core, it is not optimized for this case. The multiple
>> function calls and the indirect branch introduce some overhead, and
>> might make local TLB flushes slower than they were before the recent
>> changes.
>> Before calling the SMP infrastructure, check if only a local TLB flush
>> is needed to restore the lost performance in this common case. This
>> requires checking mm_cpumask() one more time, but unless this mask is
>> updated very frequently, this should not impact performance negatively.
>> Cc: "K. Y. Srinivasan" 
>> Cc: Haiyang Zhang 
>> Cc: Stephen Hemminger 
>> Cc: Sasha Levin 
>> Cc: Thomas Gleixner 
>> Cc: Ingo Molnar 
>> Cc: Borislav Petkov 
>> Cc: x...@kernel.org
>> Cc: Juergen Gross 
>> Cc: Paolo Bonzini 
>> Cc: Dave Hansen 
>> Cc: Andy Lutomirski 
>> Cc: Peter Zijlstra 
>> Cc: Boris Ostrovsky 
>> Cc: linux-hyp...@vger.kernel.org
>> Cc: linux-ker...@vger.kernel.org
>> Cc: virtualization@lists.linux-foundation.org
>> Cc: k...@vger.kernel.org
>> Cc: xen-de...@lists.xenproject.org
>> Signed-off-by: Nadav Amit 
>> ---
>>  arch/x86/hyperv/mmu.c | 13 +++---
>>  arch/x86/include/asm/paravirt.h   |  6 +--
>>  arch/x86/include/asm/paravirt_types.h |  4 +-
>>  arch/x86/include/asm/tlbflush.h   |  9 ++--
>>  arch/x86/include/asm/trace/hyperv.h   |  2 +-
>>  arch/x86/kernel/kvm.c | 11 +++--
>>  arch/x86/kernel/paravirt.c|  2 +-
>>  arch/x86/mm/tlb.c | 65 ---
>>  arch/x86/xen/mmu_pv.c | 20 ++---
>>  include/trace/events/xen.h|  2 +-
>>  10 files changed, 91 insertions(+), 43 deletions(-)
> 
> ...
> 
>> diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
>> index beb44e22afdf..19e481e6e904 100644
>> --- a/arch/x86/xen/mmu_pv.c
>> +++ b/arch/x86/xen/mmu_pv.c
>> @@ -1355,8 +1355,8 @@ static void xen_flush_tlb_one_user(unsigned long addr)
>>  preempt_enable();
>>  }
>>  -static void xen_flush_tlb_others(const struct cpumask *cpus,
>> - const struct flush_tlb_info *info)
>> +static void xen_flush_tlb_multi(const struct cpumask *cpus,
>> +const struct flush_tlb_info *info)
>>  {
>>  struct {
>>  struct mmuext_op op;
>> @@ -1366,7 +1366,7 @@ static void xen_flush_tlb_others(const struct cpumask 
>> *cpus,
>>  const size_t mc_entry_size = sizeof(args->op) +
>>  sizeof(args->mask[0]) * BITS_TO_LONGS(num_possible_cpus());
>>  -   trace_xen_mmu_flush_tlb_others(cpus, info->mm, info->start, info->end);
>> +trace_xen_mmu_flush_tlb_multi(cpus, info->mm, info->start, info->end);
>>  if (cpumask_empty(cpus))
>>  return; /* nothing to do */
>> @@ -1375,9 +1375,17 @@ static void xen_flush_tlb_others(const struct cpumask 
>> *cpus,
>>  args = mcs.args;
>>  args->op.arg2.vcpumask = to_cpumask(args->mask);
>>  -   /* Remove us, and any offline CPUS. */
>> +/* Flush locally if needed and remove us */
>> +if (cpumask_test_cpu(smp_processor_id(), to_cpumask(args->mask))) {
>> +local_irq_disable();
>> +flush_tlb_func_local(info);
> 
> I think this isn't the correct function for PV guests.
> 
> In fact it should be much easier: just don't clear the own cpu from the
> mask, that's all what's needed. The hypervisor is just fine having the
> current cpu in the mask and it will do the right thing.

Thanks. I will do so in v3. I don’t think the Hyper-V people would want to do
the same, unfortunately, since it would induce a VM-exit on TLB flushes. But
if they do, I will not need to expose flush_tlb_func_local().

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v2 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-07-03 Thread Nadav Amit via Virtualization
To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. Introduce
paravirtual versions of flush_tlb_multi() for KVM, Xen and hyper-v (Xen
and hyper-v are only compile-tested).

While the updated smp infrastructure is capable of running a function on
a single local core, it is not optimized for this case. The multiple
function calls and the indirect branch introduce some overhead, and
might make local TLB flushes slower than they were before the recent
changes.

Before calling the SMP infrastructure, check if only a local TLB flush
is needed to restore the lost performance in this common case. This
requires checking mm_cpumask() one more time, but unless this mask is
updated very frequently, this should not impact performance negatively.
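
Roughly, the shortcut looks like this (a simplified sketch;
flush_tlb_func_local() and flush_tlb_multi() are the interfaces from this
series, but the surrounding logic here is illustrative rather than the actual
tlb.c change):

static void example_flush_tlb_on_cpus(const struct cpumask *cpumask,
                                      const struct flush_tlb_info *info)
{
        int this_cpu = smp_processor_id();

        /* Common case: no other CPU holds the mm - flush locally only. */
        if (cpumask_any_but(cpumask, this_cpu) >= nr_cpu_ids) {
                if (cpumask_test_cpu(this_cpu, cpumask)) {
                        local_irq_disable();
                        flush_tlb_func_local(info);
                        local_irq_enable();
                }
                return;
        }

        /* Otherwise, flush the remote and local TLBs concurrently. */
        flush_tlb_multi(cpumask, info);
}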

Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Sasha Levin 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: Juergen Gross 
Cc: Paolo Bonzini 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Boris Ostrovsky 
Cc: linux-hyp...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: k...@vger.kernel.org
Cc: xen-de...@lists.xenproject.org
Signed-off-by: Nadav Amit 
---
 arch/x86/hyperv/mmu.c | 13 +++---
 arch/x86/include/asm/paravirt.h   |  6 +--
 arch/x86/include/asm/paravirt_types.h |  4 +-
 arch/x86/include/asm/tlbflush.h   |  9 ++--
 arch/x86/include/asm/trace/hyperv.h   |  2 +-
 arch/x86/kernel/kvm.c | 11 +++--
 arch/x86/kernel/paravirt.c|  2 +-
 arch/x86/mm/tlb.c | 65 ---
 arch/x86/xen/mmu_pv.c | 20 ++---
 include/trace/events/xen.h|  2 +-
 10 files changed, 91 insertions(+), 43 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index e65d7fe6489f..1177f863e4cd 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -50,8 +50,8 @@ static inline int fill_gva_list(u64 gva_list[], int offset,
return gva_n - offset;
 }
 
-static void hyperv_flush_tlb_others(const struct cpumask *cpus,
-   const struct flush_tlb_info *info)
+static void hyperv_flush_tlb_multi(const struct cpumask *cpus,
+  const struct flush_tlb_info *info)
 {
int cpu, vcpu, gva_n, max_gvas;
struct hv_tlb_flush **flush_pcpu;
@@ -59,7 +59,7 @@ static void hyperv_flush_tlb_others(const struct cpumask 
*cpus,
u64 status = U64_MAX;
unsigned long flags;
 
-   trace_hyperv_mmu_flush_tlb_others(cpus, info);
+   trace_hyperv_mmu_flush_tlb_multi(cpus, info);
 
if (!hv_hypercall_pg)
goto do_native;
@@ -69,6 +69,9 @@ static void hyperv_flush_tlb_others(const struct cpumask 
*cpus,
 
local_irq_save(flags);
 
+   if (cpumask_test_cpu(smp_processor_id(), cpus))
+   flush_tlb_func_local(info);
+
flush_pcpu = (struct hv_tlb_flush **)
 this_cpu_ptr(hyperv_pcpu_input_arg);
 
@@ -156,7 +159,7 @@ static void hyperv_flush_tlb_others(const struct cpumask 
*cpus,
if (!(status & HV_HYPERCALL_RESULT_MASK))
return;
 do_native:
-   native_flush_tlb_others(cpus, info);
+   native_flush_tlb_multi(cpus, info);
 }
 
 static u64 hyperv_flush_tlb_others_ex(const struct cpumask *cpus,
@@ -231,6 +234,6 @@ void hyperv_setup_mmu_ops(void)
return;
 
pr_info("Using hypercall for remote TLB flush\n");
-   pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
+   pv_ops.mmu.flush_tlb_multi = hyperv_flush_tlb_multi;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index c25c38a05c1c..316959e89258 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -62,10 +62,10 @@ static inline void __flush_tlb_one_user(unsigned long addr)
PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
 }
 
-static inline void flush_tlb_others(const struct cpumask *cpumask,
-   const struct flush_tlb_info *info)
+static inline void flush_tlb_multi(const struct cpumask *cpumask,
+  const struct flush_tlb_info *info)
 {
-   PVOP_VCALL2(mmu.flush_tlb_others, cpumask, info);
+   PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
 }
 
 static inline void paravirt_tlb_remove_table(struct mmu_gather *tlb, void 
*table)
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 946f8f1f1efc..54f4c718b5b0 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -211,8 +211,8 @@ struct pv_mmu_ops {
void (*flush_tlb_user)(void);
void (*flush_tlb_kernel)(void);
void (*flush_tlb_one_

[PATCH v2 0/9] x86: Concurrent TLB flushes

2019-07-03 Thread Nadav Amit via Virtualization
Currently, local and remote TLB flushes are not performed concurrently,
which introduces unnecessary overhead - each INVLPG can take 100s of
cycles. This patch-set allows TLB flushes to be run concurrently: first
request the remote CPUs to initiate the flush, then run it locally, and
finally wait for the remote CPUs to finish their work.

In addition, there are various small optimizations to avoid unwarranted
false-sharing and atomic operations.

The proposed changes should also improve the performance of other
invocations of on_each_cpu(). Hopefully, no one has relied on this
behavior of on_each_cpu() that invoked functions first remotely and only
then locally [Peter says he remembers someone might do so, but without
further information it is hard to know how to address it].
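
The ordering, in outline (illustrative only - the series folds this into
smp_call_function_many() itself, and the helper and per-CPU names below are
made up):

#include <linux/smp.h>
#include <linux/percpu.h>
#include <linux/atomic.h>

struct example_req {
        void (*func)(void *info);
        void *info;
        atomic_t pending;
};

static DEFINE_PER_CPU(call_single_data_t, example_csd);

static void example_trampoline(void *data)
{
        struct example_req *req = data;

        req->func(req->info);
        atomic_dec(&req->pending);
}

/* Must be called with preemption disabled. */
static void example_run_concurrently(const struct cpumask *mask,
                                     void (*func)(void *), void *info)
{
        int cpu, this_cpu = smp_processor_id();
        struct example_req req = {
                .func = func,
                .info = info,
                .pending = ATOMIC_INIT(0),
        };

        /* 1. Ask the remote CPUs to start, without waiting for them. */
        for_each_cpu(cpu, mask) {
                call_single_data_t *csd = &per_cpu(example_csd, cpu);

                if (cpu == this_cpu)
                        continue;
                csd->func = example_trampoline;
                csd->info = &req;
                atomic_inc(&req.pending);
                smp_call_function_single_async(cpu, csd);
        }

        /* 2. Run the function locally while the remote CPUs work. */
        if (cpumask_test_cpu(this_cpu, mask))
                func(info);

        /* 3. Wait for the remote CPUs to finish. */
        while (atomic_read(&req.pending))
                cpu_relax();
}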

Running sysbench on dax/ext4 w/emulated-pmem, write-cache disabled on
2-socket, 48-logical-cores (24+SMT) Haswell-X, 5 repetitions:

 sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
  --file-io-mode=mmap --threads=X --file-fsync-mode=fdatasync run

  Th.   tip-jun28 avg (stdev)   +patch-set avg (stdev)  change
  ---   -   --  --
  1 1267765 (14146) 1299253 (5715)  +2.4%
  2 1734644 (11936) 1799225 (19577) +3.7%
  4 2821268 (41184) 2919132 (40149) +3.4%
  8 4171652 (31243) 4376925 (65416) +4.9%
  16    5590729 (24160) 5829866 (8127)  +4.2%
  24    6250212 (24481) 6522303 (28044) +4.3%
  32    3994314 (26606) 4077543 (10685) +2.0%
  48    4345177 (28091) 4417821 (41337) +1.6%

(Note that on configurations with up to 24 threads numactl was used to
set all threads on socket 1, which explains the drop in performance when
going to 32 threads).

Running the same benchmark with security mitigations disabled (PTI,
Spectre, MDS):

  Th.   tip-jun28 avg (stdev)   +patch-set avg (stdev)  change
  ---   -   --  --
  1 1598896 (5174)  1607903 (4091)  +0.5%
  2 2109472 (17827) 2224726 (4372)  +5.4%
  4 3448587 (11952) 3668551 (30219) +6.3%
  8 5425778 (29641) 5606266 (33519) +3.3%
  16    6931232 (34677) 7054052 (27873) +1.7%
  24    7612473 (23482) 7783138 (13871) +2.2%
  32    4296274 (18029) 4283279 (32323) -0.3%
  48    4770029 (35541) 4764760 (13575) -0.1%

Presumably, PTI requires two invalidations of each mapping, which allows
concurrency to yield higher benefits when PTI is on. At the same
time, when mitigations are on, other overheads reduce the potential
speedup.

I tried to reduce the size of the code of the main patch, which required
restructuring of the series.

v1 -> v2:
* Removing the patches that Thomas took [tglx]
* Adding hyper-v, Xen compile-tested implementations [Dave]
* Removing UV [Andy]
* Adding lazy optimization, removing inline keyword [Dave]
* Restructuring patch-set

RFCv2 -> v1:
* Fix comment on flush_tlb_multi [Juergen]
* Removing async invalidation optimizations [Andy]
* Adding KVM support [Paolo]

Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Boris Ostrovsky 
Cc: Dave Hansen 
Cc: Haiyang Zhang 
Cc: Ingo Molnar 
Cc: Josh Poimboeuf 
Cc: Juergen Gross 
Cc: "K. Y. Srinivasan" 
Cc: Paolo Bonzini 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Sasha Levin 
Cc: Stephen Hemminger 
Cc: Thomas Gleixner 
Cc: k...@vger.kernel.org
Cc: linux-hyp...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: x...@kernel.org
Cc: xen-de...@lists.xenproject.org

Nadav Amit (9):
  smp: Run functions concurrently in smp_call_function_many()
  x86/mm/tlb: Remove reason as argument for flush_tlb_func_local()
  x86/mm/tlb: Open-code on_each_cpu_cond_mask() for tlb_is_not_lazy()
  x86/mm/tlb: Flush remote and local TLBs concurrently
  x86/mm/tlb: Privatize cpu_tlbstate
  x86/mm/tlb: Do not make is_lazy dirty for no reason
  cpumask: Mark functions as pure
  x86/mm/tlb: Remove UV special case
  x86/mm/tlb: Remove unnecessary uses of the inline keyword

 arch/x86/hyperv/mmu.c |  13 ++-
 arch/x86/include/asm/paravirt.h   |   6 +-
 arch/x86/include/asm/paravirt_types.h |   4 +-
 arch/x86/include/asm/tlbflush.h   |  48 +
 arch/x86/include/asm/trace/hyperv.h   |   2 +-
 arch/x86/kernel/kvm.c |  11 +-
 arch/x86/kernel/paravirt.c|   2 +-
 arch/x86/mm/init.c|   2 +-
 arch/x86/mm/tlb.c | 147 --
 arch/x86/xen/mmu_pv.c |  20 ++--
 include/linux/cpumask.h   |   6 +-
 include/linux/smp.h   |  27 +++--
 include/trace/events/xen.h|   2 +-
 kernel/smp.c  | 133 +++
 14 files changed, 245 insertions(+

Re: [PATCH 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-06-25 Thread Nadav Amit via Virtualization
> On Jun 25, 2019, at 8:36 PM, Andy Lutomirski  wrote:
> 
> On Wed, Jun 12, 2019 at 11:49 PM Nadav Amit  wrote:
>> To improve TLB shootdown performance, flush the remote and local TLBs
>> concurrently. Introduce flush_tlb_multi() that does so. The current
>> flush_tlb_others() interface is kept, since paravirtual interfaces need
>> to be adapted first before it can be removed. This is left for future
>> work. In such PV environments, TLB flushes are not performed, at this
>> time, concurrently.
> 
> Would it be straightforward to have a default PV flush_tlb_multi()
> that uses flush_tlb_others() under the hood?

I prefer not to have a default PV implementation that should anyhow go away.

I can create unoptimized, untested versions for Xen and hyper-v, if you want.
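
For reference, such a default wrapper would look roughly like this
(hypothetical sketch; it is not part of this series, which converts each PV
backend instead):

/*
 * Hypothetical fallback: implement flush_tlb_multi() on top of the old
 * flush_tlb_others() PV hook by doing the local flush ourselves.
 */
static void default_flush_tlb_multi(const struct cpumask *cpumask,
                                    const struct flush_tlb_info *info)
{
        if (cpumask_test_cpu(smp_processor_id(), cpumask)) {
                unsigned long flags;

                local_irq_save(flags);
                flush_tlb_func_local(info);
                local_irq_restore(flags);
        }

        flush_tlb_others(cpumask, info);
}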

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-06-25 Thread Nadav Amit via Virtualization
> On Jun 25, 2019, at 8:00 PM, Dave Hansen  wrote:
> 
> On 6/25/19 7:35 PM, Nadav Amit wrote:
>>>> const struct flush_tlb_info *f = info;
>>>> +  enum tlb_flush_reason reason;
>>>> +
>>>> +  reason = (f->mm == NULL) ? TLB_LOCAL_SHOOTDOWN : TLB_LOCAL_MM_SHOOTDOWN;
>>> 
>>> Should we just add the "reason" to flush_tlb_info?  It's OK-ish to imply
>>> it like this, but seems like it would be nicer and easier to track down
>>> the origins of these things if we did this at the caller.
>> 
>> I prefer not to. I want later to inline flush_tlb_info into the same
>> cacheline that holds call_function_data. Increasing the size of
>> flush_tlb_info for no good reason will not help…
> 
> Well, flush_tlb_info is at 6/8ths of a cacheline at the moment.
> call_function_data is 3/8ths.  To me, that means we have some slack in
> the size.

I do not understand your math.. :(

6/8 + 3/8 > 8/8, so putting both flush_tlb_info and call_function_data does not
leave us any slack (we can save one qword, so we can actually fit them
in the same cacheline).

You can see my current implementation here:

https://lore.kernel.org/lkml/20190531063645.4697-4-na...@vmware.com/T/#m0ab5fe0799ba9ff0d41197f1095679fe26aebd57
https://lore.kernel.org/lkml/20190531063645.4697-4-na...@vmware.com/T/#m7b35a93dffd23fbb7ca813c795a0777d4cdcb51b

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-06-25 Thread Nadav Amit via Virtualization
> On Jun 25, 2019, at 2:29 PM, Dave Hansen  wrote:
> 
> On 6/12/19 11:48 PM, Nadav Amit wrote:
>> To improve TLB shootdown performance, flush the remote and local TLBs
>> concurrently. Introduce flush_tlb_multi() that does so. The current
>> flush_tlb_others() interface is kept, since paravirtual interfaces need
>> to be adapted first before it can be removed. This is left for future
>> work. In such PV environments, TLB flushes are not performed, at this
>> time, concurrently.
>> 
>> Add a static key to tell whether this new interface is supported.
>> 
>> Cc: "K. Y. Srinivasan" 
>> Cc: Haiyang Zhang 
>> Cc: Stephen Hemminger 
>> Cc: Sasha Levin 
>> Cc: Thomas Gleixner 
>> Cc: Ingo Molnar 
>> Cc: Borislav Petkov 
>> Cc: x...@kernel.org
>> Cc: Juergen Gross 
>> Cc: Paolo Bonzini 
>> Cc: Dave Hansen 
>> Cc: Andy Lutomirski 
>> Cc: Peter Zijlstra 
>> Cc: Boris Ostrovsky 
>> Cc: linux-hyp...@vger.kernel.org
>> Cc: linux-ker...@vger.kernel.org
>> Cc: virtualization@lists.linux-foundation.org
>> Cc: k...@vger.kernel.org
>> Cc: xen-de...@lists.xenproject.org
>> Signed-off-by: Nadav Amit 
>> ---
>> arch/x86/hyperv/mmu.c |  2 +
>> arch/x86/include/asm/paravirt.h   |  8 +++
>> arch/x86/include/asm/paravirt_types.h |  6 +++
>> arch/x86/include/asm/tlbflush.h   |  6 +++
>> arch/x86/kernel/kvm.c |  1 +
>> arch/x86/kernel/paravirt.c|  3 ++
>> arch/x86/mm/tlb.c | 71 ++-
>> arch/x86/xen/mmu_pv.c |  2 +
>> 8 files changed, 87 insertions(+), 12 deletions(-)
>> 
>> diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
>> index e65d7fe6489f..ca28b400c87c 100644
>> --- a/arch/x86/hyperv/mmu.c
>> +++ b/arch/x86/hyperv/mmu.c
>> @@ -233,4 +233,6 @@ void hyperv_setup_mmu_ops(void)
>>  pr_info("Using hypercall for remote TLB flush\n");
>>  pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
>>  pv_ops.mmu.tlb_remove_table = tlb_remove_table;
>> +
>> +static_key_disable(&flush_tlb_multi_enabled.key);
>> }
>> diff --git a/arch/x86/include/asm/paravirt.h 
>> b/arch/x86/include/asm/paravirt.h
>> index c25c38a05c1c..192be7254457 100644
>> --- a/arch/x86/include/asm/paravirt.h
>> +++ b/arch/x86/include/asm/paravirt.h
>> @@ -47,6 +47,8 @@ static inline void slow_down_io(void)
>> #endif
>> }
>> 
>> +DECLARE_STATIC_KEY_TRUE(flush_tlb_multi_enabled);
>> +
>> static inline void __flush_tlb(void)
>> {
>>  PVOP_VCALL0(mmu.flush_tlb_user);
>> @@ -62,6 +64,12 @@ static inline void __flush_tlb_one_user(unsigned long 
>> addr)
>>  PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
>> }
>> 
>> +static inline void flush_tlb_multi(const struct cpumask *cpumask,
>> +   const struct flush_tlb_info *info)
>> +{
>> +PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
>> +}
>> +
>> static inline void flush_tlb_others(const struct cpumask *cpumask,
>>  const struct flush_tlb_info *info)
>> {
>> diff --git a/arch/x86/include/asm/paravirt_types.h 
>> b/arch/x86/include/asm/paravirt_types.h
>> index 946f8f1f1efc..b93b3d90729a 100644
>> --- a/arch/x86/include/asm/paravirt_types.h
>> +++ b/arch/x86/include/asm/paravirt_types.h
>> @@ -211,6 +211,12 @@ struct pv_mmu_ops {
>>  void (*flush_tlb_user)(void);
>>  void (*flush_tlb_kernel)(void);
>>  void (*flush_tlb_one_user)(unsigned long addr);
>> +/*
>> + * flush_tlb_multi() is the preferred interface, which is capable to
>> + * flush both local and remote CPUs.
>> + */
>> +void (*flush_tlb_multi)(const struct cpumask *cpus,
>> +const struct flush_tlb_info *info);
>>  void (*flush_tlb_others)(const struct cpumask *cpus,
>>   const struct flush_tlb_info *info);
>> 
>> diff --git a/arch/x86/include/asm/tlbflush.h 
>> b/arch/x86/include/asm/tlbflush.h
>> index dee375831962..79272938cf79 100644
>> --- a/arch/x86/include/asm/tlbflush.h
>> +++ b/arch/x86/include/asm/tlbflush.h
>> @@ -569,6 +569,9 @@ static inline void flush_tlb_page(struct vm_area_struct 
>> *vma, unsigned long a)
>>  flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
>> }
>> 
>> +void native_flush_tlb_multi(const struct cpumask *cpumask,
>> + const struct

[PATCH 4/9] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-06-13 Thread Nadav Amit via Virtualization
To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. The current
flush_tlb_others() interface is kept, since paravirtual interfaces need
to be adapted first before it can be removed. This is left for future
work. In such PV environments, TLB flushes are not performed, at this
time, concurrently.

Add a static key to tell whether this new interface is supported.

Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Sasha Levin 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: Juergen Gross 
Cc: Paolo Bonzini 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Boris Ostrovsky 
Cc: linux-hyp...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: k...@vger.kernel.org
Cc: xen-de...@lists.xenproject.org
Signed-off-by: Nadav Amit 
---
 arch/x86/hyperv/mmu.c |  2 +
 arch/x86/include/asm/paravirt.h   |  8 +++
 arch/x86/include/asm/paravirt_types.h |  6 +++
 arch/x86/include/asm/tlbflush.h   |  6 +++
 arch/x86/kernel/kvm.c |  1 +
 arch/x86/kernel/paravirt.c|  3 ++
 arch/x86/mm/tlb.c | 71 ++-
 arch/x86/xen/mmu_pv.c |  2 +
 8 files changed, 87 insertions(+), 12 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index e65d7fe6489f..ca28b400c87c 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -233,4 +233,6 @@ void hyperv_setup_mmu_ops(void)
pr_info("Using hypercall for remote TLB flush\n");
pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
+
+   static_key_disable(&flush_tlb_multi_enabled.key);
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index c25c38a05c1c..192be7254457 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -47,6 +47,8 @@ static inline void slow_down_io(void)
 #endif
 }
 
+DECLARE_STATIC_KEY_TRUE(flush_tlb_multi_enabled);
+
 static inline void __flush_tlb(void)
 {
PVOP_VCALL0(mmu.flush_tlb_user);
@@ -62,6 +64,12 @@ static inline void __flush_tlb_one_user(unsigned long addr)
PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
 }
 
+static inline void flush_tlb_multi(const struct cpumask *cpumask,
+  const struct flush_tlb_info *info)
+{
+   PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
+}
+
 static inline void flush_tlb_others(const struct cpumask *cpumask,
const struct flush_tlb_info *info)
 {
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 946f8f1f1efc..b93b3d90729a 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -211,6 +211,12 @@ struct pv_mmu_ops {
void (*flush_tlb_user)(void);
void (*flush_tlb_kernel)(void);
void (*flush_tlb_one_user)(unsigned long addr);
+   /*
+* flush_tlb_multi() is the preferred interface, which is capable to
+* flush both local and remote CPUs.
+*/
+   void (*flush_tlb_multi)(const struct cpumask *cpus,
+   const struct flush_tlb_info *info);
void (*flush_tlb_others)(const struct cpumask *cpus,
 const struct flush_tlb_info *info);
 
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index dee375831962..79272938cf79 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -569,6 +569,9 @@ static inline void flush_tlb_page(struct vm_area_struct 
*vma, unsigned long a)
flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
 
+void native_flush_tlb_multi(const struct cpumask *cpumask,
+const struct flush_tlb_info *info);
+
 void native_flush_tlb_others(const struct cpumask *cpumask,
 const struct flush_tlb_info *info);
 
@@ -593,6 +596,9 @@ static inline void arch_tlbbatch_add_mm(struct 
arch_tlbflush_unmap_batch *batch,
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 
 #ifndef CONFIG_PARAVIRT
+#define flush_tlb_multi(mask, info)\
+   native_flush_tlb_multi(mask, info)
+
 #define flush_tlb_others(mask, info)   \
native_flush_tlb_others(mask, info)
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 5169b8cc35bb..00d81e898717 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -630,6 +630,7 @@ static void __init kvm_guest_init(void)
kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
+   static_key_disable

Re: [RFC PATCH v2 04/12] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-05-31 Thread Nadav Amit via Virtualization
> On May 31, 2019, at 4:48 AM, Juergen Gross  wrote:
> 
> On 31/05/2019 08:36, Nadav Amit wrote:
>> 
>> --- a/arch/x86/include/asm/paravirt_types.h
>> +++ b/arch/x86/include/asm/paravirt_types.h
>> @@ -211,6 +211,12 @@ struct pv_mmu_ops {
>>  void (*flush_tlb_user)(void);
>>  void (*flush_tlb_kernel)(void);
>>  void (*flush_tlb_one_user)(unsigned long addr);
>> +/*
>> + * flush_tlb_multi() is the preferred interface. When it is used,
>> + * flush_tlb_others() should return false.
> 
> Didn't you want to remove/change this comment?

Yes! Sorry for that. Fixed now.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[RFC PATCH v2 04/12] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-05-31 Thread Nadav Amit via Virtualization
To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. The current
flush_tlb_others() interface is kept, since paravirtual interfaces need
to be adapted first before it can be removed. This is left for future
work. In such PV environments, TLB flushes are not performed, at this
time, concurrently.

Add a static key to tell whether this new interface is supported.

Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Sasha Levin 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: Juergen Gross 
Cc: Paolo Bonzini 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Boris Ostrovsky 
Cc: linux-hyp...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: k...@vger.kernel.org
Cc: xen-de...@lists.xenproject.org
Signed-off-by: Nadav Amit 
---
 arch/x86/hyperv/mmu.c |  2 +
 arch/x86/include/asm/paravirt.h   |  8 +++
 arch/x86/include/asm/paravirt_types.h |  6 ++
 arch/x86/include/asm/tlbflush.h   |  6 ++
 arch/x86/kernel/kvm.c |  1 +
 arch/x86/kernel/paravirt.c|  3 +
 arch/x86/mm/tlb.c | 80 +++
 arch/x86/xen/mmu_pv.c |  2 +
 8 files changed, 96 insertions(+), 12 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index e65d7fe6489f..ca28b400c87c 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -233,4 +233,6 @@ void hyperv_setup_mmu_ops(void)
pr_info("Using hypercall for remote TLB flush\n");
pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
+
+   static_key_disable(&flush_tlb_multi_enabled.key);
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index c25c38a05c1c..192be7254457 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -47,6 +47,8 @@ static inline void slow_down_io(void)
 #endif
 }
 
+DECLARE_STATIC_KEY_TRUE(flush_tlb_multi_enabled);
+
 static inline void __flush_tlb(void)
 {
PVOP_VCALL0(mmu.flush_tlb_user);
@@ -62,6 +64,12 @@ static inline void __flush_tlb_one_user(unsigned long addr)
PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
 }
 
+static inline void flush_tlb_multi(const struct cpumask *cpumask,
+  const struct flush_tlb_info *info)
+{
+   PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
+}
+
 static inline void flush_tlb_others(const struct cpumask *cpumask,
const struct flush_tlb_info *info)
 {
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 946f8f1f1efc..3a156e63c57d 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -211,6 +211,12 @@ struct pv_mmu_ops {
void (*flush_tlb_user)(void);
void (*flush_tlb_kernel)(void);
void (*flush_tlb_one_user)(unsigned long addr);
+   /*
+* flush_tlb_multi() is the preferred interface. When it is used,
+* flush_tlb_others() should return false.
+*/
+   void (*flush_tlb_multi)(const struct cpumask *cpus,
+   const struct flush_tlb_info *info);
void (*flush_tlb_others)(const struct cpumask *cpus,
 const struct flush_tlb_info *info);
 
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index dee375831962..79272938cf79 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -569,6 +569,9 @@ static inline void flush_tlb_page(struct vm_area_struct 
*vma, unsigned long a)
flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
 
+void native_flush_tlb_multi(const struct cpumask *cpumask,
+const struct flush_tlb_info *info);
+
 void native_flush_tlb_others(const struct cpumask *cpumask,
 const struct flush_tlb_info *info);
 
@@ -593,6 +596,9 @@ static inline void arch_tlbbatch_add_mm(struct 
arch_tlbflush_unmap_batch *batch,
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 
 #ifndef CONFIG_PARAVIRT
+#define flush_tlb_multi(mask, info)\
+   native_flush_tlb_multi(mask, info)
+
 #define flush_tlb_others(mask, info)   \
native_flush_tlb_others(mask, info)
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 3f0cc828cc36..c1c2b88ea3f1 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -643,6 +643,7 @@ static void __init kvm_guest_init(void)
kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
+   static_key_disable

Re: [RFC PATCH 5/6] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-05-27 Thread Nadav Amit via Virtualization
> On May 27, 2019, at 2:47 AM, Peter Zijlstra  wrote:
> 
> On Sat, May 25, 2019 at 10:54:50AM +0200, Juergen Gross wrote:
>> On 25/05/2019 10:22, Nadav Amit wrote:
> 
>>> diff --git a/arch/x86/include/asm/paravirt_types.h 
>>> b/arch/x86/include/asm/paravirt_types.h
>>> index 946f8f1f1efc..3a156e63c57d 100644
>>> --- a/arch/x86/include/asm/paravirt_types.h
>>> +++ b/arch/x86/include/asm/paravirt_types.h
>>> @@ -211,6 +211,12 @@ struct pv_mmu_ops {
>>> void (*flush_tlb_user)(void);
>>> void (*flush_tlb_kernel)(void);
>>> void (*flush_tlb_one_user)(unsigned long addr);
>>> +   /*
>>> +* flush_tlb_multi() is the preferred interface. When it is used,
>>> +* flush_tlb_others() should return false.
>> 
>> This comment does not make sense. flush_tlb_others() return type is
>> void.
> 
> I suspect that is an artifact from before the static_key; an attempt to
> make the pv interface less awkward.

Yes, remainders that should have been removed - I will remove them for the
next version.

> Something like the below would work for KVM I suspect, the others
> (Hyper-V and Xen are more 'interesting').
> 
> ---
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -580,7 +580,7 @@ static void __init kvm_apf_trap_init(voi
> 
> static DEFINE_PER_CPU(cpumask_var_t, __pv_tlb_mask);
> 
> -static void kvm_flush_tlb_others(const struct cpumask *cpumask,
> +static void kvm_flush_tlb_multi(const struct cpumask *cpumask,
>   const struct flush_tlb_info *info)
> {
>   u8 state;
> @@ -594,6 +594,9 @@ static void kvm_flush_tlb_others(const s
>* queue flush_on_enter for pre-empted vCPUs
>*/
>   for_each_cpu(cpu, flushmask) {
> + if (cpu == smp_processor_id())
> + continue;
> +
>   src = &per_cpu(steal_time, cpu);
>   state = READ_ONCE(src->preempted);
>   if ((state & KVM_VCPU_PREEMPTED)) {
> @@ -603,7 +606,7 @@ static void kvm_flush_tlb_others(const s
>   }
>   }
> 
> - native_flush_tlb_others(flushmask, info);
> + native_flush_tlb_multi(flushmask, info);
> }
> 
> static void __init kvm_guest_init(void)
> @@ -628,9 +631,8 @@ static void __init kvm_guest_init(void)
>   if (kvm_para_has_feature(KVM_FEATURE_PV_TLB_FLUSH) &&
>   !kvm_para_has_hint(KVM_HINTS_REALTIME) &&
>   kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
> - pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
> + pv_ops.mmu.flush_tlb_multi = kvm_flush_tlb_multi;
>   pv_ops.mmu.tlb_remove_table = tlb_remove_table;
> - static_key_disable(&flush_tlb_multi_enabled.key);
>   }
> 
>   if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))

That’s what I have as well ;-).

As you mentioned (in another email), the hyper-v code specifically seems
convoluted to me. In general, I prefer not to touch KVM/Xen/hyper-v, but since
you twist my arm, I will send a compile-tested version for Xen and hyper-v.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [RFC PATCH 5/6] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-05-25 Thread Nadav Amit via Virtualization
> On May 25, 2019, at 1:22 AM, Nadav Amit  wrote:
> 
> To improve TLB shootdown performance, flush the remote and local TLBs
> concurrently. Introduce flush_tlb_multi() that does so. The current
> flush_tlb_others() interface is kept, since paravirtual interfaces need
> to be adapted first before it can be removed. This is left for future
> work. In such PV environments, TLB flushes are not performed, at this
> time, concurrently.
> 
> +void native_flush_tlb_multi(const struct cpumask *cpumask,
> + const struct flush_tlb_info *info)
> {
> + /*
> +  * native_flush_tlb_multi() can handle a single CPU, but it is
> +  * suboptimal if the local TLB should be flushed, and therefore should
> +  * not be used in such case. Check that it is not used in such case,
> +  * and use this assumption for tracing and accounting of remote TLB
> +  * flushes.
> +  */
> + VM_WARN_ON(!cpumask_any_but(cpumask, smp_processor_id()));

This warning might fire off incorrectly and will be removed.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[RFC PATCH 5/6] x86/mm/tlb: Flush remote and local TLBs concurrently

2019-05-25 Thread Nadav Amit via Virtualization
To improve TLB shootdown performance, flush the remote and local TLBs
concurrently. Introduce flush_tlb_multi() that does so. The current
flush_tlb_others() interface is kept, since paravirtual interfaces need
to be adapted first before it can be removed. This is left for future
work. In such PV environments, TLB flushes are not performed, at this
time, concurrently.

Add a static key to tell whether this new interface is supported.

Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Stephen Hemminger 
Cc: Sasha Levin 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: x...@kernel.org
Cc: Juergen Gross 
Cc: Paolo Bonzini 
Cc: Dave Hansen 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Cc: Boris Ostrovsky 
Cc: linux-hyp...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Cc: virtualization@lists.linux-foundation.org
Cc: k...@vger.kernel.org
Cc: xen-de...@lists.xenproject.org
Signed-off-by: Nadav Amit 
---
 arch/x86/hyperv/mmu.c |  2 +
 arch/x86/include/asm/paravirt.h   |  8 +++
 arch/x86/include/asm/paravirt_types.h |  6 ++
 arch/x86/include/asm/tlbflush.h   |  6 ++
 arch/x86/kernel/kvm.c |  1 +
 arch/x86/kernel/paravirt.c|  3 +
 arch/x86/mm/tlb.c | 80 +++
 arch/x86/xen/mmu_pv.c |  2 +
 8 files changed, 96 insertions(+), 12 deletions(-)

diff --git a/arch/x86/hyperv/mmu.c b/arch/x86/hyperv/mmu.c
index e65d7fe6489f..ca28b400c87c 100644
--- a/arch/x86/hyperv/mmu.c
+++ b/arch/x86/hyperv/mmu.c
@@ -233,4 +233,6 @@ void hyperv_setup_mmu_ops(void)
pr_info("Using hypercall for remote TLB flush\n");
pv_ops.mmu.flush_tlb_others = hyperv_flush_tlb_others;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
+
+   static_key_disable(&flush_tlb_multi_enabled.key);
 }
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index c25c38a05c1c..192be7254457 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -47,6 +47,8 @@ static inline void slow_down_io(void)
 #endif
 }
 
+DECLARE_STATIC_KEY_TRUE(flush_tlb_multi_enabled);
+
 static inline void __flush_tlb(void)
 {
PVOP_VCALL0(mmu.flush_tlb_user);
@@ -62,6 +64,12 @@ static inline void __flush_tlb_one_user(unsigned long addr)
PVOP_VCALL1(mmu.flush_tlb_one_user, addr);
 }
 
+static inline void flush_tlb_multi(const struct cpumask *cpumask,
+  const struct flush_tlb_info *info)
+{
+   PVOP_VCALL2(mmu.flush_tlb_multi, cpumask, info);
+}
+
 static inline void flush_tlb_others(const struct cpumask *cpumask,
const struct flush_tlb_info *info)
 {
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index 946f8f1f1efc..3a156e63c57d 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -211,6 +211,12 @@ struct pv_mmu_ops {
void (*flush_tlb_user)(void);
void (*flush_tlb_kernel)(void);
void (*flush_tlb_one_user)(unsigned long addr);
+   /*
+* flush_tlb_multi() is the preferred interface. When it is used,
+* flush_tlb_others() should return false.
+*/
+   void (*flush_tlb_multi)(const struct cpumask *cpus,
+   const struct flush_tlb_info *info);
void (*flush_tlb_others)(const struct cpumask *cpus,
 const struct flush_tlb_info *info);
 
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index dee375831962..79272938cf79 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -569,6 +569,9 @@ static inline void flush_tlb_page(struct vm_area_struct 
*vma, unsigned long a)
flush_tlb_mm_range(vma->vm_mm, a, a + PAGE_SIZE, PAGE_SHIFT, false);
 }
 
+void native_flush_tlb_multi(const struct cpumask *cpumask,
+const struct flush_tlb_info *info);
+
 void native_flush_tlb_others(const struct cpumask *cpumask,
 const struct flush_tlb_info *info);
 
@@ -593,6 +596,9 @@ static inline void arch_tlbbatch_add_mm(struct 
arch_tlbflush_unmap_batch *batch,
 extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 
 #ifndef CONFIG_PARAVIRT
+#define flush_tlb_multi(mask, info)\
+   native_flush_tlb_multi(mask, info)
+
 #define flush_tlb_others(mask, info)   \
native_flush_tlb_others(mask, info)
 
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 3f0cc828cc36..c1c2b88ea3f1 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -643,6 +643,7 @@ static void __init kvm_guest_init(void)
kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
pv_ops.mmu.flush_tlb_others = kvm_flush_tlb_others;
pv_ops.mmu.tlb_remove_table = tlb_remove_table;
+   static_key_disable

Re: [PATCH v4 0/4] vmw_balloon: Compaction and shrinker support

2019-05-17 Thread Nadav Amit
> On May 17, 2019, at 10:24 AM, Greg Kroah-Hartman  
> wrote:
> 
> On Fri, May 17, 2019 at 05:10:23PM +, Nadav Amit wrote:
>>> On May 3, 2019, at 6:25 PM, Nadav Amit  wrote:
>>> 
>>>> On Apr 25, 2019, at 4:54 AM, Nadav Amit  wrote:
>>>> 
>>>> VMware balloon enhancements: adding support for memory compaction,
>>>> memory shrinker (to prevent OOM) and splitting of refused pages to
>>>> prevent recurring inflations.
>>>> 
>>>> Patches 1-2: Support for compaction
>>>> Patch 3: Support for memory shrinker - disabled by default
>>>> Patch 4: Split refused pages to improve performance
>>>> 
>>>> v3->v4:
>>>> * "get around to" comment [Michael]
>>>> * Put list_add under page lock [Michael]
>>>> 
>>>> v2->v3:
>>>> * Fixing wrong argument type (int->size_t) [Michael]
>>>> * Fixing a comment (it) [Michael]
>>>> * Reinstating the BUG_ON() when page is locked [Michael] 
>>>> 
>>>> v1->v2:
>>>> * Return number of pages in list enqueue/dequeue interfaces [Michael]
>>>> * Removed first two patches which were already merged
>>>> 
>>>> Nadav Amit (4):
>>>> mm/balloon_compaction: List interfaces
>>>> vmw_balloon: Compaction support
>>>> vmw_balloon: Add memory shrinker
>>>> vmw_balloon: Split refused pages
>>>> 
>>>> drivers/misc/Kconfig   |   1 +
>>>> drivers/misc/vmw_balloon.c | 489 ++---
>>>> include/linux/balloon_compaction.h |   4 +
>>>> mm/balloon_compaction.c| 144 ++---
>>>> 4 files changed, 553 insertions(+), 85 deletions(-)
>>>> 
>>>> -- 
>>>> 2.19.1
>>> 
>>> Ping.
>> 
>> Ping.
>> 
>> Greg, did it get lost again?
> 
> 
> I thought you needed the mm developers to ack the first patch, did that
> ever happen?

Yes. You will see Michael Tsirkin’s “Acked-by" on it.



Re: [PATCH v4 0/4] vmw_balloon: Compaction and shrinker support

2019-05-17 Thread Nadav Amit
> On May 3, 2019, at 6:25 PM, Nadav Amit  wrote:
> 
>> On Apr 25, 2019, at 4:54 AM, Nadav Amit  wrote:
>> 
>> VMware balloon enhancements: adding support for memory compaction,
>> memory shrinker (to prevent OOM) and splitting of refused pages to
>> prevent recurring inflations.
>> 
>> Patches 1-2: Support for compaction
>> Patch 3: Support for memory shrinker - disabled by default
>> Patch 4: Split refused pages to improve performance
>> 
>> v3->v4:
>> * "get around to" comment [Michael]
>> * Put list_add under page lock [Michael]
>> 
>> v2->v3:
>> * Fixing wrong argument type (int->size_t) [Michael]
>> * Fixing a comment (it) [Michael]
>> * Reinstating the BUG_ON() when page is locked [Michael] 
>> 
>> v1->v2:
>> * Return number of pages in list enqueue/dequeue interfaces [Michael]
>> * Removed first two patches which were already merged
>> 
>> Nadav Amit (4):
>> mm/balloon_compaction: List interfaces
>> vmw_balloon: Compaction support
>> vmw_balloon: Add memory shrinker
>> vmw_balloon: Split refused pages
>> 
>> drivers/misc/Kconfig   |   1 +
>> drivers/misc/vmw_balloon.c | 489 ++---
>> include/linux/balloon_compaction.h |   4 +
>> mm/balloon_compaction.c| 144 ++---
>> 4 files changed, 553 insertions(+), 85 deletions(-)
>> 
>> -- 
>> 2.19.1
> 
> Ping.

Ping.

Greg, did it get lost again?



Re: [PATCH v4 0/4] vmw_balloon: Compaction and shrinker support

2019-05-03 Thread Nadav Amit via Virtualization
> On Apr 25, 2019, at 4:54 AM, Nadav Amit  wrote:
> 
> VMware balloon enhancements: adding support for memory compaction,
> memory shrinker (to prevent OOM) and splitting of refused pages to
> prevent recurring inflations.
> 
> Patches 1-2: Support for compaction
> Patch 3: Support for memory shrinker - disabled by default
> Patch 4: Split refused pages to improve performance
> 
> v3->v4:
> * "get around to" comment [Michael]
> * Put list_add under page lock [Michael]
> 
> v2->v3:
> * Fixing wrong argument type (int->size_t) [Michael]
> * Fixing a comment (it) [Michael]
> * Reinstating the BUG_ON() when page is locked [Michael] 
> 
> v1->v2:
> * Return number of pages in list enqueue/dequeue interfaces [Michael]
> * Removed first two patches which were already merged
> 
> Nadav Amit (4):
>  mm/balloon_compaction: List interfaces
>  vmw_balloon: Compaction support
>  vmw_balloon: Add memory shrinker
>  vmw_balloon: Split refused pages
> 
> drivers/misc/Kconfig   |   1 +
> drivers/misc/vmw_balloon.c | 489 ++---
> include/linux/balloon_compaction.h |   4 +
> mm/balloon_compaction.c| 144 ++---
> 4 files changed, 553 insertions(+), 85 deletions(-)
> 
> -- 
> 2.19.1

Ping.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v4 4/4] vmw_balloon: Split refused pages

2019-04-25 Thread Nadav Amit via Virtualization
The hypervisor might refuse to inflate pages. While the balloon driver
handles this scenario correctly, a refusal to inflate a 2MB page might
cause the same page to be allocated again later just for its inflation
to be refused again. This wastes energy and time.

To avoid this situation, split the 2MB page into 4KB pages, and then try
to inflate each one individually. Most of the 4KB pages out of the 2MB
should be inflated successfully, and the balloon is likely to prevent
the scenario of repeated refused inflation.

Reviewed-by: Xavier Deguillard 
Signed-off-by: Nadav Amit 
---
 drivers/misc/vmw_balloon.c | 63 +++---
 1 file changed, 52 insertions(+), 11 deletions(-)

diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c
index 4b5e939ff4c8..043eed845246 100644
--- a/drivers/misc/vmw_balloon.c
+++ b/drivers/misc/vmw_balloon.c
@@ -239,6 +239,7 @@ static DEFINE_STATIC_KEY_FALSE(balloon_stat_enabled);
 struct vmballoon_ctl {
struct list_head pages;
struct list_head refused_pages;
+   struct list_head prealloc_pages;
unsigned int n_refused_pages;
unsigned int n_pages;
enum vmballoon_page_size_type page_size;
@@ -668,15 +669,25 @@ static int vmballoon_alloc_page_list(struct vmballoon *b,
unsigned int i;
 
for (i = 0; i < req_n_pages; i++) {
-   if (ctl->page_size == VMW_BALLOON_2M_PAGE)
-   page = alloc_pages(__GFP_HIGHMEM|__GFP_NOWARN|
+   /*
+* First check if we happen to have pages that were allocated
+* before. This happens when a 2MB page was rejected during inflation
+* by the hypervisor, and then split into 4KB pages.
+*/
+   if (!list_empty(&ctl->prealloc_pages)) {
+   page = list_first_entry(&ctl->prealloc_pages,
+   struct page, lru);
+   list_del(&page->lru);
+   } else {
+   if (ctl->page_size == VMW_BALLOON_2M_PAGE)
+   page = alloc_pages(__GFP_HIGHMEM|__GFP_NOWARN|
__GFP_NOMEMALLOC, VMW_BALLOON_2M_ORDER);
-   else
-   page = balloon_page_alloc();
+   else
+   page = balloon_page_alloc();
 
-   /* Update statistics */
-   vmballoon_stats_page_inc(b, VMW_BALLOON_PAGE_STAT_ALLOC,
-ctl->page_size);
+   vmballoon_stats_page_inc(b, VMW_BALLOON_PAGE_STAT_ALLOC,
+ctl->page_size);
+   }
 
if (page) {
vmballoon_mark_page_offline(page, ctl->page_size);
@@ -922,7 +933,8 @@ static void vmballoon_release_page_list(struct list_head 
*page_list,
__free_pages(page, vmballoon_page_order(page_size));
}
 
-   *n_pages = 0;
+   if (n_pages)
+   *n_pages = 0;
 }
 
 
@@ -1054,6 +1066,32 @@ static void vmballoon_dequeue_page_list(struct vmballoon 
*b,
*n_pages = i;
 }
 
+/**
+ * vmballoon_split_refused_pages() - Split the 2MB refused pages to 4k.
+ *
+ * If inflation of 2MB pages was denied by the hypervisor, it is likely to be
+ * due to one or few 4KB pages. These 2MB pages may keep being allocated and
+ * then being refused. To prevent this case, this function splits the refused
+ * pages into 4KB pages and adds them into @prealloc_pages list.
+ *
+ * @ctl: pointer for the %struct vmballoon_ctl, which defines the operation.
+ */
+static void vmballoon_split_refused_pages(struct vmballoon_ctl *ctl)
+{
+   struct page *page, *tmp;
+   unsigned int i, order;
+
+   order = vmballoon_page_order(ctl->page_size);
+
+   list_for_each_entry_safe(page, tmp, &ctl->refused_pages, lru) {
+   list_del(&page->lru);
+   split_page(page, order);
+   for (i = 0; i < (1 << order); i++)
+   list_add(&page[i].lru, &ctl->prealloc_pages);
+   }
+   ctl->n_refused_pages = 0;
+}
+
 /**
  * vmballoon_inflate() - Inflate the balloon towards its target size.
  *
@@ -1065,6 +1103,7 @@ static void vmballoon_inflate(struct vmballoon *b)
struct vmballoon_ctl ctl = {
.pages = LIST_HEAD_INIT(ctl.pages),
.refused_pages = LIST_HEAD_INIT(ctl.refused_pages),
+   .prealloc_pages = LIST_HEAD_INIT(ctl.prealloc_pages),
.page_size = b->max_page_size,
.op = VMW_BALLOON_INFLATE
};
@@ -1112,10 +1151,10 @@ static void vmballoon_inflate(struct vmballoon *b)
break;
 
/*
-* Ignore errors from locking as we now switch to 4k
-* pages and we might get different errors.
+ 

[PATCH v4 2/4] vmw_balloon: Compaction support

2019-04-25 Thread Nadav Amit via Virtualization
Add compaction support to the VMware balloon. Unlike the virtio
balloon, we also support huge pages, which do not go through
compaction, so we keep these pages in vmballoon and handle this list
separately. We use the same lock to protect both lists, as this lock is
not expected to be contended.

Doing so also eliminates the need for the page_size lists. We update the
accounting as needed so that inflation, deflation and migration are
reflected in vmstat.

Since VMware balloon now provides statistics for inflation, deflation
and migration in vmstat, select MEMORY_BALLOON in Kconfig.
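
In outline, enqueueing then looks roughly like this (a simplified sketch of
the approach rather than the exact hunk below; 4KB pages go through the
balloon-compaction list so they stay migratable, while 2MB pages are kept on
a driver-private list under the same pages_lock):

/* Sketch: 4KB pages stay migratable, 2MB pages go on huge_pages. */
static void example_enqueue_page_list(struct vmballoon *b,
                                      struct list_head *pages,
                                      unsigned int *n_pages,
                                      enum vmballoon_page_size_type page_size)
{
        unsigned long flags;

        if (page_size == VMW_BALLOON_4K_PAGE) {
                balloon_page_list_enqueue(&b->b_dev_info, pages);
        } else {
                /* 2MB pages bypass compaction; track them ourselves. */
                spin_lock_irqsave(&b->b_dev_info.pages_lock, flags);
                list_splice_init(pages, &b->huge_pages);
                spin_unlock_irqrestore(&b->b_dev_info.pages_lock, flags);
        }
        *n_pages = 0;
}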

Reviewed-by: Xavier Deguillard 
Signed-off-by: Nadav Amit 
---
 drivers/misc/Kconfig   |   1 +
 drivers/misc/vmw_balloon.c | 301 -
 2 files changed, 264 insertions(+), 38 deletions(-)

diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index 42ab8ec92a04..427cf10579b4 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -420,6 +420,7 @@ config SPEAR13XX_PCIE_GADGET
 config VMWARE_BALLOON
tristate "VMware Balloon Driver"
depends on VMWARE_VMCI && X86 && HYPERVISOR_GUEST
+   select MEMORY_BALLOON
help
  This is VMware physical memory management driver which acts
  like a "balloon" that can be inflated to reclaim physical pages
diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c
index ad807d5a3141..2136f6ad97d3 100644
--- a/drivers/misc/vmw_balloon.c
+++ b/drivers/misc/vmw_balloon.c
@@ -28,6 +28,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -38,25 +40,11 @@ MODULE_ALIAS("dmi:*:svnVMware*:*");
 MODULE_ALIAS("vmware_vmmemctl");
 MODULE_LICENSE("GPL");
 
-/*
- * Use __GFP_HIGHMEM to allow pages from HIGHMEM zone. We don't allow wait
- * (__GFP_RECLAIM) for huge page allocations. Use __GFP_NOWARN, to suppress 
page
- * allocation failure warnings. Disallow access to emergency low-memory pools.
- */
-#define VMW_HUGE_PAGE_ALLOC_FLAGS  (__GFP_HIGHMEM|__GFP_NOWARN|\
-__GFP_NOMEMALLOC)
-
-/*
- * Use __GFP_HIGHMEM to allow pages from HIGHMEM zone. We allow lightweight
- * reclamation (__GFP_NORETRY). Use __GFP_NOWARN, to suppress page allocation
- * failure warnings. Disallow access to emergency low-memory pools.
- */
-#define VMW_PAGE_ALLOC_FLAGS   (__GFP_HIGHMEM|__GFP_NOWARN|\
-__GFP_NOMEMALLOC|__GFP_NORETRY)
-
-/* Maximum number of refused pages we accumulate during inflation cycle */
 #define VMW_BALLOON_MAX_REFUSED16
 
+/* Magic number for the balloon mount-point */
+#define BALLOON_VMW_MAGIC  0x0ba11007
+
 /*
  * Hypervisor communication port definitions.
  */
@@ -247,11 +235,6 @@ struct vmballoon_ctl {
enum vmballoon_op op;
 };
 
-struct vmballoon_page_size {
-   /* list of reserved physical pages */
-   struct list_head pages;
-};
-
 /**
  * struct vmballoon_batch_entry - a batch entry for lock or unlock.
  *
@@ -266,8 +249,6 @@ struct vmballoon_batch_entry {
 } __packed;
 
 struct vmballoon {
-   struct vmballoon_page_size page_sizes[VMW_BALLOON_NUM_PAGE_SIZES];
-
/**
 * @max_page_size: maximum supported page size for ballooning.
 *
@@ -348,8 +329,20 @@ struct vmballoon {
struct dentry *dbg_entry;
 #endif
 
+   /**
+* @b_dev_info: balloon device information descriptor.
+*/
+   struct balloon_dev_info b_dev_info;
+
struct delayed_work dwork;
 
+   /**
+* @huge_pages - list of the inflated 2MB pages.
+*
+* Protected by @b_dev_info.pages_lock .
+*/
+   struct list_head huge_pages;
+
/**
 * @vmci_doorbell.
 *
@@ -643,10 +636,10 @@ static int vmballoon_alloc_page_list(struct vmballoon *b,
 
for (i = 0; i < req_n_pages; i++) {
if (ctl->page_size == VMW_BALLOON_2M_PAGE)
-   page = alloc_pages(VMW_HUGE_PAGE_ALLOC_FLAGS,
-  VMW_BALLOON_2M_ORDER);
+   page = alloc_pages(__GFP_HIGHMEM|__GFP_NOWARN|
+   __GFP_NOMEMALLOC, VMW_BALLOON_2M_ORDER);
else
-   page = alloc_page(VMW_PAGE_ALLOC_FLAGS);
+   page = balloon_page_alloc();
 
/* Update statistics */
vmballoon_stats_page_inc(b, VMW_BALLOON_PAGE_STAT_ALLOC,
@@ -961,9 +954,22 @@ static void vmballoon_enqueue_page_list(struct vmballoon 
*b,
unsigned int *n_pages,
enum vmballoon_page_size_type page_size)
 {
-   struct vmballoon_page_size *page_size_info = &b->page_sizes[page_size];
+   unsigned long flags;
+
+   if (page_size == VMW_BALLOON_4K_PAGE) {
+ 

[PATCH v4 3/4] vmw_balloon: Add memory shrinker

2019-04-25 Thread Nadav Amit via Virtualization
Add a shrinker to the VMware balloon to prevent out-of-memory events.
We reuse the deflate logic for this purpose. Deadlocks should not happen,
as no memory allocation is performed while the locks of the
communication (batch/page) and page-list are taken. In the unlikely
event in which the configuration semaphore is taken for write, we bail
out and fail gracefully (causing processes to be killed).

Once the shrinker is called, inflation is postponed for a few seconds.
The timeout is updated without any lock, but this should not cause any
races, as it is written and read atomically.

This feature is disabled by default, since it might cause performance
degradation.
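
For context, registering such a shrinker generally looks like the sketch
below (based on the struct shrinker interface of this kernel generation; the
count callback and helper names are assumptions, not necessarily the exact
code in this patch):

static unsigned long example_shrinker_count(struct shrinker *shrinker,
                                            struct shrink_control *sc)
{
        /* Report how many page frames the balloon could give back. */
        return atomic64_read(&balloon.size);
}

static int example_register_shrinker(struct vmballoon *b)
{
        if (!vmwballoon_shrinker_enable)
                return 0;

        b->shrinker.scan_objects = vmballoon_shrinker_scan;
        b->shrinker.count_objects = example_shrinker_count;
        b->shrinker.seeks = DEFAULT_SEEKS;

        return register_shrinker(&b->shrinker);
}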

Reviewed-by: Xavier Deguillard 
Signed-off-by: Nadav Amit 
---
 drivers/misc/vmw_balloon.c | 133 -
 1 file changed, 131 insertions(+), 2 deletions(-)

diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c
index 2136f6ad97d3..4b5e939ff4c8 100644
--- a/drivers/misc/vmw_balloon.c
+++ b/drivers/misc/vmw_balloon.c
@@ -40,6 +40,15 @@ MODULE_ALIAS("dmi:*:svnVMware*:*");
 MODULE_ALIAS("vmware_vmmemctl");
 MODULE_LICENSE("GPL");
 
+static bool __read_mostly vmwballoon_shrinker_enable;
+module_param(vmwballoon_shrinker_enable, bool, 0444);
+MODULE_PARM_DESC(vmwballoon_shrinker_enable,
+   "Enable non-cooperative out-of-memory protection. Disabled by default 
as it may degrade performance.");
+
+/* Delay in seconds after shrink before inflation. */
+#define VMBALLOON_SHRINK_DELAY (5)
+
+/* Maximum number of refused pages we accumulate during inflation cycle */
 #define VMW_BALLOON_MAX_REFUSED16
 
 /* Magic number for the balloon mount-point */
@@ -217,12 +226,13 @@ enum vmballoon_stat_general {
VMW_BALLOON_STAT_TIMER,
VMW_BALLOON_STAT_DOORBELL,
VMW_BALLOON_STAT_RESET,
-   VMW_BALLOON_STAT_LAST = VMW_BALLOON_STAT_RESET
+   VMW_BALLOON_STAT_SHRINK,
+   VMW_BALLOON_STAT_SHRINK_FREE,
+   VMW_BALLOON_STAT_LAST = VMW_BALLOON_STAT_SHRINK_FREE
 };
 
 #define VMW_BALLOON_STAT_NUM   (VMW_BALLOON_STAT_LAST + 1)
 
-
 static DEFINE_STATIC_KEY_TRUE(vmw_balloon_batching);
 static DEFINE_STATIC_KEY_FALSE(balloon_stat_enabled);
 
@@ -321,6 +331,15 @@ struct vmballoon {
 */
struct page *page;
 
+   /**
+* @shrink_timeout: timeout until the next inflation.
+*
+* After an shrink event, indicates the time in jiffies after which
+* inflation is allowed again. Can be written concurrently with reads,
+* so must use READ_ONCE/WRITE_ONCE when accessing.
+*/
+   unsigned long shrink_timeout;
+
/* statistics */
struct vmballoon_stats *stats;
 
@@ -361,6 +380,20 @@ struct vmballoon {
 * Lock ordering: @conf_sem -> @comm_lock .
 */
spinlock_t comm_lock;
+
+   /**
+* @shrinker: shrinker interface that is used to avoid over-inflation.
+*/
+   struct shrinker shrinker;
+
+   /**
+* @shrinker_registered: whether the shrinker was registered.
+*
+* The shrinker interface does not handle gracefully the removal of
+* shrinker that was not registered before. This indication allows to
+* simplify the unregistration process.
+*/
+   bool shrinker_registered;
 };
 
 static struct vmballoon balloon;
@@ -935,6 +968,10 @@ static int64_t vmballoon_change(struct vmballoon *b)
size - target < vmballoon_page_in_frames(VMW_BALLOON_2M_PAGE))
return 0;
 
+   /* If an out-of-memory recently occurred, inflation is disallowed. */
+   if (target > size && time_before(jiffies, READ_ONCE(b->shrink_timeout)))
+   return 0;
+
return target - size;
 }
 
@@ -1430,6 +1467,90 @@ static void vmballoon_work(struct work_struct *work)
 
 }
 
+/**
+ * vmballoon_shrinker_scan() - deflate the balloon due to memory pressure.
+ * @shrinker: pointer to the balloon shrinker.
+ * @sc: page reclaim information.
+ *
+ * Returns: number of pages that were freed during deflation.
+ */
+static unsigned long vmballoon_shrinker_scan(struct shrinker *shrinker,
+struct shrink_control *sc)
+{
+   struct vmballoon *b = &balloon;
+   unsigned long deflated_frames;
+
+   pr_debug("%s - size: %llu", __func__, atomic64_read(&b->size));
+
+   vmballoon_stats_gen_inc(b, VMW_BALLOON_STAT_SHRINK);
+
+   /*
+* If the lock is also contended for read, we cannot easily reclaim and
+* we bail out.
+*/
+   if (!down_read_trylock(&b->conf_sem))
+   return 0;
+
+   deflated_frames = vmballoon_deflate(b, sc->nr_to_scan, true);
+
+   vmballoon_stats_gen_add(b, VMW_BALLOON_STAT_SHRINK_FREE,
+   deflated_frames);
+
+   /*
+* Delay future inflation for some time to mitigate the situat

[PATCH v4 1/4] mm/balloon_compaction: List interfaces

2019-04-25 Thread Nadav Amit via Virtualization
Introduce interfaces for ballooning enqueueing and dequeueing of a list
of pages. These interfaces reduce the overhead of storing and restoring
IRQs by batching the operations. In addition they do not panic if the
list of pages is empty.
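
As a rough usage sketch (example_inflate() is a made-up name, and the
snippet assumes the caller owns a struct balloon_dev_info), an inflation
path can batch a whole allocation round into a single enqueue call:

	static size_t example_inflate(struct balloon_dev_info *b_dev_info,
				      size_t nr_pages)
	{
		LIST_HEAD(pages);
		size_t i;

		for (i = 0; i < nr_pages; i++) {
			struct page *page = balloon_page_alloc();

			if (!page)
				break;
			list_add(&page->lru, &pages);
		}

		/* One lock acquisition and IRQ save/restore for the whole batch. */
		return balloon_page_list_enqueue(b_dev_info, &pages);
	}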

Cc: Jason Wang 
Cc: linux...@kvack.org
Cc: virtualization@lists.linux-foundation.org
Acked-by: Michael S. Tsirkin 
Reviewed-by: Xavier Deguillard 
Signed-off-by: Nadav Amit 
---
 include/linux/balloon_compaction.h |   4 +
 mm/balloon_compaction.c| 144 +
 2 files changed, 110 insertions(+), 38 deletions(-)

diff --git a/include/linux/balloon_compaction.h 
b/include/linux/balloon_compaction.h
index f111c780ef1d..430b6047cef7 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -64,6 +64,10 @@ extern struct page *balloon_page_alloc(void);
 extern void balloon_page_enqueue(struct balloon_dev_info *b_dev_info,
 struct page *page);
 extern struct page *balloon_page_dequeue(struct balloon_dev_info *b_dev_info);
+extern size_t balloon_page_list_enqueue(struct balloon_dev_info *b_dev_info,
+ struct list_head *pages);
+extern size_t balloon_page_list_dequeue(struct balloon_dev_info *b_dev_info,
+struct list_head *pages, size_t 
n_req_pages);
 
 static inline void balloon_devinfo_init(struct balloon_dev_info *balloon)
 {
diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index ef858d547e2d..b7bd72612c5a 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -10,6 +10,105 @@
 #include 
 #include 
 
+static void balloon_page_enqueue_one(struct balloon_dev_info *b_dev_info,
+struct page *page)
+{
+   /*
+* Block others from accessing the 'page' when we get around to
+* establishing additional references. We should be the only one
+* holding a reference to the 'page' at this point. If we are not, then
+* memory corruption is possible and we should stop execution.
+*/
+   BUG_ON(!trylock_page(page));
+   list_del(&page->lru);
+   balloon_page_insert(b_dev_info, page);
+   unlock_page(page);
+   __count_vm_event(BALLOON_INFLATE);
+}
+
+/**
+ * balloon_page_list_enqueue() - inserts a list of pages into the balloon page
+ *  list.
+ * @b_dev_info: balloon device descriptor where we will insert a new page to
+ * @pages: pages to enqueue - allocated using balloon_page_alloc.
+ *
+ * Driver must call it to properly enqueue balloon pages before definitively
+ * removing them from the guest system.
+ *
+ * Return: number of pages that were enqueued.
+ */
+size_t balloon_page_list_enqueue(struct balloon_dev_info *b_dev_info,
+struct list_head *pages)
+{
+   struct page *page, *tmp;
+   unsigned long flags;
+   size_t n_pages = 0;
+
+   spin_lock_irqsave(&b_dev_info->pages_lock, flags);
+   list_for_each_entry_safe(page, tmp, pages, lru) {
+   balloon_page_enqueue_one(b_dev_info, page);
+   n_pages++;
+   }
+   spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
+   return n_pages;
+}
+EXPORT_SYMBOL_GPL(balloon_page_list_enqueue);
+
+/**
+ * balloon_page_list_dequeue() - removes pages from balloon's page list and
+ *  returns a list of the pages.
+ * @b_dev_info: balloon device descriptor where we will grab a page from.
+ * @pages: pointer to the list of pages that would be returned to the caller.
+ * @n_req_pages: number of requested pages.
+ *
+ * Driver must call this function to properly de-allocate previously enlisted
+ * balloon pages before definitively releasing them back to the guest system.
+ * This function tries to remove @n_req_pages from the ballooned pages and
+ * return them to the caller in the @pages list.
+ *
+ * Note that this function may fail to dequeue some pages if the page list is
+ * temporarily empty due to compaction of isolated pages.
+ *
+ * Return: number of pages that were added to the @pages list.
+ */
+size_t balloon_page_list_dequeue(struct balloon_dev_info *b_dev_info,
+struct list_head *pages, size_t n_req_pages)
+{
+   struct page *page, *tmp;
+   unsigned long flags;
+   size_t n_pages = 0;
+
+   spin_lock_irqsave(&b_dev_info->pages_lock, flags);
+   list_for_each_entry_safe(page, tmp, &b_dev_info->pages, lru) {
+   if (n_pages == n_req_pages)
+   break;
+
+   /*
+* Block others from accessing the 'page' while we get around to
+* establishing additional references and preparing the 'page'
+* to be released by the balloon driver.
+*/
+   if (!trylock_page(page))
+   continue;
+
+   if (IS_ENABLED(CONFIG_BAL

[PATCH v4 0/4] vmw_balloon: Compaction and shrinker support

2019-04-25 Thread Nadav Amit via Virtualization
VMware balloon enhancements: adding support for memory compaction,
memory shrinker (to prevent OOM) and splitting of refused pages to
prevent recurring inflations.

Patches 1-2: Support for compaction
Patch 3: Support for memory shrinker - disabled by default
Patch 4: Split refused pages to improve performance

v3->v4:
* "get around to" comment [Michael]
* Put list_add under page lock [Michael]

v2->v3:
* Fixing wrong argument type (int->size_t) [Michael]
* Fixing a comment (it) [Michael]
* Reinstating the BUG_ON() when page is locked [Michael] 

v1->v2:
* Return number of pages in list enqueue/dequeue interfaces [Michael]
* Removed first two patches which were already merged

Nadav Amit (4):
  mm/balloon_compaction: List interfaces
  vmw_balloon: Compaction support
  vmw_balloon: Add memory shrinker
  vmw_balloon: Split refused pages

 drivers/misc/Kconfig   |   1 +
 drivers/misc/vmw_balloon.c | 489 ++---
 include/linux/balloon_compaction.h |   4 +
 mm/balloon_compaction.c| 144 ++---
 4 files changed, 553 insertions(+), 85 deletions(-)

-- 
2.19.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v3 1/4] mm/balloon_compaction: list interfaces

2019-04-25 Thread Nadav Amit via Virtualization
> On Apr 24, 2019, at 6:49 AM, Michael S. Tsirkin  wrote:
> 
> On Tue, Apr 23, 2019 at 04:45:28PM -0700, Nadav Amit wrote:
>> Introduce interfaces for ballooning enqueueing and dequeueing of a list
>> of pages. These interfaces reduce the overhead of storing and restoring
>> IRQs by batching the operations. In addition they do not panic if the
>> list of pages is empty.
>> 
>> Cc: "Michael S. Tsirkin" 
>> Cc: Jason Wang 
>> Cc: linux...@kvack.org
>> Cc: virtualization@lists.linux-foundation.org
>> Reviewed-by: Xavier Deguillard 
>> Signed-off-by: Nadav Amit 
> 
> 
> Looks good overall. Two minor comments below.
> 
> 
>> ---
>> include/linux/balloon_compaction.h |   4 +
>> mm/balloon_compaction.c| 144 +
>> 2 files changed, 110 insertions(+), 38 deletions(-)
>> 
>> diff --git a/include/linux/balloon_compaction.h 
>> b/include/linux/balloon_compaction.h
>> index f111c780ef1d..430b6047cef7 100644
>> --- a/include/linux/balloon_compaction.h
>> +++ b/include/linux/balloon_compaction.h
>> @@ -64,6 +64,10 @@ extern struct page *balloon_page_alloc(void);
>> extern void balloon_page_enqueue(struct balloon_dev_info *b_dev_info,
>>   struct page *page);
>> extern struct page *balloon_page_dequeue(struct balloon_dev_info 
>> *b_dev_info);
>> +extern size_t balloon_page_list_enqueue(struct balloon_dev_info *b_dev_info,
>> +  struct list_head *pages);
>> +extern size_t balloon_page_list_dequeue(struct balloon_dev_info *b_dev_info,
>> + struct list_head *pages, size_t 
>> n_req_pages);
>> 
>> static inline void balloon_devinfo_init(struct balloon_dev_info *balloon)
>> {
>> diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
>> index ef858d547e2d..a2995002edc2 100644
>> --- a/mm/balloon_compaction.c
>> +++ b/mm/balloon_compaction.c
>> @@ -10,6 +10,105 @@
>> #include 
>> #include 
>> 
>> +static void balloon_page_enqueue_one(struct balloon_dev_info *b_dev_info,
>> + struct page *page)
>> +{
>> +/*
>> + * Block others from accessing the 'page' when we get around to
>> + * establishing additional references. We should be the only one
>> + * holding a reference to the 'page' at this point. If we are not, then
>> + * memory corruption is possible and we should stop execution.
>> + */
>> +BUG_ON(!trylock_page(page));
>> +list_del(&page->lru);
>> +balloon_page_insert(b_dev_info, page);
>> +unlock_page(page);
>> +__count_vm_event(BALLOON_INFLATE);
>> +}
>> +
>> +/**
>> + * balloon_page_list_enqueue() - inserts a list of pages into the balloon 
>> page
>> + *   list.
>> + * @b_dev_info: balloon device descriptor where we will insert a new page to
>> + * @pages: pages to enqueue - allocated using balloon_page_alloc.
>> + *
>> + * Driver must call it to properly enqueue a balloon pages before 
>> definitively
>> + * removing it from the guest system.
>> + *
>> + * Return: number of pages that were enqueued.
>> + */
>> +size_t balloon_page_list_enqueue(struct balloon_dev_info *b_dev_info,
>> + struct list_head *pages)
>> +{
>> +struct page *page, *tmp;
>> +unsigned long flags;
>> +size_t n_pages = 0;
>> +
>> +spin_lock_irqsave(&b_dev_info->pages_lock, flags);
>> +list_for_each_entry_safe(page, tmp, pages, lru) {
>> +balloon_page_enqueue_one(b_dev_info, page);
>> +n_pages++;
>> +}
>> +spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
>> +return n_pages;
>> +}
>> +EXPORT_SYMBOL_GPL(balloon_page_list_enqueue);
>> +
>> +/**
>> + * balloon_page_list_dequeue() - removes pages from balloon's page list and
>> + *   returns a list of the pages.
>> + * @b_dev_info: balloon device decriptor where we will grab a page from.
>> + * @pages: pointer to the list of pages that would be returned to the 
>> caller.
>> + * @n_req_pages: number of requested pages.
>> + *
>> + * Driver must call this function to properly de-allocate a previous 
>> enlisted
>> + * balloon pages before definetively releasing it back to the guest system.
>> + * This function tries to remove @n_req_pages from the ballooned pages and
>> + * return them to the caller in the

[PATCH v3 1/4] mm/balloon_compaction: list interfaces

2019-04-24 Thread Nadav Amit via Virtualization
Introduce interfaces for ballooning enqueueing and dequeueing of a list
of pages. These interfaces reduce the overhead of storing and restoring
IRQs by batching the operations. In addition they do not panic if the
list of pages is empty.
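
For the deflation direction, a caller could use the batched dequeue
roughly as below; example_deflate() and example_release_pages() are
made-up names for illustration only:

	static void example_deflate(struct balloon_dev_info *b_dev_info,
				    size_t n_req_pages)
	{
		LIST_HEAD(pages);
		size_t n_pages;

		/* Pull up to n_req_pages out of the balloon in one batch. */
		n_pages = balloon_page_list_dequeue(b_dev_info, &pages,
						    n_req_pages);

		/*
		 * A real driver would now notify the hypervisor about the
		 * n_pages dequeued pages and free them back to the system;
		 * example_release_pages() stands in for that step.
		 */
		example_release_pages(&pages, n_pages);
	}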

Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: linux...@kvack.org
Cc: virtualization@lists.linux-foundation.org
Reviewed-by: Xavier Deguillard 
Signed-off-by: Nadav Amit 
---
 include/linux/balloon_compaction.h |   4 +
 mm/balloon_compaction.c| 144 +
 2 files changed, 110 insertions(+), 38 deletions(-)

diff --git a/include/linux/balloon_compaction.h 
b/include/linux/balloon_compaction.h
index f111c780ef1d..430b6047cef7 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -64,6 +64,10 @@ extern struct page *balloon_page_alloc(void);
 extern void balloon_page_enqueue(struct balloon_dev_info *b_dev_info,
 struct page *page);
 extern struct page *balloon_page_dequeue(struct balloon_dev_info *b_dev_info);
+extern size_t balloon_page_list_enqueue(struct balloon_dev_info *b_dev_info,
+ struct list_head *pages);
+extern size_t balloon_page_list_dequeue(struct balloon_dev_info *b_dev_info,
+struct list_head *pages, size_t 
n_req_pages);
 
 static inline void balloon_devinfo_init(struct balloon_dev_info *balloon)
 {
diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index ef858d547e2d..a2995002edc2 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -10,6 +10,105 @@
 #include 
 #include 
 
+static void balloon_page_enqueue_one(struct balloon_dev_info *b_dev_info,
+struct page *page)
+{
+   /*
+* Block others from accessing the 'page' when we get around to
+* establishing additional references. We should be the only one
+* holding a reference to the 'page' at this point. If we are not, then
+* memory corruption is possible and we should stop execution.
+*/
+   BUG_ON(!trylock_page(page));
+   list_del(&page->lru);
+   balloon_page_insert(b_dev_info, page);
+   unlock_page(page);
+   __count_vm_event(BALLOON_INFLATE);
+}
+
+/**
+ * balloon_page_list_enqueue() - inserts a list of pages into the balloon page
+ *  list.
+ * @b_dev_info: balloon device descriptor where we will insert a new page to
+ * @pages: pages to enqueue - allocated using balloon_page_alloc.
+ *
+ * Driver must call it to properly enqueue balloon pages before definitively
+ * removing them from the guest system.
+ *
+ * Return: number of pages that were enqueued.
+ */
+size_t balloon_page_list_enqueue(struct balloon_dev_info *b_dev_info,
+struct list_head *pages)
+{
+   struct page *page, *tmp;
+   unsigned long flags;
+   size_t n_pages = 0;
+
+   spin_lock_irqsave(&b_dev_info->pages_lock, flags);
+   list_for_each_entry_safe(page, tmp, pages, lru) {
+   balloon_page_enqueue_one(b_dev_info, page);
+   n_pages++;
+   }
+   spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
+   return n_pages;
+}
+EXPORT_SYMBOL_GPL(balloon_page_list_enqueue);
+
+/**
+ * balloon_page_list_dequeue() - removes pages from balloon's page list and
+ *  returns a list of the pages.
+ * @b_dev_info: balloon device descriptor where we will grab a page from.
+ * @pages: pointer to the list of pages that would be returned to the caller.
+ * @n_req_pages: number of requested pages.
+ *
+ * Driver must call this function to properly de-allocate previously enlisted
+ * balloon pages before definitively releasing them back to the guest system.
+ * This function tries to remove @n_req_pages from the ballooned pages and
+ * return them to the caller in the @pages list.
+ *
+ * Note that this function may fail to dequeue some pages if the page list is
+ * temporarily empty due to compaction of isolated pages.
+ *
+ * Return: number of pages that were added to the @pages list.
+ */
+size_t balloon_page_list_dequeue(struct balloon_dev_info *b_dev_info,
+struct list_head *pages, size_t n_req_pages)
+{
+   struct page *page, *tmp;
+   unsigned long flags;
+   size_t n_pages = 0;
+
+   spin_lock_irqsave(&b_dev_info->pages_lock, flags);
+   list_for_each_entry_safe(page, tmp, &b_dev_info->pages, lru) {
+   if (n_pages == n_req_pages)
+   break;
+
+   /*
+* Block others from accessing the 'page' while we get around
+* establishing additional references and preparing the 'page'
+* to be released by the balloon driver.
+*/
+   if (!trylock_page(page))
+   continue;
+
+   if (IS_ENABLED(

[PATCH v3 2/4] vmw_balloon: compaction support

2019-04-24 Thread Nadav Amit via Virtualization
Add support for compaction to the VMware balloon. Unlike the virtio
balloon, we also support huge pages, which do not go through compaction,
so we keep these pages in vmballoon and handle their list separately. We
use the same lock to protect both lists, as this lock is not expected to
be contended.

Doing so also eliminates the need for the page_size lists. We update the
accounting as needed so that inflation, deflation and migration are
reflected in vmstat.

Since VMware balloon now provides statistics for inflation, deflation
and migration in vmstat, select MEMORY_BALLOON in Kconfig.
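
In other words, only the 4KB pages are handed to the balloon_compaction
core (so they stay migratable), while 2MB pages are chained on a
driver-private list under the same pages_lock. A simplified sketch of
that split, approximating the enqueue path rather than quoting it:

	if (page_size == VMW_BALLOON_4K_PAGE) {
		balloon_page_list_enqueue(&b->b_dev_info, pages);
	} else {
		unsigned long flags;

		/* 2MB pages are not migratable; keep them on our own list. */
		spin_lock_irqsave(&b->b_dev_info.pages_lock, flags);
		list_splice_init(pages, &b->huge_pages);
		spin_unlock_irqrestore(&b->b_dev_info.pages_lock, flags);
	}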

Reviewed-by: Xavier Deguillard 
Signed-off-by: Nadav Amit 
---
 drivers/misc/Kconfig   |   1 +
 drivers/misc/vmw_balloon.c | 301 -
 2 files changed, 264 insertions(+), 38 deletions(-)

diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index 42ab8ec92a04..427cf10579b4 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -420,6 +420,7 @@ config SPEAR13XX_PCIE_GADGET
 config VMWARE_BALLOON
tristate "VMware Balloon Driver"
depends on VMWARE_VMCI && X86 && HYPERVISOR_GUEST
+   select MEMORY_BALLOON
help
  This is VMware physical memory management driver which acts
  like a "balloon" that can be inflated to reclaim physical pages
diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c
index ad807d5a3141..2136f6ad97d3 100644
--- a/drivers/misc/vmw_balloon.c
+++ b/drivers/misc/vmw_balloon.c
@@ -28,6 +28,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -38,25 +40,11 @@ MODULE_ALIAS("dmi:*:svnVMware*:*");
 MODULE_ALIAS("vmware_vmmemctl");
 MODULE_LICENSE("GPL");
 
-/*
- * Use __GFP_HIGHMEM to allow pages from HIGHMEM zone. We don't allow wait
- * (__GFP_RECLAIM) for huge page allocations. Use __GFP_NOWARN, to suppress 
page
- * allocation failure warnings. Disallow access to emergency low-memory pools.
- */
-#define VMW_HUGE_PAGE_ALLOC_FLAGS  (__GFP_HIGHMEM|__GFP_NOWARN|\
-__GFP_NOMEMALLOC)
-
-/*
- * Use __GFP_HIGHMEM to allow pages from HIGHMEM zone. We allow lightweight
- * reclamation (__GFP_NORETRY). Use __GFP_NOWARN, to suppress page allocation
- * failure warnings. Disallow access to emergency low-memory pools.
- */
-#define VMW_PAGE_ALLOC_FLAGS   (__GFP_HIGHMEM|__GFP_NOWARN|\
-__GFP_NOMEMALLOC|__GFP_NORETRY)
-
-/* Maximum number of refused pages we accumulate during inflation cycle */
 #define VMW_BALLOON_MAX_REFUSED16
 
+/* Magic number for the balloon mount-point */
+#define BALLOON_VMW_MAGIC  0x0ba11007
+
 /*
  * Hypervisor communication port definitions.
  */
@@ -247,11 +235,6 @@ struct vmballoon_ctl {
enum vmballoon_op op;
 };
 
-struct vmballoon_page_size {
-   /* list of reserved physical pages */
-   struct list_head pages;
-};
-
 /**
  * struct vmballoon_batch_entry - a batch entry for lock or unlock.
  *
@@ -266,8 +249,6 @@ struct vmballoon_batch_entry {
 } __packed;
 
 struct vmballoon {
-   struct vmballoon_page_size page_sizes[VMW_BALLOON_NUM_PAGE_SIZES];
-
/**
 * @max_page_size: maximum supported page size for ballooning.
 *
@@ -348,8 +329,20 @@ struct vmballoon {
struct dentry *dbg_entry;
 #endif
 
+   /**
+* @b_dev_info: balloon device information descriptor.
+*/
+   struct balloon_dev_info b_dev_info;
+
struct delayed_work dwork;
 
+   /**
+* @huge_pages - list of the inflated 2MB pages.
+*
+* Protected by @b_dev_info.pages_lock .
+*/
+   struct list_head huge_pages;
+
/**
 * @vmci_doorbell.
 *
@@ -643,10 +636,10 @@ static int vmballoon_alloc_page_list(struct vmballoon *b,
 
for (i = 0; i < req_n_pages; i++) {
if (ctl->page_size == VMW_BALLOON_2M_PAGE)
-   page = alloc_pages(VMW_HUGE_PAGE_ALLOC_FLAGS,
-  VMW_BALLOON_2M_ORDER);
+   page = alloc_pages(__GFP_HIGHMEM|__GFP_NOWARN|
+   __GFP_NOMEMALLOC, VMW_BALLOON_2M_ORDER);
else
-   page = alloc_page(VMW_PAGE_ALLOC_FLAGS);
+   page = balloon_page_alloc();
 
/* Update statistics */
vmballoon_stats_page_inc(b, VMW_BALLOON_PAGE_STAT_ALLOC,
@@ -961,9 +954,22 @@ static void vmballoon_enqueue_page_list(struct vmballoon 
*b,
unsigned int *n_pages,
enum vmballoon_page_size_type page_size)
 {
-   struct vmballoon_page_size *page_size_info = &b->page_sizes[page_size];
+   unsigned long flags;
+
+   if (page_size == VMW_BALLOON_4K_PAGE) {
+ 

[PATCH v3 3/4] vmw_balloon: add memory shrinker

2019-04-24 Thread Nadav Amit via Virtualization
Add a shrinker to the VMware balloon to prevent out-of-memory events.
We reuse the deflate logic for this purpose. Deadlocks should not happen,
as no memory allocation is performed while the communication (batch/page)
and page-list locks are held. In the unlikely event that the
configuration semaphore is taken for write, we bail out and fail
gracefully (which may cause processes to be killed).

Once the shrinker is called, inflation is postponed for a few seconds.
The timeout is updated without any lock, but this should not cause any
races, as it is written and read atomically.

This feature is disabled by default, since it might cause performance
degradation.
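
The mechanism described here boils down to the standard shrinker
callbacks. Roughly, registration could look like the sketch below; the
count callback and the helper names are an approximation for
illustration, not quoted from the patch (the fields are the usual
count_objects, scan_objects and seeks of the shrinker API of that time):

	static unsigned long vmballoon_shrinker_count(struct shrinker *shrinker,
						      struct shrink_control *sc)
	{
		struct vmballoon *b = &balloon;

		/* Report how many page frames deflation could give back. */
		return atomic64_read(&b->size);
	}

	static int vmballoon_register_shrinker(struct vmballoon *b)
	{
		b->shrinker.scan_objects = vmballoon_shrinker_scan;
		b->shrinker.count_objects = vmballoon_shrinker_count;
		b->shrinker.seeks = DEFAULT_SEEKS;

		return register_shrinker(&b->shrinker);
	}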

Reviewed-by: Xavier Deguillard 
Signed-off-by: Nadav Amit 
---
 drivers/misc/vmw_balloon.c | 133 -
 1 file changed, 131 insertions(+), 2 deletions(-)

diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c
index 2136f6ad97d3..4b5e939ff4c8 100644
--- a/drivers/misc/vmw_balloon.c
+++ b/drivers/misc/vmw_balloon.c
@@ -40,6 +40,15 @@ MODULE_ALIAS("dmi:*:svnVMware*:*");
 MODULE_ALIAS("vmware_vmmemctl");
 MODULE_LICENSE("GPL");
 
+static bool __read_mostly vmwballoon_shrinker_enable;
+module_param(vmwballoon_shrinker_enable, bool, 0444);
+MODULE_PARM_DESC(vmwballoon_shrinker_enable,
+   "Enable non-cooperative out-of-memory protection. Disabled by default 
as it may degrade performance.");
+
+/* Delay in seconds after shrink before inflation. */
+#define VMBALLOON_SHRINK_DELAY (5)
+
+/* Maximum number of refused pages we accumulate during inflation cycle */
 #define VMW_BALLOON_MAX_REFUSED16
 
 /* Magic number for the balloon mount-point */
@@ -217,12 +226,13 @@ enum vmballoon_stat_general {
VMW_BALLOON_STAT_TIMER,
VMW_BALLOON_STAT_DOORBELL,
VMW_BALLOON_STAT_RESET,
-   VMW_BALLOON_STAT_LAST = VMW_BALLOON_STAT_RESET
+   VMW_BALLOON_STAT_SHRINK,
+   VMW_BALLOON_STAT_SHRINK_FREE,
+   VMW_BALLOON_STAT_LAST = VMW_BALLOON_STAT_SHRINK_FREE
 };
 
 #define VMW_BALLOON_STAT_NUM   (VMW_BALLOON_STAT_LAST + 1)
 
-
 static DEFINE_STATIC_KEY_TRUE(vmw_balloon_batching);
 static DEFINE_STATIC_KEY_FALSE(balloon_stat_enabled);
 
@@ -321,6 +331,15 @@ struct vmballoon {
 */
struct page *page;
 
+   /**
+* @shrink_timeout: timeout until the next inflation.
+*
+* After a shrink event, indicates the time in jiffies after which
+* inflation is allowed again. Can be written concurrently with reads,
+* so must use READ_ONCE/WRITE_ONCE when accessing.
+*/
+   unsigned long shrink_timeout;
+
/* statistics */
struct vmballoon_stats *stats;
 
@@ -361,6 +380,20 @@ struct vmballoon {
 * Lock ordering: @conf_sem -> @comm_lock .
 */
spinlock_t comm_lock;
+
+   /**
+* @shrinker: shrinker interface that is used to avoid over-inflation.
+*/
+   struct shrinker shrinker;
+
+   /**
+* @shrinker_registered: whether the shrinker was registered.
+*
+* The shrinker interface does not handle gracefully the removal of
+* shrinker that was not registered before. This indication allows to
+* simplify the unregistration process.
+*/
+   bool shrinker_registered;
 };
 
 static struct vmballoon balloon;
@@ -935,6 +968,10 @@ static int64_t vmballoon_change(struct vmballoon *b)
size - target < vmballoon_page_in_frames(VMW_BALLOON_2M_PAGE))
return 0;
 
+   /* If an out-of-memory recently occurred, inflation is disallowed. */
+   if (target > size && time_before(jiffies, READ_ONCE(b->shrink_timeout)))
+   return 0;
+
return target - size;
 }
 
@@ -1430,6 +1467,90 @@ static void vmballoon_work(struct work_struct *work)
 
 }
 
+/**
+ * vmballoon_shrinker_scan() - deflate the balloon due to memory pressure.
+ * @shrinker: pointer to the balloon shrinker.
+ * @sc: page reclaim information.
+ *
+ * Returns: number of pages that were freed during deflation.
+ */
+static unsigned long vmballoon_shrinker_scan(struct shrinker *shrinker,
+struct shrink_control *sc)
+{
+   struct vmballoon *b = &balloon;
+   unsigned long deflated_frames;
+
+   pr_debug("%s - size: %llu", __func__, atomic64_read(&b->size));
+
+   vmballoon_stats_gen_inc(b, VMW_BALLOON_STAT_SHRINK);
+
+   /*
+* If the lock is also contended for read, we cannot easily reclaim and
+* we bail out.
+*/
+   if (!down_read_trylock(&b->conf_sem))
+   return 0;
+
+   deflated_frames = vmballoon_deflate(b, sc->nr_to_scan, true);
+
+   vmballoon_stats_gen_add(b, VMW_BALLOON_STAT_SHRINK_FREE,
+   deflated_frames);
+
+   /*
+* Delay future inflation for some time to mitigate the situat

[PATCH v3 4/4] vmw_balloon: split refused pages

2019-04-24 Thread Nadav Amit via Virtualization
The hypervisor might refuse to inflate pages. While the balloon driver
handles this scenario correctly, a refusal to inflate a 2MB page might
cause the same page to be allocated again later just for its inflation
to be refused again. This wastes energy and time.

To avoid this situation, split the 2MB page into 4KB pages, and then try
to inflate each one individually. Most of the 4KB pages out of the 2MB
page should be inflated successfully, which is likely to prevent repeated
refused inflations.

Reviewed-by: Xavier Deguillard 
Signed-off-by: Nadav Amit 
---
 drivers/misc/vmw_balloon.c | 63 +++---
 1 file changed, 52 insertions(+), 11 deletions(-)

diff --git a/drivers/misc/vmw_balloon.c b/drivers/misc/vmw_balloon.c
index 4b5e939ff4c8..043eed845246 100644
--- a/drivers/misc/vmw_balloon.c
+++ b/drivers/misc/vmw_balloon.c
@@ -239,6 +239,7 @@ static DEFINE_STATIC_KEY_FALSE(balloon_stat_enabled);
 struct vmballoon_ctl {
struct list_head pages;
struct list_head refused_pages;
+   struct list_head prealloc_pages;
unsigned int n_refused_pages;
unsigned int n_pages;
enum vmballoon_page_size_type page_size;
@@ -668,15 +669,25 @@ static int vmballoon_alloc_page_list(struct vmballoon *b,
unsigned int i;
 
for (i = 0; i < req_n_pages; i++) {
-   if (ctl->page_size == VMW_BALLOON_2M_PAGE)
-   page = alloc_pages(__GFP_HIGHMEM|__GFP_NOWARN|
+   /*
+* First check if we happen to have pages that were allocated
+* before. This happens when a 2MB page was rejected during inflation
+* by the hypervisor and then split into 4KB pages.
+*/
+   if (!list_empty(&ctl->prealloc_pages)) {
+   page = list_first_entry(&ctl->prealloc_pages,
+   struct page, lru);
+   list_del(&page->lru);
+   } else {
+   if (ctl->page_size == VMW_BALLOON_2M_PAGE)
+   page = alloc_pages(__GFP_HIGHMEM|__GFP_NOWARN|
__GFP_NOMEMALLOC, VMW_BALLOON_2M_ORDER);
-   else
-   page = balloon_page_alloc();
+   else
+   page = balloon_page_alloc();
 
-   /* Update statistics */
-   vmballoon_stats_page_inc(b, VMW_BALLOON_PAGE_STAT_ALLOC,
-ctl->page_size);
+   vmballoon_stats_page_inc(b, VMW_BALLOON_PAGE_STAT_ALLOC,
+ctl->page_size);
+   }
 
if (page) {
vmballoon_mark_page_offline(page, ctl->page_size);
@@ -922,7 +933,8 @@ static void vmballoon_release_page_list(struct list_head 
*page_list,
__free_pages(page, vmballoon_page_order(page_size));
}
 
-   *n_pages = 0;
+   if (n_pages)
+   *n_pages = 0;
 }
 
 
@@ -1054,6 +1066,32 @@ static void vmballoon_dequeue_page_list(struct vmballoon 
*b,
*n_pages = i;
 }
 
+/**
+ * vmballoon_split_refused_pages() - Split the 2MB refused pages to 4k.
+ *
+ * If inflation of 2MB pages was denied by the hypervisor, it is likely to be
+ * due to one or a few 4KB pages. These 2MB pages may keep being allocated and
+ * then being refused. To prevent this case, this function splits the refused
+ * pages into 4KB pages and adds them to the @prealloc_pages list.
+ *
+ * @ctl: pointer for the %struct vmballoon_ctl, which defines the operation.
+ */
+static void vmballoon_split_refused_pages(struct vmballoon_ctl *ctl)
+{
+   struct page *page, *tmp;
+   unsigned int i, order;
+
+   order = vmballoon_page_order(ctl->page_size);
+
+   list_for_each_entry_safe(page, tmp, &ctl->refused_pages, lru) {
+   list_del(&page->lru);
+   split_page(page, order);
+   for (i = 0; i < (1 << order); i++)
+   list_add(&page[i].lru, &ctl->prealloc_pages);
+   }
+   ctl->n_refused_pages = 0;
+}
+
 /**
  * vmballoon_inflate() - Inflate the balloon towards its target size.
  *
@@ -1065,6 +1103,7 @@ static void vmballoon_inflate(struct vmballoon *b)
struct vmballoon_ctl ctl = {
.pages = LIST_HEAD_INIT(ctl.pages),
.refused_pages = LIST_HEAD_INIT(ctl.refused_pages),
+   .prealloc_pages = LIST_HEAD_INIT(ctl.prealloc_pages),
.page_size = b->max_page_size,
.op = VMW_BALLOON_INFLATE
};
@@ -1112,10 +1151,10 @@ static void vmballoon_inflate(struct vmballoon *b)
break;
 
/*
-* Ignore errors from locking as we now switch to 4k
-* pages and we might get different errors.
+ 

[PATCH v3 0/4] vmw_balloon: compaction and shrinker support

2019-04-24 Thread Nadav Amit via Virtualization
VMware balloon enhancements: adding support for memory compaction,
memory shrinker (to prevent OOM) and splitting of refused pages to
prevent recurring inflations.

Patches 1-2: Support for compaction
Patch 3: Support for memory shrinker - disabled by default
Patch 4: Split refused pages to improve performance

v2->v3:
* Fixing wrong argument type (int->size_t) [Michael]
* Fixing a comment (it) [Michael]
* Reinstating the BUG_ON() when page is locked [Michael] 

v1->v2:
* Return number of pages in list enqueue/dequeue interfaces [Michael]
* Removed first two patches which were already merged

Nadav Amit (4):
  mm/balloon_compaction: list interfaces
  vmw_balloon: compaction support
  vmw_balloon: add memory shrinker
  vmw_balloon: split refused pages

 drivers/misc/Kconfig   |   1 +
 drivers/misc/vmw_balloon.c | 489 ++---
 include/linux/balloon_compaction.h |   4 +
 mm/balloon_compaction.c| 144 ++---
 4 files changed, 553 insertions(+), 85 deletions(-)

-- 
2.19.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 1/4] mm/balloon_compaction: list interfaces

2019-04-19 Thread Nadav Amit
> On Apr 8, 2019, at 10:35 AM, Nadav Amit  wrote:
> 
>> On Mar 27, 2019, at 6:07 PM, Nadav Amit  wrote:
>> 
>> Introduce interfaces for ballooning enqueueing and dequeueing of a list
>> of pages. These interfaces reduce the overhead of storing and restoring
>> IRQs by batching the operations. In addition they do not panic if the
>> list of pages is empty.
>> 
>> Cc: "Michael S. Tsirkin" 
> 
> Michael, may I ping for your ack?

Ping again?
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: Standardizing an MSR or other hypercall to get an RNG seed?

2014-09-19 Thread Nadav Amit

On Sep 19, 2014, at 9:42 PM, Andy Lutomirski l...@amacapital.net wrote:

> On Fri, Sep 19, 2014 at 11:30 AM, Christopher Covington
> c...@codeaurora.org wrote:
>> On 09/17/2014 10:50 PM, Andy Lutomirski wrote:
>>> Hi all-
>>>
>>> I would like to standardize on a very simple protocol by which a guest
>>> OS can obtain an RNG seed early in boot.
>>>
>>> The main design requirements are:
>>>
>>> - The interface should be very easy to use.  Linux, at least, will
>>> want to use it extremely early in boot as part of kernel ASLR.  This
>>> means that PCI and ACPI will not work.
>>
>> How do non-virtual systems get entropy this early? RDRAND/Padlock? Truerand?
>> Could hypervisors and simulators simply make sure these work?
>
> If RDRAND is available, then Linux, at least, will use it.  The rest
> are too complicated for early use.  Linux on x86 plays some vaguely
> clever games with rdtsc and poking at the i8254 port.
>
> I think that these tricks are even less useful as a guest than they
> are on metal, and we can use paravirt mechanisms to make guest early
> boot rngs much stronger.

Sorry for interrupting, as I understand the discussion tries to be generic.

However, it sounds to me that at least for KVM, it is very easy just to emulate
the RDRAND instruction. The hypervisor would report to the guest that RDRAND is
supported in CPUID and then emulate the instruction when the guest executes it.
KVM already traps guest #UD (which would occur if RDRAND is executed while it is
not supported) - so this scheme wouldn’t introduce additional overhead over RDMSR.
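
For reference, the guest-side cost of that scheme is just a CPUID feature
check plus the instruction itself; something along these lines
(illustrative only; try_rdrand() is a made-up name and cpuid() is assumed
to be the kernel's x86 helper):

	static unsigned long long try_rdrand(void)
	{
		unsigned int eax, ebx, ecx, edx;
		unsigned long long seed = 0;
		unsigned char ok = 0;

		cpuid(1, &eax, &ebx, &ecx, &edx);
		if (ecx & (1 << 30)) {	/* CPUID.01H:ECX bit 30 == RDRAND */
			asm volatile("rdrand %0; setc %1"
				     : "=r" (seed), "=qm" (ok));
		}

		return ok ? seed : 0;
	}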

Nadav
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization