Re: [PATCH 1/2] mm, treewide: Rename kzfree() to kfree_sensitive()

2020-04-13 Thread David Rientjes via Virtualization
On Mon, 13 Apr 2020, Waiman Long wrote:

> As said by Linus:
> 
>   A symmetric naming is only helpful if it implies symmetries in use.
>   Otherwise it's actively misleading.
> 
>   In "kzalloc()", the z is meaningful and an important part of what the
>   caller wants.
> 
>   In "kzfree()", the z is actively detrimental, because maybe in the
>   future we really _might_ want to use that "memfill(0xdeadbeef)" or
>   something. The "zero" part of the interface isn't even _relevant_.
> 
> The main reason that kzfree() exists is to clear sensitive information
> that should not be leaked to other future users of the same memory
> objects.
> 
> Rename kzfree() to kfree_sensitive() to follow the example of the
> recently added kvfree_sensitive() and make the intention of the API
> more explicit. In addition, memzero_explicit() is used to clear the
> memory to make sure that it won't get optimized away by the compiler.
> 
> The renaming is done by using the command sequence:
> 
>   git grep -w --name-only kzfree |\
>   xargs sed -i 's/\bkzfree\b/kfree_sensitive/'
> 
> followed by some editing of the kfree_sensitive() kerneldoc and the
> use of memzero_explicit() instead of memset().
> 
> Suggested-by: Joe Perches 
> Signed-off-by: Waiman Long 

Acked-by: David Rientjes 
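
For reference, the renamed helper keeps the shape of the old kzfree(), with
memset() replaced by memzero_explicit() so the clearing cannot be optimized
away. A rough sketch of that shape (not the exact tree code):

  void kfree_sensitive(const void *p)
  {
          size_t ks;
          void *mem = (void *)p;

          if (unlikely(ZERO_OR_NULL_PTR(mem)))
                  return;
          ks = ksize(mem);
          memzero_explicit(mem, ks);
          kfree(mem);
  }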


Re: [PATCH 62/70] x86/kvm: Add KVM specific VMMCALL handling under SEV-ES

2020-03-20 Thread David Rientjes via Virtualization
On Thu, 19 Mar 2020, Joerg Roedel wrote:

> From: Tom Lendacky 
> 
> Implement the callbacks to copy the processor state required by KVM to
> the GHCB.
> 
> Signed-off-by: Tom Lendacky 
> [ jroe...@suse.de: - Split out of a larger patch
>- Adapt to different callback functions ]
> Co-developed-by: Joerg Roedel 
> Signed-off-by: Joerg Roedel 
> ---
>  arch/x86/kernel/kvm.c | 35 +--
>  1 file changed, 29 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 6efe0410fb72..0e3fc798d719 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -34,6 +34,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  
>  static int kvmapf = 1;
>  
> @@ -729,13 +731,34 @@ static void __init kvm_init_platform(void)
>   x86_platform.apic_post_init = kvm_apic_init;
>  }
>  
> +#if defined(CONFIG_AMD_MEM_ENCRYPT)
> +static void kvm_sev_es_hcall_prepare(struct ghcb *ghcb, struct pt_regs *regs)
> +{
> + /* RAX and CPL are already in the GHCB */
> + ghcb_set_rbx(ghcb, regs->bx);
> + ghcb_set_rcx(ghcb, regs->cx);
> + ghcb_set_rdx(ghcb, regs->dx);
> + ghcb_set_rsi(ghcb, regs->si);

Is it possible to check the hypercall from RAX and only copy the needed 
regs or is there a requirement that they must all be copied 
unconditionally?
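
For illustration, a hypothetical shape of such a conditional copy — KVM puts
the hypercall number in RAX and the arguments in RBX/RCX/RDX/RSI; the cases
below are only examples, not a claim about what the patch should do:

  static void kvm_sev_es_hcall_prepare(struct ghcb *ghcb, struct pt_regs *regs)
  {
          /* Hypothetical: expose only what the hypercall in RAX consumes */
          switch (regs->ax) {
          case KVM_HC_KICK_CPU:           /* two arguments */
                  ghcb_set_rbx(ghcb, regs->bx);
                  ghcb_set_rcx(ghcb, regs->cx);
                  break;
          default:                        /* unknown call: copy everything */
                  ghcb_set_rbx(ghcb, regs->bx);
                  ghcb_set_rcx(ghcb, regs->cx);
                  ghcb_set_rdx(ghcb, regs->dx);
                  ghcb_set_rsi(ghcb, regs->si);
                  break;
          }
  }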

> +}
> +
> +static bool kvm_sev_es_hcall_finish(struct ghcb *ghcb, struct pt_regs *regs)
> +{
> + /* No checking of the return state needed */
> + return true;
> +}
> +#endif
> +
>  const __initconst struct hypervisor_x86 x86_hyper_kvm = {
> - .name   = "KVM",
> - .detect = kvm_detect,
> - .type   = X86_HYPER_KVM,
> - .init.guest_late_init   = kvm_guest_init,
> - .init.x2apic_available  = kvm_para_available,
> - .init.init_platform = kvm_init_platform,
> + .name   = "KVM",
> + .detect = kvm_detect,
> + .type   = X86_HYPER_KVM,
> + .init.guest_late_init   = kvm_guest_init,
> + .init.x2apic_available  = kvm_para_available,
> + .init.init_platform = kvm_init_platform,
> +#if defined(CONFIG_AMD_MEM_ENCRYPT)
> + .runtime.sev_es_hcall_prepare   = kvm_sev_es_hcall_prepare,
> + .runtime.sev_es_hcall_finish= kvm_sev_es_hcall_finish,
> +#endif
>  };
>  
>  static __init int activate_jump_labels(void)
> -- 
> 2.17.1
> 
> 


Re: [PATCH 18/70] x86/boot/compressed/64: Add stage1 #VC handler

2020-03-20 Thread David Rientjes via Virtualization
On Thu, 19 Mar 2020, Joerg Roedel wrote:

> diff --git a/arch/x86/include/asm/sev-es.h b/arch/x86/include/asm/sev-es.h
> new file mode 100644
> index ..f524b40aef07
> --- /dev/null
> +++ b/arch/x86/include/asm/sev-es.h
> @@ -0,0 +1,45 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * AMD Encrypted Register State Support
> + *
> + * Author: Joerg Roedel 
> + */
> +
> +#ifndef __ASM_ENCRYPTED_STATE_H
> +#define __ASM_ENCRYPTED_STATE_H
> +
> +#include 
> +
> +#define GHCB_SEV_CPUID_REQ   0x004UL
> +#define  GHCB_CPUID_REQ_EAX  0
> +#define  GHCB_CPUID_REQ_EBX  1
> +#define  GHCB_CPUID_REQ_ECX  2
> +#define  GHCB_CPUID_REQ_EDX  3
> +#define  GHCB_CPUID_REQ(fn, reg) (GHCB_SEV_CPUID_REQ | \
> + (((unsigned long)reg & 3) << 30) | \
> + (((unsigned long)fn) << 32))
> +
> +#define GHCB_SEV_CPUID_RESP  0x005UL
> +#define GHCB_SEV_TERMINATE   0x100UL
> +
> +#define  GHCB_SEV_GHCB_RESP_CODE(v)  ((v) & 0xfff)
> +#define  VMGEXIT()   { asm volatile("rep; vmmcall\n\r"); }

Since preemption and irqs should be disabled before updating the GHCB and 
its MSR and until the contents have been accessed following VMGEXIT, 
should there be checks in place to ensure that's always the case?
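
A minimal sketch of the kind of check being asked about, assuming it lived
next to the GHCB accessors in the runtime kernel rather than in this header:

  /* Hypothetical debug check: the GHCB/MSR sequence must not be interrupted */
  static inline void sev_es_ghcb_context_check(void)
  {
          lockdep_assert_irqs_disabled();
          WARN_ON_ONCE(preemptible());
  }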

> +
> +static inline u64 lower_bits(u64 val, unsigned int bits)
> +{
> + u64 mask = (1ULL << bits) - 1;
> +
> + return (val & mask);
> +}
> +
> +static inline u64 copy_lower_bits(u64 out, u64 in, unsigned int bits)
> +{
> + u64 mask = (1ULL << bits) - 1;
> +
> + out &= ~mask;
> + out |= lower_bits(in, bits);
> +
> + return out;
> +}
> +
> +#endif
> diff --git a/arch/x86/include/asm/trap_defs.h 
> b/arch/x86/include/asm/trap_defs.h
> index 488f82ac36da..af45d65f0458 100644
> --- a/arch/x86/include/asm/trap_defs.h
> +++ b/arch/x86/include/asm/trap_defs.h
> @@ -24,6 +24,7 @@ enum {
>   X86_TRAP_AC,/* 17, Alignment Check */
>   X86_TRAP_MC,/* 18, Machine Check */
>   X86_TRAP_XF,/* 19, SIMD Floating-Point Exception */
> + X86_TRAP_VC = 29,   /* 29, VMM Communication Exception */
>   X86_TRAP_IRET = 32, /* 32, IRET Exception */
>  };
>  
> diff --git a/arch/x86/kernel/sev-es-shared.c b/arch/x86/kernel/sev-es-shared.c
> new file mode 100644
> index ..e963b48d3e86
> --- /dev/null
> +++ b/arch/x86/kernel/sev-es-shared.c
> @@ -0,0 +1,65 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * AMD Encrypted Register State Support
> + *
> + * Author: Joerg Roedel 
> + *
> + * This file is not compiled stand-alone. It contains code shared
> + * between the pre-decompression boot code and the running Linux kernel
> + * and is included directly into both code-bases.
> + */
> +
> +/*
> + * Boot VC Handler - This is the first VC handler during boot, there is no GHCB
> + * page yet, so it only supports the MSR based communication with the
> + * hypervisor and only the CPUID exit-code.
> + */
> +void __init vc_no_ghcb_handler(struct pt_regs *regs, unsigned long exit_code)
> +{
> + unsigned int fn = lower_bits(regs->ax, 32);
> + unsigned long val;
> +
> + /* Only CPUID is supported via MSR protocol */
> + if (exit_code != SVM_EXIT_CPUID)
> + goto fail;
> +
> + sev_es_wr_ghcb_msr(GHCB_CPUID_REQ(fn, GHCB_CPUID_REQ_EAX));
> + VMGEXIT();
> + val = sev_es_rd_ghcb_msr();
> + if (GHCB_SEV_GHCB_RESP_CODE(val) != GHCB_SEV_CPUID_RESP)
> + goto fail;
> + regs->ax = val >> 32;
> +
> + sev_es_wr_ghcb_msr(GHCB_CPUID_REQ(fn, GHCB_CPUID_REQ_EBX));
> + VMGEXIT();
> + val = sev_es_rd_ghcb_msr();
> + if (GHCB_SEV_GHCB_RESP_CODE(val) != GHCB_SEV_CPUID_RESP)
> + goto fail;
> + regs->bx = val >> 32;
> +
> + sev_es_wr_ghcb_msr(GHCB_CPUID_REQ(fn, GHCB_CPUID_REQ_ECX));
> + VMGEXIT();
> + val = sev_es_rd_ghcb_msr();
> + if (GHCB_SEV_GHCB_RESP_CODE(val) != GHCB_SEV_CPUID_RESP)
> + goto fail;
> + regs->cx = val >> 32;
> +
> + sev_es_wr_ghcb_msr(GHCB_CPUID_REQ(fn, GHCB_CPUID_REQ_EDX));
> + VMGEXIT();
> + val = sev_es_rd_ghcb_msr();
> + if (GHCB_SEV_GHCB_RESP_CODE(val) != GHCB_SEV_CPUID_RESP)
> + goto fail;
> + regs->dx = val >> 32;
> +
> + regs->ip += 2;
> +
> + return;
> +
> +fail:
> + sev_es_wr_ghcb_msr(GHCB_SEV_TERMINATE);
> + VMGEXIT();
> +
> + /* Shouldn't get here - if we do halt the machine */
> + while (true)
> + asm volatile("hlt\n");
> +}
> -- 
> 2.17.1
> 
> 


Re: [PATCH 23/70] x86/sev-es: Add support for handling IOIO exceptions

2020-03-20 Thread David Rientjes via Virtualization
On Thu, 19 Mar 2020, Joerg Roedel wrote:

> From: Tom Lendacky 
> 
> Add support for decoding and handling #VC exceptions for IOIO events.
> 
> Signed-off-by: Tom Lendacky 
> [ jroe...@suse.de: Adapted code to #VC handling framework ]
> Co-developed-by: Joerg Roedel 
> Signed-off-by: Joerg Roedel 
> ---
>  arch/x86/boot/compressed/sev-es.c |  32 +
>  arch/x86/kernel/sev-es-shared.c   | 202 ++
>  2 files changed, 234 insertions(+)
> 
> diff --git a/arch/x86/boot/compressed/sev-es.c 
> b/arch/x86/boot/compressed/sev-es.c
> index 193c970a3379..ae5fbd371fd9 100644
> --- a/arch/x86/boot/compressed/sev-es.c
> +++ b/arch/x86/boot/compressed/sev-es.c
> @@ -18,6 +18,35 @@
>  struct ghcb boot_ghcb_page __aligned(PAGE_SIZE);
>  struct ghcb *boot_ghcb;
>  
> +/*
> + * Copy a version of this function here - insn-eval.c can't be used in
> + * pre-decompression code.
> + */
> +static bool insn_rep_prefix(struct insn *insn)
> +{
> + int i;
> +
> + insn_get_prefixes(insn);
> +
> + for (i = 0; i < insn->prefixes.nbytes; i++) {
> + insn_byte_t p = insn->prefixes.bytes[i];
> +
> + if (p == 0xf2 || p == 0xf3)
> + return true;
> + }
> +
> + return false;
> +}
> +
> +/*
> + * Only a dummy for insn_get_seg_base() - Early boot-code is 64bit only and
> + * doesn't use segments.
> + */
> +static unsigned long insn_get_seg_base(struct pt_regs *regs, int seg_reg_idx)
> +{
> + return 0UL;
> +}
> +
>  static inline u64 sev_es_rd_ghcb_msr(void)
>  {
>   unsigned long low, high;
> @@ -117,6 +146,9 @@ void boot_vc_handler(struct pt_regs *regs, unsigned long 
> exit_code)
>   goto finish;
>  
>   switch (exit_code) {
> + case SVM_EXIT_IOIO:
> + result = vc_handle_ioio(boot_ghcb, &ctxt);
> + break;
>   default:
>   result = ES_UNSUPPORTED;
>   break;
> diff --git a/arch/x86/kernel/sev-es-shared.c b/arch/x86/kernel/sev-es-shared.c
> index f0947ea3c601..46fc5318d1d7 100644
> --- a/arch/x86/kernel/sev-es-shared.c
> +++ b/arch/x86/kernel/sev-es-shared.c
> @@ -205,3 +205,205 @@ static enum es_result vc_insn_string_write(struct 
> es_em_ctxt *ctxt,
>  
>   return ret;
>  }
> +
> +#define IOIO_TYPE_STR  BIT(2)
> +#define IOIO_TYPE_IN   1
> +#define IOIO_TYPE_INS  (IOIO_TYPE_IN | IOIO_TYPE_STR)
> +#define IOIO_TYPE_OUT  0
> +#define IOIO_TYPE_OUTS (IOIO_TYPE_OUT | IOIO_TYPE_STR)
> +
> +#define IOIO_REP   BIT(3)
> +
> +#define IOIO_ADDR_64   BIT(9)
> +#define IOIO_ADDR_32   BIT(8)
> +#define IOIO_ADDR_16   BIT(7)
> +
> +#define IOIO_DATA_32   BIT(6)
> +#define IOIO_DATA_16   BIT(5)
> +#define IOIO_DATA_8    BIT(4)
> +
> +#define IOIO_SEG_ES    (0 << 10)
> +#define IOIO_SEG_DS    (3 << 10)
> +
> +static enum es_result vc_ioio_exitinfo(struct es_em_ctxt *ctxt, u64 
> *exitinfo)
> +{
> + struct insn *insn = &ctxt->insn;
> + *exitinfo = 0;
> +
> + switch (insn->opcode.bytes[0]) {
> + /* INS opcodes */
> + case 0x6c:
> + case 0x6d:
> + *exitinfo |= IOIO_TYPE_INS;
> + *exitinfo |= IOIO_SEG_ES;
> + *exitinfo |= (ctxt->regs->dx & 0xffff) << 16;
> + break;
> +
> + /* OUTS opcodes */
> + case 0x6e:
> + case 0x6f:
> + *exitinfo |= IOIO_TYPE_OUTS;
> + *exitinfo |= IOIO_SEG_DS;
> + *exitinfo |= (ctxt->regs->dx & 0xffff) << 16;
> + break;
> +
> + /* IN immediate opcodes */
> + case 0xe4:
> + case 0xe5:
> + *exitinfo |= IOIO_TYPE_IN;
> + *exitinfo |= insn->immediate.value << 16;
> + break;
> +
> + /* OUT immediate opcodes */
> + case 0xe6:
> + case 0xe7:
> + *exitinfo |= IOIO_TYPE_OUT;
> + *exitinfo |= insn->immediate.value << 16;
> + break;
> +
> + /* IN register opcodes */
> + case 0xec:
> + case 0xed:
> + *exitinfo |= IOIO_TYPE_IN;
> + *exitinfo |= (ctxt->regs->dx & 0xffff) << 16;
> + break;
> +
> + /* OUT register opcodes */
> + case 0xee:
> + case 0xef:
> + *exitinfo |= IOIO_TYPE_OUT;
> + *exitinfo |= (ctxt->regs->dx & 0xffff) << 16;
> + break;
> +
> + default:
> + return ES_DECODE_FAILED;
> + }
> +
> + switch (insn->opcode.bytes[0]) {
> + case 0x6c:
> + case 0x6e:
> + case 0xe4:
> + case 0xe6:
> + case 0xec:
> + case 0xee:
> + /* Single byte opcodes */
> + *exitinfo |= IOIO_DATA_8;
> + break;
> + default:
> + /* Length determined by instruction parsing */
> + *exitinfo |= (insn->opnd_bytes == 2) ? IOIO_DATA_16
> +  : IOIO_DATA_32;
> + }
> + switch (insn->addr_bytes) {
> + case 2:
> + *exitinfo |= IOIO_ADDR_16;
> + break;
> + case 4:
> + *exitinfo |= IOIO_ADDR_32;

Re: [PATCH 21/70] x86/boot/compressed/64: Add function to map a page unencrypted

2020-03-20 Thread David Rientjes via Virtualization
On Thu, 19 Mar 2020, Joerg Roedel wrote:

> From: Joerg Roedel 
> 
> This function is needed to map the GHCB for SEV-ES guests. The GHCB is
> used for communication with the hypervisor, so its content must not be
> encrypted.
> 
> Signed-off-by: Joerg Roedel 
> ---
>  arch/x86/boot/compressed/ident_map_64.c | 125 
>  arch/x86/boot/compressed/misc.h |   1 +
>  2 files changed, 126 insertions(+)
> 
> diff --git a/arch/x86/boot/compressed/ident_map_64.c 
> b/arch/x86/boot/compressed/ident_map_64.c
> index feb180cced28..04a5ff4bda66 100644
> --- a/arch/x86/boot/compressed/ident_map_64.c
> +++ b/arch/x86/boot/compressed/ident_map_64.c
> @@ -26,6 +26,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  /* Use the static base for this part of the boot process */
>  #undef __PAGE_OFFSET
>  #define __PAGE_OFFSET __PAGE_OFFSET_BASE
> @@ -157,6 +158,130 @@ void initialize_identity_maps(void)
>   write_cr3(top_level_pgt);
>  }
>  
> +static pte_t *split_large_pmd(struct x86_mapping_info *info,
> +   pmd_t *pmdp, unsigned long __address)
> +{
> + unsigned long page_flags;
> + unsigned long address;
> + pte_t *pte;
> + pmd_t pmd;
> + int i;
> +
> + pte = (pte_t *)info->alloc_pgt_page(info->context);
> + if (!pte)
> + return NULL;
> +
> + address = __address & PMD_MASK;
> + /* No large page - clear PSE flag */
> + page_flags  = info->page_flag & ~_PAGE_PSE;
> +
> + /* Populate the PTEs */
> + for (i = 0; i < PTRS_PER_PMD; i++) {
> + set_pte(&pte[i], __pte(address | page_flags));
> + address += PAGE_SIZE;
> + }
> +
> + /*
> +  * Ideally we need to clear the large PMD first and do a TLB
> +  * flush before we write the new PMD. But the 2M range of the
> +  * PMD might contain the code we execute and/or the stack
> +  * we are on, so we can't do that. But that should be safe here
> +  * because we are going from large to small mappings and we are
> +  * also the only user of the page-table, so there is no chance
> +  * of a TLB multihit.
> +  */
> + pmd = __pmd((unsigned long)pte | info->kernpg_flag);
> + set_pmd(pmdp, pmd);
> + /* Flush TLB to establish the new PMD */
> + write_cr3(top_level_pgt);
> +
> + return pte + pte_index(__address);
> +}
> +
> +static void clflush_page(unsigned long address)
> +{
> + unsigned int flush_size;
> + char *cl, *start, *end;
> +
> + /*
> +  * Hardcode cl-size to 64 - CPUID can't be used here because that might
> +  * cause another #VC exception and the GHCB is not ready to use yet.
> +  */
> + flush_size = 64;
> + start  = (char *)(address & PAGE_MASK);
> + end    = start + PAGE_SIZE;
> +
> + /*
> +  * First make sure there are no pending writes on the cache-lines to
> +  * flush.
> +  */
> + asm volatile("mfence" : : : "memory");
> +
> + for (cl = start; cl != end; cl += flush_size)
> + clflush(cl);
> +}
> +
> +static int __set_page_decrypted(struct x86_mapping_info *info,
> + unsigned long address)
> +{
> + unsigned long scratch, *target;
> + pgd_t *pgdp = (pgd_t *)top_level_pgt;
> + p4d_t *p4dp;
> + pud_t *pudp;
> + pmd_t *pmdp;
> + pte_t *ptep, pte;
> +
> + /*
> +  * First make sure there is a PMD mapping for 'address'.
> +  * It should already exist, but keep things generic.
> +  *
> +  * To map the page just read from it and fault it in if there is no
> +  * mapping yet. add_identity_map() can't be called here because that
> +  * would unconditionally map the address on PMD level, destroying any
> +  * PTE-level mappings that might already exist.  Also do something
> +  * useless with 'scratch' so the access won't be optimized away.
> +  */
> + target = (unsigned long *)address;
> + scratch = *target;
> + arch_cmpxchg(target, scratch, scratch);
> +
> + /*
> +  * The page is mapped at least with PMD size - so skip checks and walk
> +  * directly to the PMD.
> +  */
> + p4dp = p4d_offset(pgdp, address);
> + pudp = pud_offset(p4dp, address);
> + pmdp = pmd_offset(pudp, address);
> +
> + if (pmd_large(*pmdp))
> + ptep = split_large_pmd(info, pmdp, address);
> + else
> + ptep = pte_offset_kernel(pmdp, address);
> +
> + if (!ptep)
> + return -ENOMEM;
> +
> + /* Clear encryption flag and write new pte */
> + pte = pte_clear_flags(*ptep, _PAGE_ENC);
> + set_pte(ptep, pte);
> +
> + /* Flush TLB to map the page unencrypted */
> + write_cr3(top_level_pgt);
> +

Is there a guarantee that this flushes the TLB if cr3 == top_level_pgt
already, without an invlpg?
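
If that guarantee does not hold, a hypothetical alternative would be a
targeted flush of just the remapped page:

  /* Hypothetical helper: flush only the translation for 'address' */
  static void flush_tlb_addr(unsigned long address)
  {
          asm volatile("invlpg (%0)" :: "r" (address) : "memory");
  }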

> + /*
> +  * Changing encryption attributes of a page requires to flush it from
> +  * the caches.
> +  */
> + clflush_page(address);

Re: [PATCH v3] virtio-balloon: Switch back to OOM handler for VIRTIO_BALLOON_F_DEFLATE_ON_OOM

2020-03-10 Thread David Rientjes via Virtualization
On Tue, 10 Mar 2020, David Hildenbrand wrote:

> Commit 71994620bb25 ("virtio_balloon: replace oom notifier with shrinker")
> changed the behavior when deflation happens automatically. Instead of
> deflating when called by the OOM handler, the shrinker is used.
> 
> However, the balloon is not simply some other slab cache that should be
> shrunk when under memory pressure. The shrinker does not have a concept of
> priorities yet, so this behavior cannot be configured. Eventually once
> that is in place, we might want to switch back after doing proper
> testing.
> 
> There was a report that this results in undesired side effects when
> inflating the balloon to shrink the page cache. [1]
>   "When inflating the balloon against page cache (i.e. no free memory
>remains) vmscan.c will both shrink page cache, but also invoke the
>shrinkers -- including the balloon's shrinker. So the balloon
>driver allocates memory which requires reclaim, vmscan gets this
>memory by shrinking the balloon, and then the driver adds the
>memory back to the balloon. Basically a busy no-op."
> 
> The name "deflate on OOM" makes it pretty clear when deflation should
> happen - after other approaches to reclaim memory failed, not while
> reclaiming. This allows to minimize the footprint of a guest - memory
> will only be taken out of the balloon when really needed.
> 
> Keep using the shrinker for VIRTIO_BALLOON_F_FREE_PAGE_HINT, because
> this has no such side effects. Always register the shrinker with
> VIRTIO_BALLOON_F_FREE_PAGE_HINT now. We are always allowed to reuse free
> pages that are still to be processed by the guest. The hypervisor takes
> care of identifying and resolving possible races between processing a
> hinting request and the guest reusing a page.
> 
> In contrast to pre commit 71994620bb25 ("virtio_balloon: replace oom
> notifier with shrinker"), don't add a moodule parameter to configure the
> number of pages to deflate on OOM. Can be re-added if really needed.
> Also, pay attention that leak_balloon() returns the number of 4k pages -
> convert it properly in virtio_balloon_oom_notify().
> 
> Testing done by Tyler for future reference:
>   Test setup: VM with 16 CPU, 64GB RAM. Running Debian 10. We have a 42
>   GB file full of random bytes that we continually cat to /dev/null.
>   This fills the page cache as the file is read. Meanwhile we trigger
>   the balloon to inflate, with a target size of 53 GB. This setup causes
>   the balloon inflation to pressure the page cache as the page cache is
>   also trying to grow. Afterwards we shrink the balloon back to zero (so
>   total deflate = total inflate).
> 
>   Without patch (kernel 4.19.0-5):
>   Inflation never reaches the target until we stop the "cat file >
>   /dev/null" process. Total inflation time was 542 seconds. The longest
>   period that made no net forward progress was 315 seconds (see attached
>   graph).
>   Result of "grep balloon /proc/vmstat" after the test:
>   balloon_inflate 154828377
>   balloon_deflate 154828377
> 
>   With patch (kernel 5.6.0-rc4+):
>   Total inflation duration was 63 seconds. No deflate-queue activity
>   occurs when pressuring the page-cache.
>   Result of "grep balloon /proc/vmstat" after the test:
>   balloon_inflate 12968539
>   balloon_deflate 12968539
> 
>   Conclusion: This patch fixes the issue. In the test it reduced
>   inflate/deflate activity by 12x, and reduced inflation time by 8.6x.
>   But more importantly, if we hadn't killed the "cat file > /dev/null"
>   process then, without the patch, the inflation process
>   would never reach the target.
> 
> [1] https://www.spinics.net/lists/linux-virtualization/msg40863.html
> 
> Reported-by: Tyler Sanderson 
> Tested-by: Tyler Sanderson 
> Fixes: 71994620bb25 ("virtio_balloon: replace oom notifier with shrinker")
> Cc: Michael S. Tsirkin 
> Cc: Wei Wang 
> Cc: Alexander Duyck 
> Cc: David Rientjes 
> Cc: Nadav Amit 
> Cc: Michal Hocko 
> Cc: Andrew Morton 
> Signed-off-by: David Hildenbrand 

Acked-by: David Rientjes 

Should this have:

Cc: sta...@vger.kernel.org # 4.19+

?
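
As a reference for the leak_balloon() unit conversion called out in the
commit message, a rough sketch of what the OOM notifier callback could look
like — the batch-size constant and field names here are illustrative, not
necessarily the patch as merged:

  #define OOM_NR_BALLOON_PAGES 512        /* illustrative batch size */

  static int virtio_balloon_oom_notify(struct notifier_block *nb,
                                       unsigned long dummy, void *parm)
  {
          struct virtio_balloon *vb = container_of(nb, struct virtio_balloon,
                                                   oom_nb);
          unsigned long *freed = parm;

          /* leak_balloon() returns 4k pages; convert to PAGE_SIZE pages */
          *freed += leak_balloon(vb, OOM_NR_BALLOON_PAGES) /
                    VIRTIO_BALLOON_PAGES_PER_PAGE;
          update_balloon_size(vb);

          return NOTIFY_OK;
  }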


Re: Balloon pressuring page cache

2020-02-01 Thread David Rientjes via Virtualization
On Wed, 29 Jan 2020, Tyler Sanderson wrote:

> > > A primary advantage of virtio balloon over other memory reclaim
> > > mechanisms is that it can pressure the guest's page cache into shrinking.
> > >
> > > However, since the balloon driver changed to using the shrinker API
> > > <
> > https://github.com/torvalds/linux/commit/71994620bb25a8b109388fefa9e99a28e355255a#diff-fd202acf694d9eba19c8c64da3e480c9
> > > this
> > > use case has become a bit more tricky. I'm wondering what the intended
> > > device implementation is.
> > >
> > > When inflating the balloon against page cache (i.e. no free memory
> > > remains) vmscan.c will both shrink page cache, but also invoke the
> > > shrinkers -- including the balloon's shrinker. So the balloon driver
> > > allocates memory which requires reclaim, vmscan gets this memory by
> > > shrinking the balloon, and then the driver adds the memory back to the
> > > balloon. Basically a busy no-op.
> > >
> > > If file IO is ongoing during this balloon inflation then the page cache
> > > could be growing which further puts "back pressure" on the balloon
> > > trying to inflate. In testing I've seen periods of > 45 seconds where
> > > balloon inflation makes no net forward progress.
> > >
> > > This wasn't a problem before the change to the shrinker API since forced
> > > balloon deflation only occurred via the OOM notifier callback which was
> > > invoked only after the page cache had depleted.
> > >
> > > Is this new busy behavior working as intended?
> >
> > Please note that the shrinker will only be registered in case we have
> > VIRTIO_BALLOON_F_DEFLATE_ON_OOM - (which is AFAIK very rare) - to
> > implement automatic balloon deflation when the guest is under memory
> > pressure.
> 
> 
> > Are you actually experiencing issues with that or did you just stumble
> > over the code?
> >
> 
> We have a use case that is encountering this (and that registers
> DEFLATE_ON_OOM). We can work around this, but it does seem inefficient.
> I understand there were good reasons for moving away from the OOM notifier
> callback, but I'm wondering if the balloon driver could specify a "nice"
> level to the shrinker API that would cause it to be reclaimed from only as
> a last resort?
> 

Shrinkers aren't priority based, so I don't think there's a non-hacky way
to avoid reclaiming from them; a shrinker isn't the correct way to implement
a reclaim callback unless its objects can be freed as fairly as those of
other shrinkers.  The OOM notifier callbacks are the canonical way for
global reclaim to free memory only as a last resort.
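
For context, a minimal sketch of wiring up that last-resort path with the
kernel's OOM notifier API instead of a shrinker, assuming a callback like the
one sketched above (the helper name is hypothetical; in the driver this would
happen during probe):

  static int balloon_register_oom_notifier(struct virtio_balloon *vb)
  {
          vb->oom_nb.notifier_call = virtio_balloon_oom_notify;
          vb->oom_nb.priority = 0;        /* illustrative priority */
          return register_oom_notifier(&vb->oom_nb);
  }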