[GIT PULL] Please pull powerpc/linux.git powerpc-6.7-1 tag

2023-11-02 Thread Michael Ellerman
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Hi Linus,

Please pull powerpc updates for 6.7.

There's one conflict, where we added Documentation/powerpc/kvm-nested.rst while
upstream moved Documentation/powerpc to Documentation/arch/powerpc. Resolution
is just to add it in the new location.

cheers


The following changes since commit ce9ecca0238b140b88f43859b211c9fdfd8e5b70:

  Linux 6.6-rc2 (2023-09-17 14:40:24 -0700)

are available in the git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
tags/powerpc-6.7-1

for you to fetch changes up to 303d77a6e1707498f09c9d8ee91b1dc07ca315a5:

  Merge branch 'topic/ppc-kvm' into next (2023-10-27 20:58:03 +1100)

----------------------------------------------------------------
powerpc updates for 6.7

 - Add support for KVM running as a nested hypervisor under development versions
   of PowerVM, using the new PAPR nested virtualisation API.

 - Add support for the BPF prog pack allocator.

 - A rework of the non-server MMU handling to support execute-only on all
   platforms.

 - Some optimisations & cleanups for the powerpc qspinlock code.

 - Various other small features and fixes.

Thanks to: Aboorva Devarajan, Aditya Gupta, Amit Machhiwal, Benjamin Gray,
Christophe Leroy, Dr. David Alan Gilbert, Gaurav Batra, Gautam Menghani, Geert
Uytterhoeven, Haren Myneni, Hari Bathini, Joel Stanley, Jordan Niethe, Julia
Lawall, Kautuk Consul, Kuan-Wei Chiu, Michael Neuling, Minjie Du, Muhammad
Muzammil, Naveen N Rao, Nicholas Piggin, Nick Child, Nysal Jan K.A, Peter
Lafreniere, Rob Herring, Sachin Sant, Sebastian Andrzej Siewior, Shrikanth
Hegde, Srikar Dronamraju, Stanislav Kinsburskii, Vaibhav Jain, Wang Yufen, Yang
Yingliang, Yuan Tan.

----------------------------------------------------------------
Aditya Gupta (2):
  powerpc: add `cur_cpu_spec` symbol to vmcoreinfo
  powerpc/vmcore: Add MMU information to vmcoreinfo

Benjamin Gray (12):
  powerpc/configs: Set more PPC debug configs
  powerpc/xive: Fix endian conversion size
  powerpc: Explicitly reverse bytes when checking for byte reversal
  powerpc: Use NULL instead of 0 for null pointers
  powerpc: Remove extern from function implementations
  powerpc: Annotate endianness of various variables and functions
  powerpc/kvm: Force cast endianness of KVM shared regs
  powerpc/opal: Annotate out param endianness
  powerpc/uaccess: Cast away __user annotation after verification
  powerpc: Cast away __iomem in low level IO routines
  powerpc/eeh: Remove unnecessary cast
  powerpc/fadump: Annotate endianness cast with __force

Christophe Leroy (38):
  powerpc: Only define __parse_fpscr() when required
  powerpc/40x: Remove stale PTE_ATOMIC_UPDATES macro
  powerpc: Remove pte_ERROR()
  powerpc: Deduplicate prototypes of ptep_set_access_flags() and phys_mem_access_prot()
  powerpc: Refactor update_mmu_cache_range()
  powerpc: Untangle fixmap.h and pgtable.h and mmu.h
  powerpc/nohash: Remove {pte/pmd}_protnone()
  powerpc/nohash: Refactor declaration of {map/unmap}_kernel_page()
  powerpc/nohash: Move 8xx version of pte_update() into pte-8xx.h
  powerpc/nohash: Replace #ifdef CONFIG_44x by IS_ENABLED(CONFIG_44x) in pgtable.h
  powerpc/nohash: Refactor pte_update()
  powerpc/nohash: Refactor checking of no-change in pte_update()
  powerpc/nohash: Deduplicate _PAGE_CHG_MASK
  powerpc/nohash: Deduplicate pte helpers
  powerpc/nohash: Refactor ptep_test_and_clear_young()
  powerpc/nohash: Deduplicate ptep_set_wrprotect() and ptep_get_and_clear()
  powerpc/nohash: Refactor pte_clear()
  powerpc/nohash: Refactor __ptep_set_access_flags()
  powerpc/e500: Simplify pte_mkexec()
  powerpc: Implement and use pgprot_nx()
  powerpc: Fail ioremap() instead of silently ignoring flags when PAGE_USER is set
  powerpc: Remove pte_mkuser() and pte_mkpriviledged()
  powerpc: Rely on address instead of pte_user()
  powerpc: Refactor permission masks used for __P/__S table and kernel memory flags
  powerpc/8xx: Use generic permission masks
  powerpc/64s: Use generic permission masks
  powerpc/nohash: Add _PAGE_WRITE to supplement _PAGE_RW
  powerpc/nohash: Replace pte_user() by pte_read()
  powerpc/e500: Introduce _PAGE_READ and remove _PAGE_USER
  powerpc/44x: Introduce _PAGE_READ and remove _PAGE_USER
  powerpc/40x: Introduce _PAGE_READ and remove _PAGE_USER
  powerpc/32s: Add _PAGE_WRITE to supplement _PAGE_RW
  powerpc/32s: Introduce _PAGE_READ and remove _PAGE_USER
  powerpc/ptdump: Display _PAGE_READ and _PAGE_WRITE
  powerpc: Finally remove _PAGE_USER
  powerpc: Support execute-only on all powerpc
  powerpc: Remove cpm_dp...() macros
  powerpc/code-patching: Perform hwsync in __patch_instruction() in case of failure

Dr. David Alan Gilbert (1):
  powerpc: Use shared font data

Gaura

Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace

2023-11-02 Thread Xu Yilun
On Fri, Oct 27, 2023 at 11:21:51AM -0700, Sean Christopherson wrote:
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6723,6 +6723,26 @@ array field represents return values. The userspace 
> should update the return
>  values of SBI call before resuming the VCPU. For more details on RISC-V SBI
>  spec refer, https://github.com/riscv/riscv-sbi-doc.
>  
> +::
> +
> + /* KVM_EXIT_MEMORY_FAULT */
> + struct {
> + __u64 flags;
> + __u64 gpa;
> + __u64 size;
> + } memory;
  ^

Should update to "memory_fault" to align with other places.
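
That is, the api.rst hunk would then read, matching the uapi header quoted
below:

	/* KVM_EXIT_MEMORY_FAULT */
	struct {
		__u64 flags;
		__u64 gpa;
		__u64 size;
	} memory_fault;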

[...]

> @@ -520,6 +521,12 @@ struct kvm_run {
>  #define KVM_NOTIFY_CONTEXT_INVALID   (1 << 0)
>   __u32 flags;
>   } notify;
> + /* KVM_EXIT_MEMORY_FAULT */
> + struct {
> + __u64 flags;
> + __u64 gpa;
> + __u64 size;
> + } memory_fault;
>   /* Fix the size of the union. */
>   char padding[256];
>   };

Thanks,
Yilun

> 


Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace

2023-11-02 Thread Huang, Kai
On Thu, 2023-11-02 at 08:44 -0700, Sean Christopherson wrote:
> On Thu, Nov 02, 2023, Paolo Bonzini wrote:
> > On 11/2/23 10:35, Huang, Kai wrote:
> > > IIUC KVM can already handle the case of poisoned
> > > page by sending signal to user app:
> > > 
> > >   static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu,
> > >                                   struct kvm_page_fault *fault)
> > >   {
> > >           ...
> > > 
> > >           if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
> > >                   kvm_send_hwpoison_signal(fault->slot, fault->gfn);
> 
> No, this doesn't work, because that signals the host virtual address

Ah, right :-)


Re: [PATCH v3 4/4] PCI: layerscape: Add suspend/resume for ls1043a

2023-11-02 Thread Frank Li
On Thu, Nov 02, 2023 at 11:09:00PM +0530, Manivannan Sadhasivam wrote:
> On Tue, Oct 17, 2023 at 03:31:45PM -0400, Frank Li wrote:
> > Add suspend/resume support for ls1043a.
> > Implement ls1043a_pcie_send_turnoff_msg() to send the PME_Turn_Off message.
> > Implement ls1043a_pcie_exit_from_l2() to exit from the L2 state.
> > 
> 
> Please use the suggestion I gave in patch 2/4.
> 
> > Signed-off-by: Frank Li 
> > ---
> > 
> > Notes:
> > Change from v2 to v3
> > - Remove ls_pcie_lut_readl(writel) function
> > 
> > Change from v1 to v2
> > - Update subject 'a' to 'A'
> > 
> >  drivers/pci/controller/dwc/pci-layerscape.c | 86 -
> >  1 file changed, 85 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/pci/controller/dwc/pci-layerscape.c 
> > b/drivers/pci/controller/dwc/pci-layerscape.c
> > index 4b663b20d8612..9656224960b0c 100644
> > --- a/drivers/pci/controller/dwc/pci-layerscape.c
> > +++ b/drivers/pci/controller/dwc/pci-layerscape.c
> > @@ -41,6 +41,15 @@
> >  #define SCFG_PEXSFTRSTCR   0x190
> >  #define PEXSR(idx) BIT(idx)
> >  
> > +/* LS1043A PEX PME control register */
> > +#define SCFG_PEXPMECR  0x144
> > +#define PEXPME(idx)BIT(31 - (idx) * 4)
> > +
> > +/* LS1043A PEX LUT debug register */
> > +#define LS_PCIE_LDBG   0x7fc
> > +#define LDBG_SRBIT(30)
> > +#define LDBG_WEBIT(31)
> > +
> >  #define PCIE_IATU_NUM  6
> >  
> >  #define LS_PCIE_DRV_SCFG   BIT(0)
> > @@ -227,6 +236,68 @@ static int ls1021a_pcie_exit_from_l2(struct dw_pcie_rp 
> > *pp)
> > return 0;
> >  }
> >  
> > +static void ls1043a_pcie_send_turnoff_msg(struct dw_pcie_rp *pp)
> > +{
> > +   struct dw_pcie *pci = to_dw_pcie_from_pp(pp);
> > +   struct ls_pcie *pcie = to_ls_pcie(pci);
> > +   u32 val;
> > +
> > +   if (!pcie->scfg) {
> > +   dev_dbg(pcie->pci->dev, "SYSCFG is NULL\n");
> > +   return;
> > +   }
> 
> Why is scfg optional for this SoC and not for the other one added in patch
> 2/4?

No, it is not optional for this SoC. This check can be removed, per your
previous comments on patch 2/4.

> 
> > +
> > +   /* Send Turn_off message */
> > +   regmap_read(pcie->scfg, SCFG_PEXPMECR, &val);
> > +   val |= PEXPME(pcie->index);
> > +   regmap_write(pcie->scfg, SCFG_PEXPMECR, val);
> > +
> 
> In my previous review, I asked you to use a common function and just pass the
> offsets, as the sequence is same for both the SoCs. But you ignored it :/
> 

Sorry, I will fix it in the next version.

> > +   /*
> > +* There is no specific register to check for PME_To_Ack from endpoint.
> > +* So on the safe side, wait for PCIE_PME_TO_L2_TIMEOUT_US.
> > +*/
> > +   mdelay(PCIE_PME_TO_L2_TIMEOUT_US/1000);
> > +
> > +   /*
> > +* Layerscape hardware reference manual recommends clearing the 
> > PMXMTTURNOFF bit
> > +* to complete the PME_Turn_Off handshake.
> > +*/
> > +   regmap_read(pcie->scfg, SCFG_PEXPMECR, &val);
> > +   val &= ~PEXPME(pcie->index);
> > +   regmap_write(pcie->scfg, SCFG_PEXPMECR, val);
> > +}
> > +
> > +static int ls1043a_pcie_exit_from_l2(struct dw_pcie_rp *pp)
> > +{
> > +   struct dw_pcie *pci = to_dw_pcie_from_pp(pp);
> > +   struct ls_pcie *pcie = to_ls_pcie(pci);
> > +   u32 val;
> > +
> > +   /*
> > +    * The only way to let the PEX module exit L2 is to do a software reset.
> 
> Same comment applies as patch 2/4.
> 
> - Mani
> 
> -- 
> மணிவண்ணன் சதாசிவம்
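
A sketch of the common helper requested above -- the shared PME_Turn_Off
sequence parameterized by the SoC-specific scfg register and bit. The helper
name and exact shape are hypothetical, not from the posted series:

	static void ls_pcie_scfg_send_turnoff_msg(struct ls_pcie *pcie, u32 reg, u32 bit)
	{
		u32 val;

		/* Assert the SoC-specific PME_Turn_Off control bit */
		regmap_read(pcie->scfg, reg, &val);
		val |= bit;
		regmap_write(pcie->scfg, reg, val);

		/* No PME_To_Ack status register, so wait out the timeout */
		mdelay(PCIE_PME_TO_L2_TIMEOUT_US / 1000);

		/* Clear the bit to complete the PME_Turn_Off handshake */
		regmap_read(pcie->scfg, reg, &val);
		val &= ~bit;
		regmap_write(pcie->scfg, reg, val);
	}

Each per-SoC callback would then reduce to a one-line wrapper, e.g.:

	static void ls1043a_pcie_send_turnoff_msg(struct dw_pcie_rp *pp)
	{
		struct dw_pcie *pci = to_dw_pcie_from_pp(pp);
		struct ls_pcie *pcie = to_ls_pcie(pci);

		ls_pcie_scfg_send_turnoff_msg(pcie, SCFG_PEXPMECR, PEXPME(pcie->index));
	}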


Re: [PATCH v3 3/4] PCI: layerscape: Rename pf_* as pf_lut_*

2023-11-02 Thread Frank Li
On Thu, Nov 02, 2023 at 11:03:14PM +0530, Manivannan Sadhasivam wrote:
> On Tue, Oct 17, 2023 at 03:31:44PM -0400, Frank Li wrote:
> > 'pf' and 'lut' are just different names in different chips, but basically
> > it is an MMIO base address plus an offset.
> > 
> > Rename it to avoid duplicating pf_* and lut_* in the driver.
> > 
> 
> "pci-layerscape-ep.c" uses "ls_lut_" prefix and now you are using "pf_lut_". 
> May
> I know the difference between these two? Can we just use a common name?

Some chips use the name 'lut', some use 'pf'. I think ls_pcie_pf_lut_*() is a
better name than 'ls_lut_' in pci-layerscape-ep.c, to align with the spec.

If needed, I can rename "ls_lut_" in "pci-layerscape-ep.c" later.

Frank

> 
> - Mani
> 
> > Signed-off-by: Frank Li 
> > ---
> > 
> > Notes:
> > change from v1 to v3
> > - new patch at v3
> > 
> >  drivers/pci/controller/dwc/pci-layerscape.c | 34 ++---
> >  1 file changed, 17 insertions(+), 17 deletions(-)
> > 
> > diff --git a/drivers/pci/controller/dwc/pci-layerscape.c 
> > b/drivers/pci/controller/dwc/pci-layerscape.c
> > index 6f47cfe146c44..4b663b20d8612 100644
> > --- a/drivers/pci/controller/dwc/pci-layerscape.c
> > +++ b/drivers/pci/controller/dwc/pci-layerscape.c
> > @@ -46,7 +46,7 @@
> >  #define LS_PCIE_DRV_SCFG   BIT(0)
> >  
> >  struct ls_pcie_drvdata {
> > -   const u32 pf_off;
> > +   const u32 pf_lut_off;
> > const struct dw_pcie_host_ops *ops;
> > int (*exit_from_l2)(struct dw_pcie_rp *pp);
> > int flags;
> > @@ -56,13 +56,13 @@ struct ls_pcie_drvdata {
> >  struct ls_pcie {
> > struct dw_pcie *pci;
> > const struct ls_pcie_drvdata *drvdata;
> > -   void __iomem *pf_base;
> > +   void __iomem *pf_lut_base;
> > struct regmap *scfg;
> > int index;
> > bool big_endian;
> >  };
> >  
> > -#define ls_pcie_pf_readl_addr(addr)ls_pcie_pf_readl(pcie, addr)
> > +#define ls_pcie_pf_lut_readl_addr(addr)ls_pcie_pf_lut_readl(pcie, addr)
> >  #define to_ls_pcie(x)  dev_get_drvdata((x)->dev)
> >  
> >  static bool ls_pcie_is_bridge(struct ls_pcie *pcie)
> > @@ -103,20 +103,20 @@ static void ls_pcie_fix_error_response(struct ls_pcie 
> > *pcie)
> > iowrite32(PCIE_ABSERR_SETTING, pci->dbi_base + PCIE_ABSERR);
> >  }
> >  
> > -static u32 ls_pcie_pf_readl(struct ls_pcie *pcie, u32 off)
> > +static u32 ls_pcie_pf_lut_readl(struct ls_pcie *pcie, u32 off)
> >  {
> > if (pcie->big_endian)
> > -   return ioread32be(pcie->pf_base + off);
> > +   return ioread32be(pcie->pf_lut_base + off);
> >  
> > -   return ioread32(pcie->pf_base + off);
> > +   return ioread32(pcie->pf_lut_base + off);
> >  }
> >  
> > -static void ls_pcie_pf_writel(struct ls_pcie *pcie, u32 off, u32 val)
> > +static void ls_pcie_pf_lut_writel(struct ls_pcie *pcie, u32 off, u32 val)
> >  {
> > if (pcie->big_endian)
> > -   iowrite32be(val, pcie->pf_base + off);
> > +   iowrite32be(val, pcie->pf_lut_base + off);
> > else
> > -   iowrite32(val, pcie->pf_base + off);
> > +   iowrite32(val, pcie->pf_lut_base + off);
> >  }
> >  
> >  static void ls_pcie_send_turnoff_msg(struct dw_pcie_rp *pp)
> > @@ -126,11 +126,11 @@ static void ls_pcie_send_turnoff_msg(struct 
> > dw_pcie_rp *pp)
> > u32 val;
> > int ret;
> >  
> > -   val = ls_pcie_pf_readl(pcie, LS_PCIE_PF_MCR);
> > +   val = ls_pcie_pf_lut_readl(pcie, LS_PCIE_PF_MCR);
> > val |= PF_MCR_PTOMR;
> > -   ls_pcie_pf_writel(pcie, LS_PCIE_PF_MCR, val);
> > +   ls_pcie_pf_lut_writel(pcie, LS_PCIE_PF_MCR, val);
> >  
> > -   ret = readx_poll_timeout(ls_pcie_pf_readl_addr, LS_PCIE_PF_MCR,
> > +   ret = readx_poll_timeout(ls_pcie_pf_lut_readl_addr, LS_PCIE_PF_MCR,
> >  val, !(val & PF_MCR_PTOMR),
> >  PCIE_PME_TO_L2_TIMEOUT_US/10,
> >  PCIE_PME_TO_L2_TIMEOUT_US);
> > @@ -149,15 +149,15 @@ static int ls_pcie_exit_from_l2(struct dw_pcie_rp *pp)
> >  * Set PF_MCR_EXL2S bit in LS_PCIE_PF_MCR register for the link
> >  * to exit L2 state.
> >  */
> > -   val = ls_pcie_pf_readl(pcie, LS_PCIE_PF_MCR);
> > +   val = ls_pcie_pf_lut_readl(pcie, LS_PCIE_PF_MCR);
> > val |= PF_MCR_EXL2S;
> > -   ls_pcie_pf_writel(pcie, LS_PCIE_PF_MCR, val);
> > +   ls_pcie_pf_lut_writel(pcie, LS_PCIE_PF_MCR, val);
> >  
> > /*
> >  * L2 exit timeout of 10ms is not defined in the specifications,
> >  * it was chosen based on empirical observations.
> >  */
> > -   ret = readx_poll_timeout(ls_pcie_pf_readl_addr, LS_PCIE_PF_MCR,
> > +   ret = readx_poll_timeout(ls_pcie_pf_lut_readl_addr, LS_PCIE_PF_MCR,
> >  val, !(val & PF_MCR_EXL2S),
> >  1000,
> >  1);
> > @@ -245,7 +245,7 @@ static const struct ls_pcie_drvdata ls1021a_drvdata = {
> >  };
> >  
> >  static const struct ls_pcie_drvdata layerscape_drvdata = {
> > -   .pf_off = 0xc,
> > +   .pf_lut_off = 0xc,
> 

Re: [PATCH v3 2/4] PCI: layerscape: Add suspend/resume for ls1021a

2023-11-02 Thread Frank Li
On Thu, Nov 02, 2023 at 10:58:09PM +0530, Manivannan Sadhasivam wrote:
> On Tue, Oct 17, 2023 at 03:31:43PM -0400, Frank Li wrote:
> > Add suspend/resume support for ls1021a.
> > 
> > Implement the callback ls1021a_pcie_send_turnoff_msg(), which writes scfg's
> > SCFG_PEXPMWRCR to issue the PME_Turn_off message.
> > 
> > Implement ls1021a_pcie_exit_from_l2() to let the controller exit the L2 state.
> > 
> 
> I'd like to reword it to better reflect what the patch does:
> 
> "In the suspend path, PME_Turn_Off message is sent to the endpoint to 
> transition
> the link to L2/L3_Ready state. In this SoC, there is no way to check if the
> controller has received the PME_To_Ack from the endpoint or not. So to be on 
> the
> safer side, the driver just waits for PCIE_PME_TO_L2_TIMEOUT_US before 
> asserting
> the SoC specific PMXMTTURNOFF bit to complete the PME_Turn_Off handshake. This
> link would then enter L2/L3 state depending on the VAUX supply.
> 
> In the resume path, the link is brought back from L2 to L0 by doing a software
> reset."
> 
> Although I do have questions on the resume behavior below.
> 
> > Signed-off-by: Frank Li 
> > ---
> > 
> > Notes:
> > Change from v2 to v3
> > - update according to mani's feedback
> > change from v1 to v2
> > - change subject 'a' to 'A'
> > 
> >  drivers/pci/controller/dwc/pci-layerscape.c | 86 -
> >  1 file changed, 85 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/pci/controller/dwc/pci-layerscape.c 
> > b/drivers/pci/controller/dwc/pci-layerscape.c
> > index aea89926bcc4f..6f47cfe146c44 100644
> > --- a/drivers/pci/controller/dwc/pci-layerscape.c
> > +++ b/drivers/pci/controller/dwc/pci-layerscape.c
> > @@ -35,11 +35,21 @@
> >  #define PF_MCR_PTOMR   BIT(0)
> >  #define PF_MCR_EXL2S   BIT(1)
> >  
> > +/* LS1021A PEXn PM Write Control Register */
> > +#define SCFG_PEXPMWRCR(idx)(0x5c + (idx) * 0x64)
> > +#define PMXMTTURNOFF   BIT(31)
> > +#define SCFG_PEXSFTRSTCR   0x190
> > +#define PEXSR(idx) BIT(idx)
> > +
> >  #define PCIE_IATU_NUM  6
> >  
> > +#define LS_PCIE_DRV_SCFG   BIT(0)
> > +
> >  struct ls_pcie_drvdata {
> > const u32 pf_off;
> > +   const struct dw_pcie_host_ops *ops;
> > int (*exit_from_l2)(struct dw_pcie_rp *pp);
> > +   int flags;
> 
> Why not "bool scfg_support"?

It will be easy to add new flags if needed in the future.

> 
> > bool pm_support;
> >  };
> >  
> > @@ -47,6 +57,8 @@ struct ls_pcie {
> > struct dw_pcie *pci;
> > const struct ls_pcie_drvdata *drvdata;
> > void __iomem *pf_base;
> > +   struct regmap *scfg;
> > +   int index;
> > bool big_endian;
> >  };
> >  
> > @@ -171,13 +183,65 @@ static int ls_pcie_host_init(struct dw_pcie_rp *pp)
> > return 0;
> >  }
> >  
> > +static void ls1021a_pcie_send_turnoff_msg(struct dw_pcie_rp *pp)
> > +{
> > +   struct dw_pcie *pci = to_dw_pcie_from_pp(pp);
> > +   struct ls_pcie *pcie = to_ls_pcie(pci);
> > +   u32 val;
> > +
> > +   /* Send PME_Turn_Off message */
> > +   regmap_read(pcie->scfg, SCFG_PEXPMWRCR(pcie->index), &val);
> > +   val |= PMXMTTURNOFF;
> > +   regmap_write(pcie->scfg, SCFG_PEXPMWRCR(pcie->index), val);
> > +
> > +   /*
> > +* There is no specific register to check for PME_To_Ack from endpoint.
> > +* So on the safe side, wait for PCIE_PME_TO_L2_TIMEOUT_US.
> > +*/
> > +   mdelay(PCIE_PME_TO_L2_TIMEOUT_US/1000);
> > +
> > +   /*
> > +* Layerscape hardware reference manual recommends clearing the 
> > PMXMTTURNOFF bit
> > +* to complete the PME_Turn_Off handshake.
> > +*/
> > +   regmap_read(pcie->scfg, SCFG_PEXPMWRCR(pcie->index), &val);
> > +   val &= ~PMXMTTURNOFF;
> > +   regmap_write(pcie->scfg, SCFG_PEXPMWRCR(pcie->index), val);
> > +}
> > +
> > +static int ls1021a_pcie_exit_from_l2(struct dw_pcie_rp *pp)
> > +{
> > +   struct dw_pcie *pci = to_dw_pcie_from_pp(pp);
> > +   struct ls_pcie *pcie = to_ls_pcie(pci);
> > +   u32 val;
> > +
> > +   /* The only way to exit from L2 is to do a software reset */
> 
> So, what exactly does "software reset" mean? Are you resetting the endpoint or
> some specific registers/blocks in the controller?

No, it is a PCIe controller reset. It does not touch the endpoint.

> 
> Also, what if the link goes to L3 in the case of no VAUX?

I am not exactly sure; it should be related to the board design. I suppose
there is not a big difference!

We can improve it when we encounter it.

> 
> > +   regmap_read(pcie->scfg, SCFG_PEXSFTRSTCR, &val);
> > +   val |= PEXSR(pcie->index);
> > +   regmap_write(pcie->scfg, SCFG_PEXSFTRSTCR, val);
> > +
> > +   regmap_read(pcie->scfg, SCFG_PEXSFTRSTCR, &val);
> > +   val &= ~PEXSR(pcie->index);
> > +   regmap_write(pcie->scfg, SCFG_PEXSFTRSTCR, val);
> > +
> > +   return 0;
> > +}
> > +
> >  static const struct dw_pcie_host_ops ls_pcie_host_ops = {
> > .host_init = ls_pcie_host_init,
> > .pme_turn_off = ls_pcie_send_turnoff_msg,
> >  };
> >  
> > +static const struct dw_pcie_host_ops ls1021a_pcie_h

Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

2023-11-02 Thread Sean Christopherson
On Thu, Nov 02, 2023, David Matlack wrote:
> On Thu, Nov 2, 2023 at 9:03 AM Sean Christopherson  wrote:
> >
> > On Thu, Nov 02, 2023, Paolo Bonzini wrote:
> > > On 10/31/23 23:39, David Matlack wrote:
> > > > > > Maybe can you sketch out how you see this proposal being extensible
> > > > > > to using guest_memfd for shared mappings?
> > > > > For in-place conversions, e.g. pKVM, no additional guest_memfd is
> > > > > needed.  What's missing there is the ability to (safely) mmap()
> > > > > guest_memfd, e.g. KVM needs to ensure there are no outstanding
> > > > > references when converting back to private.
> > > > >
> > > > > For TDX/SNP, assuming we don't find a performant and robust way to do
> > > > > in-place conversions, a second fd+offset pair would be needed.
> > > > Is there a way to support non-in-place conversions within a single
> > > > guest_memfd?
> > >
> > > For TDX/SNP, you could have a hook from KVM_SET_MEMORY_ATTRIBUTES to guest
> > > memory.  The hook would invalidate now-private parts if they have a VMA,
> > > causing a SIGSEGV/EFAULT if the host touches them.
> > >
> > > It would forbid mappings from multiple gfns to a single offset of the
> > > guest_memfd, because then the shared vs. private attribute would be tied
> > > to the offset.  This should not be a problem; for example, in the case of
> > > SNP, the RMP already requires a single mapping from host physical address
> > > to guest physical address.
> >
> > I don't see how this can work.  It's not an M:1 scenario (where M is
> > multiple gfns), it's a 1:N scenario (where N is multiple offsets).  The
> > *gfn* doesn't change on a conversion; what needs to change to do a
> > non-in-place conversion is the pfn, which is effectively the
> > guest_memfd+offset pair.
> >
> > So yes, we *could* support non-in-place conversions within a single
> > guest_memfd, but it would require a second offset,
> 
> Why can't KVM free the existing page at guest_memfd+offset and
> allocate a new one when doing non-in-place conversions?

Oh, I see what you're suggesting.  Eww.

It's certainly possible, but it would largely defeat the purpose of why we are
adding guest_memfd in the first place.

For TDX and SNP, the goal is to provide a simple, robust mechanism for isolating
guest private memory so that it's all but impossible for the host to access
private memory.  As things stand, memory for a given guest_memfd is either
private or shared (assuming we support a second guest_memfd per memslot).
I.e. there's no need to track whether a given page/folio in the guest_memfd is
private vs. shared.

We could use memory attributes, but that further complicates things when
intrahost migration (and potentially other multi-user scenarios) comes along,
i.e. when KVM supports linking multiple guest_memfd files to a single inode.
We'd have to ensure that all "struct kvm" instances have identical PRIVATE
attributes for a given *offset* in the inode.  I'm not even sure how feasible
that is for intrahost migration, and that's the *easy* case, because IIRC it's
already a hard requirement that the source and destination have identical
gfn=>guest_memfd bindings, i.e. KVM can somewhat easily reason about gfn
attributes.

But even then, that only helps with the actual migration of the VM, e.g. we'd
still have to figure out how to deal with .mmap() and other shared vs. private
actions when linking a new guest_memfd file against an existing inode.

I haven't seen the pKVM patches for supporting .mmap(), so maybe this is already
a solved problem, but I'd honestly be quite surprised if it all works correctly
if/when KVM supports multiple files per inode.

And I don't see what value non-in-place conversions would add.  The value added
by in-place conversions, aside from the obvious preservation of data, which
isn't relevant to TDX/SNP, is that it doesn't require freeing and reallocating
memory to avoid double-allocating for private vs. shared.  That's especially
nice when hugepages are being used, because reconstituting a hugepage "only"
requires zapping SPTEs.

But if KVM is freeing the private page, it's the same as punching a hole,
probably quite literally, when mapping the gfn as shared.  In every way I can
think of, it's worse.  E.g. it's more complex for KVM, and the PUNCH_HOLE =>
allocation operations must be serialized.
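
For reference, the hole punch described here is the regular fallocate() path
on the guest_memfd file descriptor; a minimal userspace sketch, assuming
gmem_fd came from a prior KVM_CREATE_GUEST_MEMFD call and the range is
page-aligned:

	#define _GNU_SOURCE
	#include <fcntl.h>

	/*
	 * Sketch: deallocate (punch a hole in) a guest_memfd range when the
	 * corresponding gfns are converted to shared, freeing the private
	 * backing pages while keeping the file size unchanged.
	 */
	static int gmem_punch_hole(int gmem_fd, off_t offset, off_t len)
	{
		return fallocate(gmem_fd,
				 FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
				 offset, len);
	}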

Regarding double-allocating, I really, really think we should solve that in the
guest.  I.e. teach Linux-as-a-guest to aggressively convert at 2MiB granularity
and avoid 4KiB conversions.  4KiB conversions aren't just a memory utilization
problem, they're also a performance problem, e.g. shatters hugepages (which KVM
doesn't yet support recovering) and increases TLB pressure for both stage-1 and
stage-2 mappings.


Re: [PATCH v3 2/4] PCI: layerscape: Add suspend/resume for ls1021a

2023-11-02 Thread Manivannan Sadhasivam
On Tue, Oct 17, 2023 at 03:31:43PM -0400, Frank Li wrote:
> Add suspend/resume support for ls1021a.
> 
> Implement the callback ls1021a_pcie_send_turnoff_msg(), which writes scfg's
> SCFG_PEXPMWRCR to issue the PME_Turn_off message.
> 
> Implement ls1021a_pcie_exit_from_l2() to let the controller exit the L2 state.
> 

I'd like to reword it to better reflect what the patch does:

"In the suspend path, PME_Turn_Off message is sent to the endpoint to transition
the link to L2/L3_Ready state. In this SoC, there is no way to check if the
controller has received the PME_To_Ack from the endpoint or not. So to be on the
safer side, the driver just waits for PCIE_PME_TO_L2_TIMEOUT_US before asserting
the SoC specific PMXMTTURNOFF bit to complete the PME_Turn_Off handshake. This
link would then enter L2/L3 state depending on the VAUX supply.

In the resume path, the link is brought back from L2 to L0 by doing a software
reset."

Although I do have questions on the resume behavior below.

> Signed-off-by: Frank Li 
> ---
> 
> Notes:
> Change from v2 to v3
> - update according to mani's feedback
> change from v1 to v2
> - change subject 'a' to 'A'
> 
>  drivers/pci/controller/dwc/pci-layerscape.c | 86 -
>  1 file changed, 85 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/controller/dwc/pci-layerscape.c 
> b/drivers/pci/controller/dwc/pci-layerscape.c
> index aea89926bcc4f..6f47cfe146c44 100644
> --- a/drivers/pci/controller/dwc/pci-layerscape.c
> +++ b/drivers/pci/controller/dwc/pci-layerscape.c
> @@ -35,11 +35,21 @@
>  #define PF_MCR_PTOMR BIT(0)
>  #define PF_MCR_EXL2S BIT(1)
>  
> +/* LS1021A PEXn PM Write Control Register */
> +#define SCFG_PEXPMWRCR(idx)  (0x5c + (idx) * 0x64)
> +#define PMXMTTURNOFF BIT(31)
> +#define SCFG_PEXSFTRSTCR 0x190
> +#define PEXSR(idx)   BIT(idx)
> +
>  #define PCIE_IATU_NUM6
>  
> +#define LS_PCIE_DRV_SCFG BIT(0)
> +
>  struct ls_pcie_drvdata {
>   const u32 pf_off;
> + const struct dw_pcie_host_ops *ops;
>   int (*exit_from_l2)(struct dw_pcie_rp *pp);
> + int flags;

Why not "bool scfg_support"?

>   bool pm_support;
>  };
>  
> @@ -47,6 +57,8 @@ struct ls_pcie {
>   struct dw_pcie *pci;
>   const struct ls_pcie_drvdata *drvdata;
>   void __iomem *pf_base;
> + struct regmap *scfg;
> + int index;
>   bool big_endian;
>  };
>  
> @@ -171,13 +183,65 @@ static int ls_pcie_host_init(struct dw_pcie_rp *pp)
>   return 0;
>  }
>  
> +static void ls1021a_pcie_send_turnoff_msg(struct dw_pcie_rp *pp)
> +{
> + struct dw_pcie *pci = to_dw_pcie_from_pp(pp);
> + struct ls_pcie *pcie = to_ls_pcie(pci);
> + u32 val;
> +
> + /* Send PME_Turn_Off message */
> + regmap_read(pcie->scfg, SCFG_PEXPMWRCR(pcie->index), &val);
> + val |= PMXMTTURNOFF;
> + regmap_write(pcie->scfg, SCFG_PEXPMWRCR(pcie->index), val);
> +
> + /*
> +  * There is no specific register to check for PME_To_Ack from endpoint.
> +  * So on the safe side, wait for PCIE_PME_TO_L2_TIMEOUT_US.
> +  */
> + mdelay(PCIE_PME_TO_L2_TIMEOUT_US/1000);
> +
> + /*
> +  * Layerscape hardware reference manual recommends clearing the 
> PMXMTTURNOFF bit
> +  * to complete the PME_Turn_Off handshake.
> +  */
> + regmap_read(pcie->scfg, SCFG_PEXPMWRCR(pcie->index), &val);
> + val &= ~PMXMTTURNOFF;
> + regmap_write(pcie->scfg, SCFG_PEXPMWRCR(pcie->index), val);
> +}
> +
> +static int ls1021a_pcie_exit_from_l2(struct dw_pcie_rp *pp)
> +{
> + struct dw_pcie *pci = to_dw_pcie_from_pp(pp);
> + struct ls_pcie *pcie = to_ls_pcie(pci);
> + u32 val;
> +
> + /* The only way to exit from L2 is to do a software reset */

So, what exactly does "software reset" mean? Are you resetting the endpoint or
some specific registers/blocks in the controller?

Also, what if the link goes to L3 in the case of no VAUX?

> + regmap_read(pcie->scfg, SCFG_PEXSFTRSTCR, &val);
> + val |= PEXSR(pcie->index);
> + regmap_write(pcie->scfg, SCFG_PEXSFTRSTCR, val);
> +
> + regmap_read(pcie->scfg, SCFG_PEXSFTRSTCR, &val);
> + val &= ~PEXSR(pcie->index);
> + regmap_write(pcie->scfg, SCFG_PEXSFTRSTCR, val);
> +
> + return 0;
> +}
> +
>  static const struct dw_pcie_host_ops ls_pcie_host_ops = {
>   .host_init = ls_pcie_host_init,
>   .pme_turn_off = ls_pcie_send_turnoff_msg,
>  };
>  
> +static const struct dw_pcie_host_ops ls1021a_pcie_host_ops = {
> + .host_init = ls_pcie_host_init,
> + .pme_turn_off = ls1021a_pcie_send_turnoff_msg,
> +};
> +
>  static const struct ls_pcie_drvdata ls1021a_drvdata = {
> - .pm_support = false,
> + .pm_support = true,
> + .ops = &ls1021a_pcie_host_ops,
> + .exit_from_l2 = ls1021a_pcie_exit_from_l2,
> + .flags = LS_PCIE_DRV_SCFG,
>  };
>  
>  static const struct ls_pcie_drvdata layerscape_drvdata = {
> @@ -205,6 +269,8 @@ static int ls_pcie_probe(struct platform_device *pdev)
>   str

Re: [PATCH v3 1/4] PCI: layerscape: Add function pointer for exit_from_l2()

2023-11-02 Thread Frank Li
On Thu, Nov 02, 2023 at 10:28:08PM +0530, Manivannan Sadhasivam wrote:
> On Tue, Oct 17, 2023 at 03:31:42PM -0400, Frank Li wrote:
> > Since different SoCs require different sequences for exiting L2, let's add
> > a separate "exit_from_l2()" callback. This callback can be used to execute
> > the SoC-specific sequence.
> > 
> 
> I missed the fact that this patch honors the return value of the callback 
> (which
> was ignored previously). So this should be added to the description as well.

How about adding the below?

"Change the ls_pcie_exit_from_l2() return value from void to int. Return an
error when exit_from_l2() fails, so that the suspend/resume flow can bail out."

Frank


> 
> > Signed-off-by: Frank Li 
> 
> With that,
> 
> Reviewed-by: Manivannan Sadhasivam 
> 
> - Mani
> 
> > ---
> > 
> > Notes:
> > Change from v2 to v3
> > - fixed according to mani's feedback
> >   1. update commit message
> >   2. move dw_pcie_host_ops to next patch
> >   3. check return value from exit_from_l2()
> > Change from v1 to v2
> > - change subject 'a' to 'A'
> > 
> > Change from v1 to v2
> > - change subject 'a' to 'A'
> > 
> >  drivers/pci/controller/dwc/pci-layerscape.c | 11 +--
> >  1 file changed, 9 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/pci/controller/dwc/pci-layerscape.c 
> > b/drivers/pci/controller/dwc/pci-layerscape.c
> > index 37956e09c65bd..aea89926bcc4f 100644
> > --- a/drivers/pci/controller/dwc/pci-layerscape.c
> > +++ b/drivers/pci/controller/dwc/pci-layerscape.c
> > @@ -39,6 +39,7 @@
> >  
> >  struct ls_pcie_drvdata {
> > const u32 pf_off;
> > +   int (*exit_from_l2)(struct dw_pcie_rp *pp);
> > bool pm_support;
> >  };
> >  
> > @@ -125,7 +126,7 @@ static void ls_pcie_send_turnoff_msg(struct dw_pcie_rp 
> > *pp)
> > dev_err(pcie->pci->dev, "PME_Turn_off timeout\n");
> >  }
> >  
> > -static void ls_pcie_exit_from_l2(struct dw_pcie_rp *pp)
> > +static int ls_pcie_exit_from_l2(struct dw_pcie_rp *pp)
> >  {
> > struct dw_pcie *pci = to_dw_pcie_from_pp(pp);
> > struct ls_pcie *pcie = to_ls_pcie(pci);
> > @@ -150,6 +151,8 @@ static void ls_pcie_exit_from_l2(struct dw_pcie_rp *pp)
> >  1);
> > if (ret)
> > dev_err(pcie->pci->dev, "L2 exit timeout\n");
> > +
> > +   return ret;
> >  }
> >  
> >  static int ls_pcie_host_init(struct dw_pcie_rp *pp)
> > @@ -180,6 +183,7 @@ static const struct ls_pcie_drvdata ls1021a_drvdata = {
> >  static const struct ls_pcie_drvdata layerscape_drvdata = {
> > .pf_off = 0xc,
> > .pm_support = true,
> > +   .exit_from_l2 = ls_pcie_exit_from_l2,
> >  };
> >  
> >  static const struct of_device_id ls_pcie_of_match[] = {
> > @@ -247,11 +251,14 @@ static int ls_pcie_suspend_noirq(struct device *dev)
> >  static int ls_pcie_resume_noirq(struct device *dev)
> >  {
> > struct ls_pcie *pcie = dev_get_drvdata(dev);
> > +   int ret;
> >  
> > if (!pcie->drvdata->pm_support)
> > return 0;
> >  
> > -   ls_pcie_exit_from_l2(&pcie->pci->pp);
> > +   ret = pcie->drvdata->exit_from_l2(&pcie->pci->pp);
> > +   if (ret)
> > +   return ret;
> >  
> > return dw_pcie_resume_noirq(pcie->pci);
> >  }
> > -- 
> > 2.34.1
> > 
> 
> -- 
> மணிவண்ணன் சதாசிவம்


Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

2023-11-02 Thread Sean Christopherson
On Thu, Nov 02, 2023, Paolo Bonzini wrote:
> On 10/31/23 23:39, David Matlack wrote:
> > > > Maybe can you sketch out how you see this proposal being extensible to
> > > > using guest_memfd for shared mappings?
> > > For in-place conversions, e.g. pKVM, no additional guest_memfd is needed. 
> > >  What's
> > > missing there is the ability to (safely) mmap() guest_memfd, e.g. KVM 
> > > needs to
> > > ensure there are no outstanding references when converting back to 
> > > private.
> > > 
> > > For TDX/SNP, assuming we don't find a performant and robust way to do 
> > > in-place
> > > conversions, a second fd+offset pair would be needed.
> > Is there a way to support non-in-place conversions within a single 
> > guest_memfd?
> 
> For TDX/SNP, you could have a hook from KVM_SET_MEMORY_ATTRIBUTES to guest
> memory.  The hook would invalidate now-private parts if they have a VMA,
> causing a SIGSEGV/EFAULT if the host touches them.
> 
> It would forbid mappings from multiple gfns to a single offset of the
> guest_memfd, because then the shared vs. private attribute would be tied to
> the offset.  This should not be a problem; for example, in the case of SNP,
> the RMP already requires a single mapping from host physical address to
> guest physical address.

I don't see how this can work.  It's not an M:1 scenario (where M is multiple
gfns), it's a 1:N scenario (where N is multiple offsets).  The *gfn* doesn't
change on a conversion; what needs to change to do a non-in-place conversion is
the pfn, which is effectively the guest_memfd+offset pair.

So yes, we *could* support non-in-place conversions within a single guest_memfd,
but it would require a second offset, at which point it makes sense to add a
second file descriptor as well.  Userspace could still use a single guest_memfd
instance, i.e. pass in the same file descriptor but different offsets.


Re: [PATCH v3 3/4] PCI: layerscape: Rename pf_* as pf_lut_*

2023-11-02 Thread Manivannan Sadhasivam
On Tue, Oct 17, 2023 at 03:31:44PM -0400, Frank Li wrote:
> 'pf' and 'lut' are just different names in different chips, but basically it
> is an MMIO base address plus an offset.
> 
> Rename it to avoid duplicating pf_* and lut_* in the driver.
> 

"pci-layerscape-ep.c" uses "ls_lut_" prefix and now you are using "pf_lut_". May
I know the difference between these two? Can we just use a common name?

- Mani

> Signed-off-by: Frank Li 
> ---
> 
> Notes:
> change from v1 to v3
> - new patch at v3
> 
>  drivers/pci/controller/dwc/pci-layerscape.c | 34 ++---
>  1 file changed, 17 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/pci/controller/dwc/pci-layerscape.c 
> b/drivers/pci/controller/dwc/pci-layerscape.c
> index 6f47cfe146c44..4b663b20d8612 100644
> --- a/drivers/pci/controller/dwc/pci-layerscape.c
> +++ b/drivers/pci/controller/dwc/pci-layerscape.c
> @@ -46,7 +46,7 @@
>  #define LS_PCIE_DRV_SCFG BIT(0)
>  
>  struct ls_pcie_drvdata {
> - const u32 pf_off;
> + const u32 pf_lut_off;
>   const struct dw_pcie_host_ops *ops;
>   int (*exit_from_l2)(struct dw_pcie_rp *pp);
>   int flags;
> @@ -56,13 +56,13 @@ struct ls_pcie_drvdata {
>  struct ls_pcie {
>   struct dw_pcie *pci;
>   const struct ls_pcie_drvdata *drvdata;
> - void __iomem *pf_base;
> + void __iomem *pf_lut_base;
>   struct regmap *scfg;
>   int index;
>   bool big_endian;
>  };
>  
> -#define ls_pcie_pf_readl_addr(addr)  ls_pcie_pf_readl(pcie, addr)
> +#define ls_pcie_pf_lut_readl_addr(addr)  ls_pcie_pf_lut_readl(pcie, addr)
>  #define to_ls_pcie(x)dev_get_drvdata((x)->dev)
>  
>  static bool ls_pcie_is_bridge(struct ls_pcie *pcie)
> @@ -103,20 +103,20 @@ static void ls_pcie_fix_error_response(struct ls_pcie 
> *pcie)
>   iowrite32(PCIE_ABSERR_SETTING, pci->dbi_base + PCIE_ABSERR);
>  }
>  
> -static u32 ls_pcie_pf_readl(struct ls_pcie *pcie, u32 off)
> +static u32 ls_pcie_pf_lut_readl(struct ls_pcie *pcie, u32 off)
>  {
>   if (pcie->big_endian)
> - return ioread32be(pcie->pf_base + off);
> + return ioread32be(pcie->pf_lut_base + off);
>  
> - return ioread32(pcie->pf_base + off);
> + return ioread32(pcie->pf_lut_base + off);
>  }
>  
> -static void ls_pcie_pf_writel(struct ls_pcie *pcie, u32 off, u32 val)
> +static void ls_pcie_pf_lut_writel(struct ls_pcie *pcie, u32 off, u32 val)
>  {
>   if (pcie->big_endian)
> - iowrite32be(val, pcie->pf_base + off);
> + iowrite32be(val, pcie->pf_lut_base + off);
>   else
> - iowrite32(val, pcie->pf_base + off);
> + iowrite32(val, pcie->pf_lut_base + off);
>  }
>  
>  static void ls_pcie_send_turnoff_msg(struct dw_pcie_rp *pp)
> @@ -126,11 +126,11 @@ static void ls_pcie_send_turnoff_msg(struct dw_pcie_rp 
> *pp)
>   u32 val;
>   int ret;
>  
> - val = ls_pcie_pf_readl(pcie, LS_PCIE_PF_MCR);
> + val = ls_pcie_pf_lut_readl(pcie, LS_PCIE_PF_MCR);
>   val |= PF_MCR_PTOMR;
> - ls_pcie_pf_writel(pcie, LS_PCIE_PF_MCR, val);
> + ls_pcie_pf_lut_writel(pcie, LS_PCIE_PF_MCR, val);
>  
> - ret = readx_poll_timeout(ls_pcie_pf_readl_addr, LS_PCIE_PF_MCR,
> + ret = readx_poll_timeout(ls_pcie_pf_lut_readl_addr, LS_PCIE_PF_MCR,
>val, !(val & PF_MCR_PTOMR),
>PCIE_PME_TO_L2_TIMEOUT_US/10,
>PCIE_PME_TO_L2_TIMEOUT_US);
> @@ -149,15 +149,15 @@ static int ls_pcie_exit_from_l2(struct dw_pcie_rp *pp)
>* Set PF_MCR_EXL2S bit in LS_PCIE_PF_MCR register for the link
>* to exit L2 state.
>*/
> - val = ls_pcie_pf_readl(pcie, LS_PCIE_PF_MCR);
> + val = ls_pcie_pf_lut_readl(pcie, LS_PCIE_PF_MCR);
>   val |= PF_MCR_EXL2S;
> - ls_pcie_pf_writel(pcie, LS_PCIE_PF_MCR, val);
> + ls_pcie_pf_lut_writel(pcie, LS_PCIE_PF_MCR, val);
>  
>   /*
>* L2 exit timeout of 10ms is not defined in the specifications,
>* it was chosen based on empirical observations.
>*/
> - ret = readx_poll_timeout(ls_pcie_pf_readl_addr, LS_PCIE_PF_MCR,
> + ret = readx_poll_timeout(ls_pcie_pf_lut_readl_addr, LS_PCIE_PF_MCR,
>val, !(val & PF_MCR_EXL2S),
>1000,
>1);
> @@ -245,7 +245,7 @@ static const struct ls_pcie_drvdata ls1021a_drvdata = {
>  };
>  
>  static const struct ls_pcie_drvdata layerscape_drvdata = {
> - .pf_off = 0xc,
> + .pf_lut_off = 0xc,
>   .pm_support = true,
>   .exit_from_l2 = ls_pcie_exit_from_l2,
>  };
> @@ -295,7 +295,7 @@ static int ls_pcie_probe(struct platform_device *pdev)
>  
>   pcie->big_endian = of_property_read_bool(dev->of_node, "big-endian");
>  
> - pcie->pf_base = pci->dbi_base + pcie->drvdata->pf_off;
> + pcie->pf_lut_base = pci->dbi_base + pcie->drvdata->pf_lut_off;
>  
>   if (pcie->drvdata->flags & LS_PCIE_DRV_SC

Re: [PATCH v3 4/4] PCI: layerscape: Add suspend/resume for ls1043a

2023-11-02 Thread Manivannan Sadhasivam
On Tue, Oct 17, 2023 at 03:31:45PM -0400, Frank Li wrote:
> Add suspend/resume support for ls1043a.
> Implement ls1043a_pcie_send_turnoff_msg() to send the PME_Turn_Off message.
> Implement ls1043a_pcie_exit_from_l2() to exit from the L2 state.
> 

Please use the suggestion I gave in patch 2/4.

> Signed-off-by: Frank Li 
> ---
> 
> Notes:
> Change from v2 to v3
> - Remove ls_pcie_lut_readl(writel) function
> 
> Change from v1 to v2
> - Update subject 'a' to 'A'
> 
>  drivers/pci/controller/dwc/pci-layerscape.c | 86 -
>  1 file changed, 85 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/controller/dwc/pci-layerscape.c 
> b/drivers/pci/controller/dwc/pci-layerscape.c
> index 4b663b20d8612..9656224960b0c 100644
> --- a/drivers/pci/controller/dwc/pci-layerscape.c
> +++ b/drivers/pci/controller/dwc/pci-layerscape.c
> @@ -41,6 +41,15 @@
>  #define SCFG_PEXSFTRSTCR 0x190
>  #define PEXSR(idx)   BIT(idx)
>  
> +/* LS1043A PEX PME control register */
> +#define SCFG_PEXPMECR0x144
> +#define PEXPME(idx)  BIT(31 - (idx) * 4)
> +
> +/* LS1043A PEX LUT debug register */
> +#define LS_PCIE_LDBG 0x7fc
> +#define LDBG_SR  BIT(30)
> +#define LDBG_WE  BIT(31)
> +
>  #define PCIE_IATU_NUM6
>  
>  #define LS_PCIE_DRV_SCFG BIT(0)
> @@ -227,6 +236,68 @@ static int ls1021a_pcie_exit_from_l2(struct dw_pcie_rp 
> *pp)
>   return 0;
>  }
>  
> +static void ls1043a_pcie_send_turnoff_msg(struct dw_pcie_rp *pp)
> +{
> + struct dw_pcie *pci = to_dw_pcie_from_pp(pp);
> + struct ls_pcie *pcie = to_ls_pcie(pci);
> + u32 val;
> +
> + if (!pcie->scfg) {
> + dev_dbg(pcie->pci->dev, "SYSCFG is NULL\n");
> + return;
> + }

Why is scfg optional for this SoC and not for the other one added in patch 2/4?

> +
> + /* Send Turn_off message */
> + regmap_read(pcie->scfg, SCFG_PEXPMECR, &val);
> + val |= PEXPME(pcie->index);
> + regmap_write(pcie->scfg, SCFG_PEXPMECR, val);
> +

In my previous review, I asked you to use a common function and just pass the
offsets, as the sequence is same for both the SoCs. But you ignored it :/

> + /*
> +  * There is no specific register to check for PME_To_Ack from endpoint.
> +  * So on the safe side, wait for PCIE_PME_TO_L2_TIMEOUT_US.
> +  */
> + mdelay(PCIE_PME_TO_L2_TIMEOUT_US/1000);
> +
> + /*
> +  * Layerscape hardware reference manual recommends clearing the 
> PMXMTTURNOFF bit
> +  * to complete the PME_Turn_Off handshake.
> +  */
> + regmap_read(pcie->scfg, SCFG_PEXPMECR, &val);
> + val &= ~PEXPME(pcie->index);
> + regmap_write(pcie->scfg, SCFG_PEXPMECR, val);
> +}
> +
> +static int ls1043a_pcie_exit_from_l2(struct dw_pcie_rp *pp)
> +{
> + struct dw_pcie *pci = to_dw_pcie_from_pp(pp);
> + struct ls_pcie *pcie = to_ls_pcie(pci);
> + u32 val;
> +
> + /*
> +  * The only way to let the PEX module exit L2 is to do a software reset.

Same comment applies as patch 2/4.

- Mani

-- 
மணிவண்ணன் சதாசிவம்


Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory

2023-11-02 Thread Paolo Bonzini
On Thu, Nov 2, 2023 at 4:38 PM Sean Christopherson  wrote:
> Actually, looking at this again, there's not actually a hard dependency on
> THP.  A THP-enabled kernel _probably_ gives a higher probability of using
> hugepages, but mostly because THP selects COMPACTION, and I suppose because
> using THP for other allocations reduces overall fragmentation.

Yes, that's why I didn't even bother enabling it unless THP is
enabled, but it makes even more sense to just try.

> So rather than honor KVM_GUEST_MEMFD_ALLOW_HUGEPAGE iff THP is enabled, I
> think we should do the below (I verified KVM can create hugepages with
> THP=n).  We'll need another capability, but (a) we probably should have that
> anyways and (b) it provides a cleaner path to adding PUD-sized hugepage
> support in the future.

I wonder if we need KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE though. This
should be a generic kernel API and in fact the sizes are available in
a not-so-friendly format in /sys/kernel/mm/hugepages.

We should just add /sys/kernel/mm/hugepages/sizes that contains
"2097152 1073741824" on x86 (only the former if 1G pages are not
supported).
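
For context, the "not-so-friendly format" means userspace today has to parse
the per-size directory names under /sys/kernel/mm/hugepages; a quick sketch of
that parsing (this is illustrative scaffolding, not code from the series):

	#include <dirent.h>
	#include <stdio.h>

	/*
	 * Sketch: enumerate supported hugepage sizes by parsing
	 * /sys/kernel/mm/hugepages/hugepages-<size>kB directory names.
	 * On typical x86 this prints 2097152 and 1073741824.
	 */
	int main(void)
	{
		DIR *dir = opendir("/sys/kernel/mm/hugepages");
		struct dirent *d;
		unsigned long kb;

		if (!dir)
			return 1;
		while ((d = readdir(dir)))
			if (sscanf(d->d_name, "hugepages-%lukB", &kb) == 1)
				printf("%lu\n", kb * 1024);
		closedir(dir);
		return 0;
	}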

Plus: is this the best API if we need something else for 1G pages?

Let's drop *this* patch and proceed incrementally. (Again, this is
what I want to do with this final review: identify places that are
still sticky, and don't let them block the rest).

Coincidentially we have an open spot next week at plumbers. Let's
extend Fuad's section to cover more guestmem work.

Paolo

> diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c 
> b/tools/testing/selftests/kvm/guest_memfd_test.c
> index c15de9852316..c9f449718fce 100644
> --- a/tools/testing/selftests/kvm/guest_memfd_test.c
> +++ b/tools/testing/selftests/kvm/guest_memfd_test.c
> @@ -201,6 +201,10 @@ int main(int argc, char *argv[])
>
> TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_MEMFD));
>
> +   if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE) && thp_configured())
> +   TEST_ASSERT_EQ(get_trans_hugepagesz(),
> +  kvm_check_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE));
> +
> page_size = getpagesize();
> total_size = page_size * 4;
>
> diff --git 
> a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c 
> b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
> index be311944e90a..245901587ed2 100644
> --- a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
> +++ b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
> @@ -396,7 +396,7 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t
>
> vm_enable_cap(vm, KVM_CAP_EXIT_HYPERCALL, (1 << KVM_HC_MAP_GPA_RANGE));
>
> -   if (backing_src_can_be_huge(src_type))
> +   if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE))
> memfd_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
> else
> memfd_flags = 0;
>
> --
> From: Sean Christopherson 
> Date: Wed, 25 Oct 2023 16:26:41 -0700
> Subject: [PATCH] KVM: Add best-effort hugepage support for dedicated guest
>  memory
>
> Extend guest_memfd to allow backing guest memory with hugepages.  For now,
> make hugepage utilization best-effort, i.e. fall back to non-huge mappings
> if a hugepage can't be allocated.  Guaranteeing hugepages would require a
> dedicated memory pool and significantly more complexity and churn.
>
> Require userspace to opt-in via a flag even though it's unlikely real use
> cases will ever want to use order-0 pages, e.g. to give userspace a safety
> valve in case hugepage support is buggy, and to allow for easier testing
> of both paths.
>
> Do not take a dependency on CONFIG_TRANSPARENT_HUGEPAGE, as THP enabling
> primarily deals with userspace page tables, which are explicitly not in
> play for guest_memfd.  Selecting THP does make obtaining hugepages more
> likely, but only because THP selects CONFIG_COMPACTION.  Don't select
> CONFIG_COMPACTION either, because again it's not a hard dependency.
>
> For simplicity, require the guest_memfd size to be a multiple of the
> hugepage size, e.g. so that KVM doesn't need to do bounds checking when
> deciding whether or not to allocate a huge folio.
>
> When reporting the max order when KVM gets a pfn from guest_memfd, force
> order-0 pages if the hugepage is not fully contained by the memslot
> binding, e.g. if userspace requested hugepages but punches a hole in the
> memslot bindings in order to emulate x86's VGA hole.
>
> Signed-off-by: Sean Christopherson 
> ---
>  Documentation/virt/kvm/api.rst | 17 +
>  include/uapi/linux/kvm.h   |  3 ++
>  virt/kvm/guest_memfd.c | 69 +-
>  virt/kvm/kvm_main.c|  4 ++
>  4 files changed, 84 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index e82c69d5e755..ccdd5413920d 100644
> --- a/Documen

Re: [PATCH v3 1/4] PCI: layerscape: Add function pointer for exit_from_l2()

2023-11-02 Thread Manivannan Sadhasivam
On Tue, Oct 17, 2023 at 03:31:42PM -0400, Frank Li wrote:
> Since different SoCs require different sequences for exiting L2, let's add
> a separate "exit_from_l2()" callback. This callback can be used to execute
> the SoC-specific sequence.
> 

I missed the fact that this patch honors the return value of the callback (which
was ignored previously). So this should be added to the description as well.

> Signed-off-by: Frank Li 

With that,

Reviewed-by: Manivannan Sadhasivam 

- Mani

> ---
> 
> Notes:
> Change from v2 to v3
> - fixed according to mani's feedback
>   1. update commit message
>   2. move dw_pcie_host_ops to next patch
>   3. check return value from exit_from_l2()
> Change from v1 to v2
> - change subject 'a' to 'A'
> 
> Change from v1 to v2
> - change subject 'a' to 'A'
> 
>  drivers/pci/controller/dwc/pci-layerscape.c | 11 +--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/controller/dwc/pci-layerscape.c 
> b/drivers/pci/controller/dwc/pci-layerscape.c
> index 37956e09c65bd..aea89926bcc4f 100644
> --- a/drivers/pci/controller/dwc/pci-layerscape.c
> +++ b/drivers/pci/controller/dwc/pci-layerscape.c
> @@ -39,6 +39,7 @@
>  
>  struct ls_pcie_drvdata {
>   const u32 pf_off;
> + int (*exit_from_l2)(struct dw_pcie_rp *pp);
>   bool pm_support;
>  };
>  
> @@ -125,7 +126,7 @@ static void ls_pcie_send_turnoff_msg(struct dw_pcie_rp 
> *pp)
>   dev_err(pcie->pci->dev, "PME_Turn_off timeout\n");
>  }
>  
> -static void ls_pcie_exit_from_l2(struct dw_pcie_rp *pp)
> +static int ls_pcie_exit_from_l2(struct dw_pcie_rp *pp)
>  {
>   struct dw_pcie *pci = to_dw_pcie_from_pp(pp);
>   struct ls_pcie *pcie = to_ls_pcie(pci);
> @@ -150,6 +151,8 @@ static void ls_pcie_exit_from_l2(struct dw_pcie_rp *pp)
>1);
>   if (ret)
>   dev_err(pcie->pci->dev, "L2 exit timeout\n");
> +
> + return ret;
>  }
>  
>  static int ls_pcie_host_init(struct dw_pcie_rp *pp)
> @@ -180,6 +183,7 @@ static const struct ls_pcie_drvdata ls1021a_drvdata = {
>  static const struct ls_pcie_drvdata layerscape_drvdata = {
>   .pf_off = 0xc,
>   .pm_support = true,
> + .exit_from_l2 = ls_pcie_exit_from_l2,
>  };
>  
>  static const struct of_device_id ls_pcie_of_match[] = {
> @@ -247,11 +251,14 @@ static int ls_pcie_suspend_noirq(struct device *dev)
>  static int ls_pcie_resume_noirq(struct device *dev)
>  {
>   struct ls_pcie *pcie = dev_get_drvdata(dev);
> + int ret;
>  
>   if (!pcie->drvdata->pm_support)
>   return 0;
>  
> - ls_pcie_exit_from_l2(&pcie->pci->pp);
> + ret = pcie->drvdata->exit_from_l2(&pcie->pci->pp);
> + if (ret)
> + return ret;
>  
>   return dw_pcie_resume_noirq(pcie->pci);
>  }
> -- 
> 2.34.1
> 

-- 
மணிவண்ணன் சதாசிவம்


Re: [PATCH v13 15/35] fs: Export anon_inode_getfile_secure() for use by KVM

2023-11-02 Thread Christian Brauner
On Fri, Oct 27, 2023 at 11:21:57AM -0700, Sean Christopherson wrote:
> Export anon_inode_getfile_secure() so that it can be used by KVM to create
> and manage file-based guest memory without needing a full-blown filesystem.
> The "standard" anon_inode_getfd() doesn't work for KVM's use case as KVM
> needs a unique inode for each file, e.g. to be able to independently
> manage the size and lifecycle of a given file.
> 
> Note, KVM doesn't need a "secure" version, just unique inodes, i.e. ignore
> the name.
> 
> Signed-off-by: Sean Christopherson 
> ---

Before we enshrine this misleading name let's rename this to:

create_anon_inode_getfile()

I don't claim it's a great name but it's better than *_secure() which is
very confusing. So just:

struct file *create_anon_inode_getfile(const char *name,
                                       const struct file_operations *fops,
                                       void *priv, int flags)

May also just remove that context_inode argument from the exported
function. The only other caller is io_uring. And neither it nor this
patchset need the context_inode thing afaict. Merge conflict risk is
extremely low so carrying that as part of this patchset is fine and
shouldn't cause huge issues for you.
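
For illustration, a KVM call site under the proposed rename might look roughly
like this sketch; kvm_gmem_fops and the "[kvm-gmem]" name are assumptions, and
the signature simply follows the suggestion above:

	static struct file *kvm_gmem_create_file(void *priv)
	{
		/* Hypothetical call site; fops and name are placeholders. */
		return create_anon_inode_getfile("[kvm-gmem]", &kvm_gmem_fops,
						 priv, O_RDWR);
	}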


Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

2023-11-02 Thread David Matlack
On Thu, Nov 2, 2023 at 9:03 AM Sean Christopherson  wrote:
>
> On Thu, Nov 02, 2023, Paolo Bonzini wrote:
> > On 10/31/23 23:39, David Matlack wrote:
> > > > > Maybe can you sketch out how you see this proposal being extensible to
> > > > > using guest_memfd for shared mappings?
> > > > For in-place conversions, e.g. pKVM, no additional guest_memfd is
> > > > needed.  What's missing there is the ability to (safely) mmap()
> > > > guest_memfd, e.g. KVM needs to ensure there are no outstanding
> > > > references when converting back to private.
> > > >
> > > > For TDX/SNP, assuming we don't find a performant and robust way to do
> > > > in-place conversions, a second fd+offset pair would be needed.
> > > Is there a way to support non-in-place conversions within a single
> > > guest_memfd?
> >
> > For TDX/SNP, you could have a hook from KVM_SET_MEMORY_ATTRIBUTES to guest
> > memory.  The hook would invalidate now-private parts if they have a VMA,
> > causing a SIGSEGV/EFAULT if the host touches them.
> >
> > It would forbid mappings from multiple gfns to a single offset of the
> > guest_memfd, because then the shared vs. private attribute would be tied to
> > the offset.  This should not be a problem; for example, in the case of SNP,
> > the RMP already requires a single mapping from host physical address to
> > guest physical address.
>
> I don't see how this can work.  It's not an M:1 scenario (where M is multiple
> gfns), it's a 1:N scenario (where N is multiple offsets).  The *gfn* doesn't
> change on a conversion; what needs to change to do a non-in-place conversion
> is the pfn, which is effectively the guest_memfd+offset pair.
>
> So yes, we *could* support non-in-place conversions within a single
> guest_memfd, but it would require a second offset,

Why can't KVM free the existing page at guest_memfd+offset and
allocate a new one when doing non-in-place conversions?

> at which point it makes sense to add a
> second file descriptor as well.  Userspace could still use a single
> guest_memfd instance, i.e. pass in the same file descriptor but different
> offsets.


Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace

2023-11-02 Thread Sean Christopherson
On Thu, Nov 02, 2023, Paolo Bonzini wrote:
> On 11/2/23 10:35, Huang, Kai wrote:
> > IIUC KVM can already handle the case of poisoned
> > page by sending signal to user app:
> > 
> > static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu,
> >                                 struct kvm_page_fault *fault)
> > {
> >         ...
> > 
> >         if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
> >                 kvm_send_hwpoison_signal(fault->slot, fault->gfn);

No, this doesn't work, because that signals the host virtual address

unsigned long hva = gfn_to_hva_memslot(slot, gfn);

send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva, PAGE_SHIFT, current);

which is the *shared* page.

> > return RET_PF_RETRY;
> > }
> > }
> 
> EHWPOISON is not implemented by this series, so it should be left out of the
> documentation.

EHWPOISON *is* implemented.  kvm_gmem_get_pfn() returns -EHWPOISON as
appropriate, and kvm_faultin_pfn() returns that directly without going through
kvm_handle_error_pfn().

  kvm_faultin_pfn_private()
  |
  |-> kvm_gmem_get_pfn()
      |
      |-> if (folio_test_hwpoison(folio)) {
                  r = -EHWPOISON;
                  goto out_unlock;
          }

  |
  |-> r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn,
                           &max_order);
      if (r) {
              kvm_mmu_prepare_memory_fault_exit(vcpu, fault);
              return r;
      }

  |
  |-> ret = __kvm_faultin_pfn(vcpu, fault);
      if (ret != RET_PF_CONTINUE)
              return ret;

      if (unlikely(is_error_pfn(fault->pfn)))
              return kvm_handle_error_pfn(vcpu, fault);


Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace

2023-11-02 Thread Sean Christopherson
On Thu, Nov 02, 2023, Xiaoyao Li wrote:
> On 11/2/2023 1:36 AM, Sean Christopherson wrote:
> > > KVM_CAP_MEMORY_FAULT_INFO is x86 only, is it better to put this function
> > > to ?
> > I'd prefer to keep it in generic code, as it's highly likely to end up there
> > sooner than later.  There's a known use case for ARM (exit to userspace on
> > missing userspace mapping[*]), and I'm guessing pKVM (also ARM) will also
> > utilize this API.
> > 
> > [*]https://lore.kernel.org/all/20230908222905.1321305-8-amoor...@google.com
> 
> I wonder how this CAP is supposed to be checked in userspace, for the
> guest_memfd case?

It's basically useless for guest_memfd.

>   if (!kvm_check_extension(s, KVM_CAP_MEMORY_FAULT_INFO) &&
>   run->exit_reason == KVM_EXIT_MEMORY_FAULT)
>   abort("unexpected KVM_EXIT_MEMORY_FAULT");
> 
> In my implementation of QEMU patches, I find it's unnecessary. When
> userspace gets an exit with KVM_EXIT_MEMORY_FAULT, it implies
> "KVM_CAP_MEMORY_FAULT_INFO".
> 
> So I don't see how it is necessary in this series. Whether it's necessary or
> not for [*], I don't have the answer but we can leave the discussion to that
> patch series.

It's not strictly necessary there either.

However, Oliver felt (and presumably still feels) quite strongly, and I agree,
that reporting extra information shouldn't be tightly coupled to
KVM_CAP_EXIT_ON_MISSING or KVM_CAP_GUEST_MEMFD.

E.g. if userspace develops a "standalone" use case for KVM_CAP_MEMORY_FAULT_INFO,
userspace should be able to check for support without having to take a
dependency on KVM_CAP_GUEST_MEMFD, especially since KVM_CAP_GUEST_MEMFD may not
be supported, i.e. userspace should be able to do:

if (!kvm_check_extension(s, KVM_CAP_MEMORY_FAULT_INFO))
abort("KVM_CAP_MEMORY_FAULT_INFO required for fancy feature 
XYZ");




Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

2023-11-02 Thread Paolo Bonzini

On 10/31/23 23:39, David Matlack wrote:

> > > Maybe can you sketch out how you see this proposal being extensible to
> > > using guest_memfd for shared mappings?
> > 
> > For in-place conversions, e.g. pKVM, no additional guest_memfd is needed.
> > What's missing there is the ability to (safely) mmap() guest_memfd, e.g. KVM
> > needs to ensure there are no outstanding references when converting back to
> > private.
> > 
> > For TDX/SNP, assuming we don't find a performant and robust way to do
> > in-place conversions, a second fd+offset pair would be needed.
> 
> Is there a way to support non-in-place conversions within a single guest_memfd?


For TDX/SNP, you could have a hook from KVM_SET_MEMORY_ATTRIBUTES to 
guest memory.  The hook would invalidate now-private parts if they have 
a VMA, causing a SIGSEGV/EFAULT if the host touches them.


It would forbid mappings from multiple gfns to a single offset of the 
guest_memfd, because then the shared vs. private attribute would be tied 
to the offset.  This should not be a problem; for example, in the case 
of SNP, the RMP already requires a single mapping from host physical 
address to guest physical address.


Paolo



Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace

2023-11-02 Thread Sean Christopherson
On Thu, Nov 02, 2023, Kai Huang wrote:
> On Wed, 2023-11-01 at 10:36 -0700, Sean Christopherson wrote:
> > On Wed, Nov 01, 2023, Kai Huang wrote:
> > > 
> > > > +7.34 KVM_CAP_MEMORY_FAULT_INFO
> > > > +--
> > > > +
> > > > +:Architectures: x86
> > > > +:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
> > > > +
> > > > +The presence of this capability indicates that KVM_RUN will fill
> > > > +kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, 
> > > > e.g. if
> > > > +there is a valid memslot but no backing VMA for the corresponding host 
> > > > virtual
> > > > +address.
> > > > +
> > > > +The information in kvm_run.memory_fault is valid if and only if 
> > > > KVM_RUN returns
> > > > +an error with errno=EFAULT or errno=EHWPOISON *and* 
> > > > kvm_run.exit_reason is set
> > > > +to KVM_EXIT_MEMORY_FAULT.
> > > 
> > > IIUC returning -EFAULT or whatever -errno is sort of a KVM-internal
> > > implementation detail.
> > 
> > The errno that is returned to userspace is ABI.  In KVM, it's a _very_
> > poorly defined ABI for the vast majority of ioctls(), but it's still
> > technically ABI.  KVM gets away with being cavalier with errno because the
> > vast majority of errors are considered fatal by userspace, i.e. in most
> > cases, userspace simply doesn't care about the exact errno.
> > 
> > A good example is KVM_RUN with -EINTR; if KVM were to return something
> > other than -EINTR on a pending signal or vcpu->run->immediate_exit,
> > userspace would fall over.
> > 
> > > Is it better to relax the validity of kvm_run.memory_fault when
> > > KVM_RUN returns any -errno?
> > 
> > Not unless there's a need to do so, and if there is then we can update the
> > documentation accordingly.  If KVM's ABI is that kvm_run.memory_fault is
> > valid for any errno, then KVM would need to purge kvm_run.exit_reason super
> > early in KVM_RUN, e.g. to prevent an -EINTR return due to immediate_exit
> > from being misinterpreted as KVM_EXIT_MEMORY_FAULT.  And purging exit_reason
> > super early is subtly tricky because KVM's (again, poorly documented) ABI is
> > that *some* exit reasons are preserved across KVM_RUN with
> > vcpu->run->immediate_exit (or with a pending signal).
> > 
> > https://lore.kernel.org/all/zffbwoxz5ui%2fg...@google.com
> > 
> > 
> 
> Agreed with not relaxing to any errno.  However using -EFAULT as part of the
> ABI definition seems a little bit dangerous, e.g., someone could accidentally
> or mistakenly return -EFAULT in KVM_RUN at an early time and/or in a
> completely different code path, etc.  -EINTR has a well defined meaning, but
> -EFAULT (which is "Bad address") doesn't seem to, though I am not sure
> either. :-)

KVM has returned -EFAULT since forever, i.e. it's effectively already part of
the ABI.  I doubt there's a userspace that relies precisely on -EFAULT, but
userspace definitely will be confused if KVM returns '0' where KVM used to
return -EFAULT.  And so if we want to return '0', it needs to be opt-in, which
means forcing userspace to enable a capability *and* requires code in KVM to
conditionally return '0' instead of -EFAULT/-EHWPOISON.

> One example is, for backing VMA with VM_IO | VM_PFNMAP, hva_to_pfn() returns
> KVM_PFN_ERR_FAULT when the kernel cannot get a valid PFN (e.g. when SGX vepc
> fault handler failed to allocate EPC) and kvm_handle_error_pfn() will just
> return -EFAULT.  If kvm_run.exit_reason isn't purged early then is it possible
> to have some issue here?

Well, yeah, but that's exactly why this series has a patch to reset exit_reason.
The solution to "if KVM is buggy then bad things happen" is to not have KVM 
bugs :-)
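
To make the documented contract concrete, a hedged sketch of the check a VMM
would perform after KVM_RUN; the memory_fault fields and the
KVM_MEMORY_EXIT_FLAG_PRIVATE flag are the ones added by this series, while the
helper itself is made up:

	#include <errno.h>
	#include <stdio.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	static int handle_run(int vcpu_fd, struct kvm_run *run)
	{
		int ret = ioctl(vcpu_fd, KVM_RUN, 0);

		/*
		 * Per the ABI above: kvm_run.memory_fault is valid if and only
		 * if KVM_RUN failed with EFAULT or EHWPOISON *and* exit_reason
		 * was set to KVM_EXIT_MEMORY_FAULT.
		 */
		if (ret < 0 && (errno == EFAULT || errno == EHWPOISON) &&
		    run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
			int priv = run->memory_fault.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE;

			printf("memory fault: gpa=0x%llx size=0x%llx %s\n",
			       (unsigned long long)run->memory_fault.gpa,
			       (unsigned long long)run->memory_fault.size,
			       priv ? "private" : "shared");
			return 0;	/* userspace may be able to resolve this */
		}
		return ret;
	}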


Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory

2023-11-02 Thread Sean Christopherson
On Thu, Nov 02, 2023, Paolo Bonzini wrote:
> On Wed, Nov 1, 2023 at 11:35 PM Sean Christopherson  wrote:
> >
> > On Wed, Nov 01, 2023, Paolo Bonzini wrote:
> > > On 11/1/23 17:36, Sean Christopherson wrote:
> > > > Can you post a fixup patch?  It's not clear to me exactly what behavior 
> > > > you intend
> > > > to end up with.
> > >
> > > Sure, just this:
> > >
> > > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > > index 7d1a33c2ad42..34fd070e03d9 100644
> > > --- a/virt/kvm/guest_memfd.c
> > > +++ b/virt/kvm/guest_memfd.c
> > > @@ -430,10 +430,7 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
> > >  {
> > >   loff_t size = args->size;
> > >   u64 flags = args->flags;
> > > - u64 valid_flags = 0;
> > > -
> > > - if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
> > > - valid_flags |= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
> > > + u64 valid_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
> > >   if (flags & ~valid_flags)
> > >   return -EINVAL;
> > > @@ -441,11 +438,9 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
> > >   if (size < 0 || !PAGE_ALIGNED(size))
> > >   return -EINVAL;
> > > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > >   if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
> > >   !IS_ALIGNED(size, HPAGE_PMD_SIZE))
> > >   return -EINVAL;
> > > -#endif
> >
> > That won't work, HPAGE_PMD_SIZE is valid only for CONFIG_TRANSPARENT_HUGEPAGE=y.
> >
> > #else /* CONFIG_TRANSPARENT_HUGEPAGE */
> >
> > #else /* CONFIG_TRANSPARENT_HUGEPAGE */
> > #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
> > #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
> > #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; })
> 
> Would have caught it when actually testing it, I guess. :) It has to
> be PMD_SIZE, possibly with
> 
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> BUILD_BUG_ON(HPAGE_PMD_SIZE != PMD_SIZE);
> #endif

Yeah, that works for me.

Actually, looking at this again, there's not actually a hard dependency on THP.
A THP-enabled kernel _probably_ gives a higher probability of using hugepages,
but mostly because THP selects COMPACTION, and I suppose because using THP for
other allocations reduces overall fragmentation.

So rather than honor KVM_GUEST_MEMFD_ALLOW_HUGEPAGE iff THP is enabled, I think
we should do the below (I verified KVM can create hugepages with THP=n).  We'll
need another capability, but (a) we probably should have that anyways and (b) it
provides a cleaner path to adding PUD-sized hugepage support in the future.

And then adjust the tests like so:

diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index c15de9852316..c9f449718fce 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -201,6 +201,10 @@ int main(int argc, char *argv[])
 
	TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_MEMFD));
 
+	if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE) && thp_configured())
+		TEST_ASSERT_EQ(get_trans_hugepagesz(),
+			       kvm_check_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE));
+
	page_size = getpagesize();
	total_size = page_size * 4;
 
diff --git a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
index be311944e90a..245901587ed2 100644
--- a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
+++ b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
@@ -396,7 +396,7 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t
 
	vm_enable_cap(vm, KVM_CAP_EXIT_HYPERCALL, (1 << KVM_HC_MAP_GPA_RANGE));
 
-	if (backing_src_can_be_huge(src_type))
+	if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE))
		memfd_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE;
	else
		memfd_flags = 0;

--
From: Sean Christopherson 
Date: Wed, 25 Oct 2023 16:26:41 -0700
Subject: [PATCH] KVM: Add best-effort hugepage support for dedicated guest
 memory

Extend guest_memfd to allow backing guest memory with hugepages.  For now,
make hugepage utilization best-effort, i.e. fall back to non-huge mappings
if a hugepage can't be allocated.  Guaranteeing hugepages would require a
dedicated memory pool and significantly more complexity and churn.

Require userspace to opt-in via a flag even though it's unlikely real use
cases will ever want to use order-0 pages, e.g. to give userspace a safety
valve in case hugepage support is buggy, and to allow for easier testing
of both paths.

Do not take a dependency on CONFIG_TRANSPARENT_HUGEPAGE, as THP enabling
primarily deals with userspace page tables, which are explicitly not in
play for guest_memfd.  Selecting THP does make obtaining hugepages more
likely, but only because THP selects CONFIG_COMPACTION.  Don't select
CONFIG_COMPACTION either
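
Purely to illustrate the "best-effort" idea described above (the actual patch
body is cut off here), a hypothetical allocation helper could look like this;
the name, context and exact fallback contract are all assumptions, not the
real implementation:

	/*
	 * Hypothetical sketch: try a PMD-sized folio; returning NULL just
	 * tells the caller to fall back to an order-0 folio.  Note there is
	 * no CONFIG_TRANSPARENT_HUGEPAGE dependency.
	 */
	static struct folio *kvm_gmem_try_huge_folio(struct inode *inode, pgoff_t index)
	{
		const unsigned int order = PMD_SHIFT - PAGE_SHIFT;

		/* Only hugepage-aligned offsets can be backed by a huge folio. */
		if (!IS_ALIGNED(index, 1UL << order))
			return NULL;

		return filemap_alloc_folio(mapping_gfp_mask(inode->i_mapping) |
					   __GFP_NOWARN, order);
	}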

Re: [PATCH v13 22/35] KVM: Allow arch code to track number of memslot address spaces per VM

2023-11-02 Thread Fuad Tabba
On Fri, Oct 27, 2023 at 7:23 PM Sean Christopherson  wrote:
>
> Let x86 track the number of address spaces on a per-VM basis so that KVM
> can disallow SMM memslots for confidential VMs.  Confidential VMs are
> fundamentally incompatible with emulating SMM, which as the name suggests
> requires being able to read and write guest memory and register state.
>
> Disallowing SMM will simplify support for guest private memory, as KVM
> will not need to worry about tracking memory attributes for multiple
> address spaces (SMM is the only "non-default" address space across all
> architectures).
>
> Signed-off-by: Sean Christopherson 
> ---

Reviewed-by: Fuad Tabba 
Tested-by: Fuad Tabba 

Cheers,
/fuad

>  arch/powerpc/kvm/book3s_hv.c|  2 +-
>  arch/x86/include/asm/kvm_host.h |  8 +++-
>  arch/x86/kvm/debugfs.c  |  2 +-
>  arch/x86/kvm/mmu/mmu.c  |  6 +++---
>  arch/x86/kvm/x86.c  |  2 +-
>  include/linux/kvm_host.h| 17 +++--
>  virt/kvm/dirty_ring.c   |  2 +-
>  virt/kvm/kvm_main.c | 26 ++
>  8 files changed, 39 insertions(+), 26 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 130bafdb1430..9b0eaa17275a 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -6084,7 +6084,7 @@ static int kvmhv_svm_off(struct kvm *kvm)
> }
>
> srcu_idx = srcu_read_lock(&kvm->srcu);
> -   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> +   for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
> struct kvm_memory_slot *memslot;
> struct kvm_memslots *slots = __kvm_memslots(kvm, i);
> int bkt;
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 6702f795c862..f9e8d5642069 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2124,9 +2124,15 @@ enum {
>  #define HF_SMM_MASK(1 << 1)
>  #define HF_SMM_INSIDE_NMI_MASK (1 << 2)
>
> -# define KVM_ADDRESS_SPACE_NUM 2
> +# define KVM_MAX_NR_ADDRESS_SPACES 2
>  # define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 1 : 0)
>  # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
> +
> +static inline int kvm_arch_nr_memslot_as_ids(struct kvm *kvm)
> +{
> +   return KVM_MAX_NR_ADDRESS_SPACES;
> +}
> +
>  #else
>  # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, 0)
>  #endif
> diff --git a/arch/x86/kvm/debugfs.c b/arch/x86/kvm/debugfs.c
> index ee8c4c3496ed..42026b3f3ff3 100644
> --- a/arch/x86/kvm/debugfs.c
> +++ b/arch/x86/kvm/debugfs.c
> @@ -111,7 +111,7 @@ static int kvm_mmu_rmaps_stat_show(struct seq_file *m, void *v)
> mutex_lock(&kvm->slots_lock);
> write_lock(&kvm->mmu_lock);
>
> -   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> +   for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
> int bkt;
>
> slots = __kvm_memslots(kvm, i);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c4e758f0aebb..baeba8fc1c38 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3755,7 +3755,7 @@ static int mmu_first_shadow_root_alloc(struct kvm *kvm)
> kvm_page_track_write_tracking_enabled(kvm))
> goto out_success;
>
> -   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> +   for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
> slots = __kvm_memslots(kvm, i);
> kvm_for_each_memslot(slot, bkt, slots) {
> /*
> @@ -6294,7 +6294,7 @@ static bool kvm_rmap_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_e
> if (!kvm_memslots_have_rmaps(kvm))
> return flush;
>
> -   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> +   for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
> slots = __kvm_memslots(kvm, i);
>
> kvm_for_each_memslot_in_gfn_range(&iter, slots, gfn_start, 
> gfn_end) {
> @@ -6791,7 +6791,7 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
>  * modifier prior to checking for a wrap of the MMIO generation so
>  * that a wrap in any address space is detected.
>  */
> -   gen &= ~((u64)KVM_ADDRESS_SPACE_NUM - 1);
> +   gen &= ~((u64)kvm_arch_nr_memslot_as_ids(kvm) - 1);
>
> /*
>  * The very rare case: if the MMIO generation number has wrapped,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 824b58b44382..c4d17727b199 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12456,7 +12456,7 @@ void __user * __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa,
> hva = slot->userspace_addr;
> }
>
> -   for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> +   for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {

Re: [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events

2023-11-02 Thread Fuad Tabba
On Thu, Nov 2, 2023 at 2:41 PM Sean Christopherson  wrote:
>
> On Thu, Nov 02, 2023, Fuad Tabba wrote:
> > Hi,
> >
> > On Fri, Oct 27, 2023 at 7:22 PM Sean Christopherson  
> > wrote:
> > >
> > > Add flags to "struct kvm_gfn_range" to let notifier events target only
> > > shared and only private mappings, and wire up the existing mmu_notifier
> > > events to be shared-only (private memory is never associated with a
> > > userspace virtual address, i.e. can't be reached via mmu_notifiers).
> > >
> > > Add two flags so that KVM can handle the three possibilities (shared,
> > > private, and shared+private) without needing something like a tri-state
> > > enum.
> > >
> > > Link: https://lore.kernel.org/all/zjx0hk+kpqp0k...@google.com
> > > Signed-off-by: Sean Christopherson 
> > > ---
> > >  include/linux/kvm_host.h | 2 ++
> > >  virt/kvm/kvm_main.c  | 7 +++
> > >  2 files changed, 9 insertions(+)
> > >
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 96aa930536b1..89c1a991a3b8 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -263,6 +263,8 @@ struct kvm_gfn_range {
> > > gfn_t start;
> > > gfn_t end;
> > > union kvm_mmu_notifier_arg arg;
> > > +   bool only_private;
> > > +   bool only_shared;
> >
> > If these flags aren't used in this patch series, should this patch be
> > moved to the other series?
>
> If *both* TDX and SNP need this patch, then I think it's probably worth
> applying it now to make their lives easier.  But if only one needs the
> support, then I completely agree this should be punted to whichever series
> needs it (this also came up in v11, but we didn't force the issue).
>
> Mike, Isaku?
>
> > Also, if shared+private is a possibility, doesn't the prefix "only_"
> > confuse things a bit? I.e., what is shared+private, is it when both
> > are 0 or when both are 1? I assume it's the former (both are 0), but
> > it might be clearer.
>
> Heh, I was hoping that "only_private && only_shared" would be obviously
> nonsensical.
>
> The only alternative I can think would be to add an enum, e.g.
>
> enum {
> PROCESS_PRIVATE_AND_SHARED,
> PROCESS_ONLY_PRIVATE,
> PROCESS_ONLY_SHARED,
> };
>
> because every other way of expressing the flags either results in more
> confusion or an unsafe default.  I.e. I want zapping only private or only
> shared to require the caller to explicitly set a non-zero value, which is how
> I ended up with "only_{private,shared}" as opposed to "process_{private,shared}".

I don't have a strong opinion about this. Having an enum looks good to me.

Cheers,
/fuad
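
For what it's worth, a sketch of how a future consumer (e.g. TDX/SNP zapping
code) might read the two flags; zap_private_mappings() and
zap_shared_mappings() are made-up placeholders, not existing KVM functions:

	bool example_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
	{
		bool flush = false;

		/* Both flags set is the nonsensical combination noted above. */
		if (WARN_ON_ONCE(range->only_private && range->only_shared))
			return false;

		if (!range->only_shared)	/* private, or private+shared */
			flush |= zap_private_mappings(kvm, range->start, range->end);

		if (!range->only_private)	/* shared, or private+shared */
			flush |= zap_shared_mappings(kvm, range->start, range->end);

		return flush;
	}

The all-zeroes default thus safely means "process both private and shared".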


Fbdev issue after the drm updates 'drm-next-2023-10-31-1'

2023-11-02 Thread Christian Zigotzky

Hello,

There is an fbdev issue with the virtio-gpu-pci and virtio-vga. (The 
penguins are not displayed at boot time)


Error message:  [    0.889302] virtio-pci :00:02.0: [drm] *ERROR* 
fbdev: Failed to setup generic emulation (ret=-2)


The kernel 6.6 final doesn't have this issue.

Please check the fbdev changes in the drm updates 'drm-next-2023-10-31-1'.

Thanks,
Christian


Re: [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events

2023-11-02 Thread Sean Christopherson
On Thu, Nov 02, 2023, Fuad Tabba wrote:
> Hi,
> 
> On Fri, Oct 27, 2023 at 7:22 PM Sean Christopherson  wrote:
> >
> > Add flags to "struct kvm_gfn_range" to let notifier events target only
> > shared and only private mappings, and wire up the existing mmu_notifier
> > events to be shared-only (private memory is never associated with a
> > userspace virtual address, i.e. can't be reached via mmu_notifiers).
> >
> > Add two flags so that KVM can handle the three possibilities (shared,
> > private, and shared+private) without needing something like a tri-state
> > enum.
> >
> > Link: https://lore.kernel.org/all/zjx0hk+kpqp0k...@google.com
> > Signed-off-by: Sean Christopherson 
> > ---
> >  include/linux/kvm_host.h | 2 ++
> >  virt/kvm/kvm_main.c  | 7 +++
> >  2 files changed, 9 insertions(+)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 96aa930536b1..89c1a991a3b8 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -263,6 +263,8 @@ struct kvm_gfn_range {
> > gfn_t start;
> > gfn_t end;
> > union kvm_mmu_notifier_arg arg;
> > +   bool only_private;
> > +   bool only_shared;
> 
> If these flags aren't used in this patch series, should this patch be
> moved to the other series?

If *both* TDX and SNP need this patch, then I think it's probably worth applying
it now to make their lives easier.  But if only one needs the support, then I
completely agree this should be punted to whichever series needs it (this also
came up in v11, but we didn't force the issue).

Mike, Isaku?

> Also, if shared+private is a possibility, doesn't the prefix "only_"
> confuse things a bit? I.e., what is shared+private, is it when both
> are 0 or when both are 1? I assume it's the former (both are 0), but
> it might be clearer.

Heh, I was hoping that "only_private && only_shared" would be obviously
nonsensical.

The only alternative I can think would be to add an enum, e.g.

enum {
PROCESS_PRIVATE_AND_SHARED,
PROCESS_ONLY_PRIVATE,
PROCESS_ONLY_SHARED,
};

because every other way of expressing the flags either results in more confusion
or an unsafe default.  I.e. I want zapping only private or only shared to require
the caller to explicitly set a non-zero value, which is how I ended up with
"only_{private,shared}" as opposed to "process_{private,shared}".


Re: [PATCH v13 21/35] KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro

2023-11-02 Thread Fuad Tabba
On Fri, Oct 27, 2023 at 7:23 PM Sean Christopherson  wrote:
>
> Drop __KVM_VCPU_MULTIPLE_ADDRESS_SPACE and instead check the value of
> KVM_ADDRESS_SPACE_NUM.
>
> No functional change intended.
>
> Reviewed-by: Paolo Bonzini 
> Signed-off-by: Sean Christopherson 
> ---

Reviewed-by: Fuad Tabba 
Tested-by: Fuad Tabba 

Cheers,
/fuad

>  arch/x86/include/asm/kvm_host.h | 1 -
>  include/linux/kvm_host.h| 2 +-
>  2 files changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 8d60e4745e8b..6702f795c862 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2124,7 +2124,6 @@ enum {
>  #define HF_SMM_MASK(1 << 1)
>  #define HF_SMM_INSIDE_NMI_MASK (1 << 2)
>
> -# define __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
>  # define KVM_ADDRESS_SPACE_NUM 2
>  # define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK ? 1 : 0)
>  # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, (role).smm)
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index e3223cafd7db..c3cfe08b1300 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -692,7 +692,7 @@ bool kvm_arch_irqchip_in_kernel(struct kvm *kvm);
>  #define KVM_MEM_SLOTS_NUM SHRT_MAX
>  #define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS)
>
> -#ifndef __KVM_VCPU_MULTIPLE_ADDRESS_SPACE
> +#if KVM_ADDRESS_SPACE_NUM == 1
>  static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu)
>  {
> return 0;
> --
> 2.42.0.820.g83a721a137-goog
>


Re: [PATCH v13 20/35] KVM: x86/mmu: Handle page fault for private memory

2023-11-02 Thread Fuad Tabba
On Fri, Oct 27, 2023 at 7:23 PM Sean Christopherson  wrote:
>
> From: Chao Peng 
>
> Add support for resolving page faults on guest private memory for VMs
> that differentiate between "shared" and "private" memory.  For such VMs,
> KVM_MEM_PRIVATE memslots can include both fd-based private memory and
> hva-based shared memory, and KVM needs to map in the "correct" variant,
> i.e. KVM needs to map the gfn shared/private as appropriate based on the
> current state of the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE flag.
>
> For AMD's SEV-SNP and Intel's TDX, the guest effectively gets to request
> shared vs. private via a bit in the guest page tables, i.e. what the guest
> wants may conflict with the current memory attributes.  To support such
> "implicit" conversion requests, exit to user with KVM_EXIT_MEMORY_FAULT
> to forward the request to userspace.  Add a new flag for memory faults,
> KVM_MEMORY_EXIT_FLAG_PRIVATE, to communicate whether the guest wants to
> map memory as shared vs. private.
>
> Like KVM_MEMORY_ATTRIBUTE_PRIVATE, use bit 3 for flagging private memory
> so that KVM can use bits 0-2 for capturing RWX behavior if/when userspace
> needs such information, e.g. a likely user of KVM_EXIT_MEMORY_FAULT is to
> exit on missing mappings when handling guest page fault VM-Exits.  In
> that case, userspace will want to know RWX information in order to
> correctly/precisely resolve the fault.
>
> Note, private memory *must* be backed by guest_memfd, i.e. shared mappings
> always come from the host userspace page tables, and private mappings
> always come from a guest_memfd instance.
>
> Co-developed-by: Yu Zhang 
> Signed-off-by: Yu Zhang 
> Signed-off-by: Chao Peng 
> Co-developed-by: Sean Christopherson 
> Signed-off-by: Sean Christopherson 
> ---

With my limited understanding of kvm-x86 mmu code:
Reviewed-by: Fuad Tabba 

Tested the x86 code (as part of this patch series) on qemu. The x86
fault handling code was used as a guide to how it should be done for
pKVM/arm64 (with similar code added there):
Tested-by: Fuad Tabba 

Cheers,
/fuad

>  Documentation/virt/kvm/api.rst  |   8 ++-
>  arch/x86/kvm/mmu/mmu.c  | 101 ++--
>  arch/x86/kvm/mmu/mmu_internal.h |   1 +
>  include/linux/kvm_host.h|   8 ++-
>  include/uapi/linux/kvm.h|   1 +
>  5 files changed, 110 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 7f00c310c24a..38dc1fda4f45 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6837,6 +6837,7 @@ spec refer, https://github.com/riscv/riscv-sbi-doc.
>
> /* KVM_EXIT_MEMORY_FAULT */
> struct {
> +  #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 3)
> __u64 flags;
> __u64 gpa;
> __u64 size;
> @@ -6845,8 +6846,11 @@ spec refer, https://github.com/riscv/riscv-sbi-doc.
>  KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that
>  could not be resolved by KVM.  The 'gpa' and 'size' (in bytes) describe the
>  guest physical address range [gpa, gpa + size) of the fault.  The 'flags' field
> -describes properties of the faulting access that are likely pertinent.
> -Currently, no flags are defined.
> +describes properties of the faulting access that are likely pertinent:
> +
> + - KVM_MEMORY_EXIT_FLAG_PRIVATE - When set, indicates the memory fault 
> occurred
> +   on a private memory access.  When clear, indicates the fault occurred on a
> +   shared access.
>
>  Note!  KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it
>  accompanies a return code of '-1', not '0'!  errno will always be set to EFAULT
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 4167d557c577..c4e758f0aebb 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3147,9 +3147,9 @@ static int host_pfn_mapping_level(struct kvm *kvm, gfn_t gfn,
> return level;
>  }
>
> -int kvm_mmu_max_mapping_level(struct kvm *kvm,
> - const struct kvm_memory_slot *slot, gfn_t gfn,
> - int max_level)
> +static int __kvm_mmu_max_mapping_level(struct kvm *kvm,
> +  const struct kvm_memory_slot *slot,
> +  gfn_t gfn, int max_level, bool is_private)
>  {
> struct kvm_lpage_info *linfo;
> int host_level;
> @@ -3161,6 +3161,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> break;
> }
>
> +   if (is_private)
> +   return max_level;
> +
> if (max_level == PG_LEVEL_4K)
> return PG_LEVEL_4K;
>
> @@ -3168,6 +3171,16 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm,
> return min(host_level, max_level);
>  }
>
> +int kvm_mmu_max_mapping_level(struct kvm *kvm,
> + const st

Re: [PATCH v13 18/35] KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN

2023-11-02 Thread Fuad Tabba
On Fri, Oct 27, 2023 at 7:23 PM Sean Christopherson  wrote:
>
> Initialize run->exit_reason to KVM_EXIT_UNKNOWN early in KVM_RUN to reduce
> the probability of exiting to userspace with a stale run->exit_reason that
> *appears* to be valid.
>
> To support fd-based guest memory (guest memory without a corresponding
> userspace virtual address), KVM will exit to userspace for various memory
> related errors, which userspace *may* be able to resolve, instead of using
> e.g. BUS_MCEERR_AR.  And in the more distant future, KVM will also likely
> utilize the same functionality to let userspace "intercept" and handle
> memory faults when the userspace mapping is missing, i.e. when fast gup()
> fails.
>
> Because many of KVM's internal APIs related to guest memory use '0' to
> indicate "success, continue on" and not "exit to userspace", reporting
> memory faults/errors to userspace will set run->exit_reason and
> corresponding fields in the run structure in conjunction with a non-zero,
> negative return code, e.g. -EFAULT or -EHWPOISON.  And because KVM already
> returns -EFAULT in many paths, there's a relatively high
> probability that KVM could return -EFAULT without setting run->exit_reason,
> in which case reporting KVM_EXIT_UNKNOWN is much better than reporting
> whatever exit reason happened to be in the run structure.
>
> Note, KVM must wait until after run->immediate_exit is serviced to
> sanitize run->exit_reason as KVM's ABI is that run->exit_reason is
> preserved across KVM_RUN when run->immediate_exit is true.
>
> Link: https://lore.kernel.org/all/20230908222905.1321305-1-amoor...@google.com
> Link: https://lore.kernel.org/all/zffbwoxz5ui%2fg...@google.com
> Signed-off-by: Sean Christopherson 
> ---

Reviewed-by: Fuad Tabba 
Tested-by: Fuad Tabba 

Cheers,
/fuad

>  arch/x86/kvm/x86.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ee3cd8c3c0ef..f41dbb1465a0 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10963,6 +10963,7 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
>  {
> int r;
>
> +   vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
> vcpu->arch.l1tf_flush_l1d = true;
>
> for (;;) {
> --
> 2.42.0.820.g83a721a137-goog
>


Re: [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events

2023-11-02 Thread Fuad Tabba
Hi,

On Fri, Oct 27, 2023 at 7:22 PM Sean Christopherson  wrote:
>
> Add flags to "struct kvm_gfn_range" to let notifier events target only
> > shared and only private mappings, and wire up the existing mmu_notifier
> events to be shared-only (private memory is never associated with a
> userspace virtual address, i.e. can't be reached via mmu_notifiers).
>
> Add two flags so that KVM can handle the three possibilities (shared,
> private, and shared+private) without needing something like a tri-state
> enum.
>
> Link: https://lore.kernel.org/all/zjx0hk+kpqp0k...@google.com
> Signed-off-by: Sean Christopherson 
> ---
>  include/linux/kvm_host.h | 2 ++
>  virt/kvm/kvm_main.c  | 7 +++
>  2 files changed, 9 insertions(+)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 96aa930536b1..89c1a991a3b8 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -263,6 +263,8 @@ struct kvm_gfn_range {
> gfn_t start;
> gfn_t end;
> union kvm_mmu_notifier_arg arg;
> +   bool only_private;
> +   bool only_shared;

If these flags aren't used in this patch series, should this patch be
moved to the other series?

Also, if shared+private is a possibility, doesn't the prefix "only_"
confuse things a bit? I.e., what is shared+private, is it when both
are 0 or when both are 1? I assume it's the former (both are 0), but
it might be clearer.

Cheers,
/fuad

> bool may_block;
>  };
>  bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index cb9376833c18..302ccb87b4c1 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -635,6 +635,13 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
>  * the second or later invocation of the handler).
>  */
> gfn_range.arg = range->arg;
> +
> +   /*
> +* HVA-based notifications aren't relevant to private
> +* mappings as they don't have a userspace mapping.
> +*/
> +   gfn_range.only_private = false;
> +   gfn_range.only_shared = true;
> gfn_range.may_block = range->may_block;
>
> /*
> --
> 2.42.0.820.g83a721a137-goog
>


Re: [PATCH v13 11/35] KVM: Drop .on_unlock() mmu_notifier hook

2023-11-02 Thread Fuad Tabba
On Fri, Oct 27, 2023 at 7:22 PM Sean Christopherson  wrote:
>
> Drop the .on_unlock() mmu_notifer hook now that it's no longer used for
> notifying arch code that memory has been reclaimed.  Adding .on_unlock()
> and invoking it *after* dropping mmu_lock was a terrible idea, as doing so
> resulted in .on_lock() and .on_unlock() having divergent and asymmetric
> behavior, and set future developers up for failure, i.e. all but asked for
> bugs where KVM relied on using .on_unlock() to try to run a callback while
> holding mmu_lock.
>
> Opportunistically add a lockdep assertion in kvm_mmu_invalidate_end() to
> guard against future bugs of this nature.
>
> Reported-by: Isaku Yamahata 
> Link: 
> https://lore.kernel.org/all/20230802203119.gb2021...@ls.amr.corp.intel.com
> Signed-off-by: Sean Christopherson 
> ---

Reviewed-by: Fuad Tabba 
Tested-by: Fuad Tabba 

Cheers,
/fuad

>  virt/kvm/kvm_main.c | 13 +++--
>  1 file changed, 3 insertions(+), 10 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 2bc04c8ae1f4..cb9376833c18 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -544,7 +544,6 @@ static inline struct kvm *mmu_notifier_to_kvm(struct mmu_notifier *mn)
>  typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range);
>
>  typedef void (*on_lock_fn_t)(struct kvm *kvm);
> -typedef void (*on_unlock_fn_t)(struct kvm *kvm);
>
>  struct kvm_mmu_notifier_range {
> /*
> @@ -556,7 +555,6 @@ struct kvm_mmu_notifier_range {
> union kvm_mmu_notifier_arg arg;
> gfn_handler_t handler;
> on_lock_fn_t on_lock;
> -   on_unlock_fn_t on_unlock;
> bool flush_on_ret;
> bool may_block;
>  };
> @@ -663,11 +661,8 @@ static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
> if (range->flush_on_ret && r.ret)
> kvm_flush_remote_tlbs(kvm);
>
> -   if (r.found_memslot) {
> +   if (r.found_memslot)
> KVM_MMU_UNLOCK(kvm);
> -   if (!IS_KVM_NULL_FN(range->on_unlock))
> -   range->on_unlock(kvm);
> -   }
>
> srcu_read_unlock(&kvm->srcu, idx);
>
> @@ -687,7 +682,6 @@ static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
> .arg= arg,
> .handler= handler,
> .on_lock= (void *)kvm_null_fn,
> -   .on_unlock  = (void *)kvm_null_fn,
> .flush_on_ret   = true,
> .may_block  = false,
> };
> @@ -706,7 +700,6 @@ static __always_inline int kvm_handle_hva_range_no_flush(struct mmu_notifier *mn
> .end= end,
> .handler= handler,
> .on_lock= (void *)kvm_null_fn,
> -   .on_unlock  = (void *)kvm_null_fn,
> .flush_on_ret   = false,
> .may_block  = false,
> };
> @@ -813,7 +806,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
> .end= range->end,
> .handler= kvm_mmu_unmap_gfn_range,
> .on_lock= kvm_mmu_invalidate_begin,
> -   .on_unlock  = (void *)kvm_null_fn,
> .flush_on_ret   = true,
> .may_block  = mmu_notifier_range_blockable(range),
> };
> @@ -858,6 +850,8 @@ static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
>
>  void kvm_mmu_invalidate_end(struct kvm *kvm)
>  {
> +   lockdep_assert_held_write(&kvm->mmu_lock);
> +
> /*
>  * This sequence increase will notify the kvm page fault that
>  * the page that is going to be mapped in the spte could have
> @@ -889,7 +883,6 @@ static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
> .end= range->end,
> .handler= (void *)kvm_null_fn,
> .on_lock= kvm_mmu_invalidate_end,
> -   .on_unlock  = (void *)kvm_null_fn,
> .flush_on_ret   = false,
> .may_block  = mmu_notifier_range_blockable(range),
> };
> --
> 2.42.0.820.g83a721a137-goog
>


Re: [PATCH v13 10/35] KVM: Add a dedicated mmu_notifier flag for reclaiming freed memory

2023-11-02 Thread Fuad Tabba
On Fri, Oct 27, 2023 at 7:22 PM Sean Christopherson  wrote:
>
> Handle AMD SEV's kvm_arch_guest_memory_reclaimed() hook by having
> __kvm_handle_hva_range() return whether or not an overlapping memslot
> was found, i.e. mmu_lock was acquired.  Using the .on_unlock() hook
> works, but kvm_arch_guest_memory_reclaimed() needs to run after dropping
> mmu_lock, which makes .on_lock() and .on_unlock() asymmetrical.
>
> Use a small struct to return the tuple of the notifier-specific return,
> plus whether or not overlap was found.  Because the iteration helpers are
> __always_inlined, practically speaking, the struct will never actually be
> returned from a function call (not to mention the size of the struct will
> be two bytes in practice).
>
> Signed-off-by: Sean Christopherson 
> ---

Reviewed-by: Fuad Tabba 
Tested-by: Fuad Tabba 

Cheers,
/fuad


>  virt/kvm/kvm_main.c | 53 +++--
>  1 file changed, 37 insertions(+), 16 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 3f5b7c2c5327..2bc04c8ae1f4 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -561,6 +561,19 @@ struct kvm_mmu_notifier_range {
> bool may_block;
>  };
>
> +/*
> + * The inner-most helper returns a tuple containing the return value from the
> + * arch- and action-specific handler, plus a flag indicating whether or not at
> + * least one memslot was found, i.e. if the handler found guest memory.
> + *
> + * Note, most notifiers are averse to booleans, so even though KVM tracks the
> + * return from arch code as a bool, outer helpers will cast it to an int. :-(
> + */
> +typedef struct kvm_mmu_notifier_return {
> +   bool ret;
> +   bool found_memslot;
> +} kvm_mn_ret_t;
> +
>  /*
>   * Use a dedicated stub instead of NULL to indicate that there is no callback
>   * function/handler.  The compiler technically can't guarantee that a real
> @@ -582,22 +595,25 @@ static const union kvm_mmu_notifier_arg KVM_MMU_NOTIFIER_NO_ARG;
>  node;   \
>  node = interval_tree_iter_next(node, start, last))  \
>
> -static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> - const struct kvm_mmu_notifier_range *range)
> +static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm,
> +  const struct kvm_mmu_notifier_range *range)
>  {
> -   bool ret = false, locked = false;
> +   struct kvm_mmu_notifier_return r = {
> +   .ret = false,
> +   .found_memslot = false,
> +   };
> struct kvm_gfn_range gfn_range;
> struct kvm_memory_slot *slot;
> struct kvm_memslots *slots;
> int i, idx;
>
> if (WARN_ON_ONCE(range->end <= range->start))
> -   return 0;
> +   return r;
>
> /* A null handler is allowed if and only if on_lock() is provided. */
> if (WARN_ON_ONCE(IS_KVM_NULL_FN(range->on_lock) &&
>  IS_KVM_NULL_FN(range->handler)))
> -   return 0;
> +   return r;
>
> idx = srcu_read_lock(&kvm->srcu);
>
> @@ -631,8 +647,8 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
> gfn_range.slot = slot;
>
> -   if (!locked) {
> -   locked = true;
> +   if (!r.found_memslot) {
> +   r.found_memslot = true;
> KVM_MMU_LOCK(kvm);
> if (!IS_KVM_NULL_FN(range->on_lock))
> range->on_lock(kvm);
> @@ -640,14 +656,14 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
> if (IS_KVM_NULL_FN(range->handler))
> break;
> }
> -   ret |= range->handler(kvm, &gfn_range);
> +   r.ret |= range->handler(kvm, &gfn_range);
> }
> }
>
> -   if (range->flush_on_ret && ret)
> +   if (range->flush_on_ret && r.ret)
> kvm_flush_remote_tlbs(kvm);
>
> -   if (locked) {
> +   if (r.found_memslot) {
> KVM_MMU_UNLOCK(kvm);
> if (!IS_KVM_NULL_FN(range->on_unlock))
> range->on_unlock(kvm);
> @@ -655,8 +671,7 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
>
> srcu_read_unlock(&kvm->srcu, idx);
>
> -   /* The notifiers are averse to booleans. :-( */
> -   return (int)ret;
> +   return r;
>  }
>
>  static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn,
> @@ -677,7 +692

Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory

2023-11-02 Thread Fuad Tabba
On Wed, Nov 1, 2023 at 9:55 PM Sean Christopherson  wrote:
>
> On Wed, Nov 01, 2023, Fuad Tabba wrote:
> > > > > @@ -1034,6 +1034,9 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
> > > > >  /* This does not remove the slot from struct kvm_memslots data structures */
> > > > >  static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
> > > > >  {
> > > > > +   if (slot->flags & KVM_MEM_PRIVATE)
> > > > > +   kvm_gmem_unbind(slot);
> > > > > +
> > > >
> > > > Should this be called after kvm_arch_free_memslot()? Arch-specific code
> > > > might need some of the data before the unbinding, something I thought
> > > > might be necessary at one point for the pKVM port when deleting a
> > > > memslot, but realized later that kvm_invalidate_memslot() ->
> > > > kvm_arch_guest_memory_reclaimed() was the more logical place for it.
> > > > Also, since that seems to be the pattern for arch-specific handlers in
> > > > KVM.
> > >
> > > Maybe?  But only if we care about symmetry between the allocation and free
> > > paths.  I really don't think kvm_arch_free_memslot() should be doing
> > > anything beyond a "pure" free.  E.g. kvm_arch_free_memslot() is also called
> > > after moving a memslot, which hopefully we never actually have to allow for
> > > guest_memfd, but any code in kvm_arch_free_memslot() would bring about
> > > "what if" questions regarding memslot movement.  I.e. the API is intended
> > > to be a "free arch metadata associated with the memslot".
> > >
> > > Out of curiosity, what does pKVM need to do at 
> > > kvm_arch_guest_memory_reclaimed()?
> >
> > It's about the host reclaiming ownership of guest memory when tearing
> > down a protected guest. In pKVM, we currently teardown the guest and
> > reclaim its memory when kvm_arch_destroy_vm() is called. The problem
> > with guestmem is that kvm_gmem_unbind() could get called before that
> > happens, after which the host might try to access the unbound guest
> > memory. Since the host hasn't reclaimed ownership of the guest memory
> > from hyp, hilarity ensues (it crashes).
> >
> > Initially, I hooked reclaim guest memory to kvm_free_memslot(), but
> > then I needed to move the unbind later in the function. I realized
> > later that kvm_arch_guest_memory_reclaimed() gets called earlier (at
> > the right time), and is more aptly named.
>
> Aha!  I suspected that might be the case.
>
> TDX and SNP also need to solve the same problem of "reclaiming" memory before 
> it
> can be safely accessed by the host.  The plan is to add an arch hook (or two?)
> into guest_memfd that is invoked when memory is freed from guest_memfd.
>
> Hooking kvm_arch_guest_memory_reclaimed() isn't completely correct as deleting
> a memslot doesn't *guarantee* that guest memory is actually reclaimed (which
> reminds me, we need to figure out a better name for that thing before
> introducing kvm_arch_gmem_invalidate()).

I see. I'd assumed that that was what you're using. I agree that it's
not completely correct, so for the moment, I assume that if that
happens we have a misbehaving host, teardown the guest and reclaim its
memory.

> The effective false positives aren't fatal for the current usage because the
> hook is used only for x86 SEV guests to flush caches.  An unnecessary flush
> can cause performance issues, but it doesn't affect correctness.  For TDX and
> SNP, and IIUC pKVM, false positives are fatal because KVM could assign memory
> back to the host that is still owned by guest_memfd.

Yup.

> E.g. a misbehaving userspace could prematurely delete a memslot.  And the more
> fun example is intrahost migration, where the plan is to allow pointing multiple
> guest_memfd files at a single guest_memfd inode:
> https://lore.kernel.org/all/cover.1691446946.git.ackerley...@google.com
>
> There was a lot of discussion for this, but it's scattered all over the place.
> The TL;DR is that the inode will represent physical memory, and a file will
> represent a given "struct kvm" instance's view of that memory.  And so the
> memory isn't reclaimed until the inode is truncated/punched.
>
> I _think_ this reflects the most recent plan from the guest_memfd side:
> https://lore.kernel.org/all/1233d749211c08d51f9ca5d427938d47f008af1f.1689893403.git.isaku.yamah...@intel.com

Thanks for pointing that out. I think this might be the way to go.
I'll have a closer look at this and see how to get it to work with
pKVM.

Cheers,
/fuad
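
For readers following along, a rough sketch of the arch hook Sean alludes to;
the name appears in the thread, but the signature, Kconfig guard and call site
below are guesses, not the eventual implementation:

	#ifdef CONFIG_HAVE_KVM_GMEM_INVALIDATE
	void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end);
	#else
	static inline void kvm_arch_gmem_invalidate(kvm_pfn_t start, kvm_pfn_t end) { }
	#endif

	/*
	 * Invoked from guest_memfd's truncate/free path, i.e. when the memory
	 * is actually reclaimed, not when a memslot is deleted or unbound.
	 */
	static void example_gmem_free_folio(struct folio *folio)
	{
		kvm_pfn_t pfn = page_to_pfn(folio_page(folio, 0));

		kvm_arch_gmem_invalidate(pfn, pfn + folio_nr_pages(folio));
	}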


[RFC PATCH] powerpc/book3s/hash: Drop _PAGE_PRIVILEGED from PAGE_NONE

2023-11-02 Thread Aneesh Kumar K.V
There used to be a dependency on _PAGE_PRIVILEGED with pte_savedwrite.
But that got dropped by
commit 6a56ccbcf6c6 ("mm/autonuma: use can_change_(pte|pmd)_writable() to 
replace savedwrite")

With this change numa fault ptes (pte_protnone()) get mapped as regular user ptes
with RWX cleared (no-access).  This also removes pte_user() from
book3s/64.

pte_access_permitted() now checks for _PAGE_EXEC because we now support
EXECONLY mappings.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 23 +---
 arch/powerpc/mm/book3s64/hash_utils.c| 17 +++
 2 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index cb77eddca54b..7c7de7b56df0 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -17,12 +17,6 @@
 #define _PAGE_EXEC 0x1 /* execute permission */
 #define _PAGE_WRITE0x2 /* write access allowed */
 #define _PAGE_READ 0x4 /* read access allowed */
-#define _PAGE_NA   _PAGE_PRIVILEGED
-#define _PAGE_NAX  _PAGE_EXEC
-#define _PAGE_RO   _PAGE_READ
-#define _PAGE_ROX  (_PAGE_READ | _PAGE_EXEC)
-#define _PAGE_RW   (_PAGE_READ | _PAGE_WRITE)
-#define _PAGE_RWX  (_PAGE_READ | _PAGE_WRITE | _PAGE_EXEC)
 #define _PAGE_PRIVILEGED   0x8 /* kernel access only */
 #define _PAGE_SAO  0x00010 /* Strong access order */
 #define _PAGE_NON_IDEMPOTENT   0x00020 /* non idempotent memory */
@@ -119,9 +113,9 @@
 /*
  * user access blocked by key
  */
-#define _PAGE_KERNEL_RW		(_PAGE_PRIVILEGED | _PAGE_RW | _PAGE_DIRTY)
 #define _PAGE_KERNEL_RO		(_PAGE_PRIVILEGED | _PAGE_READ)
 #define _PAGE_KERNEL_ROX	(_PAGE_PRIVILEGED | _PAGE_READ | _PAGE_EXEC)
+#define _PAGE_KERNEL_RW		(_PAGE_PRIVILEGED | _PAGE_RW | _PAGE_DIRTY)
 #define _PAGE_KERNEL_RWX	(_PAGE_PRIVILEGED | _PAGE_DIRTY | _PAGE_RW | _PAGE_EXEC)
 /*
  * _PAGE_CHG_MASK masks of bits that are to be preserved across
@@ -523,19 +517,14 @@ static inline bool arch_pte_access_permitted(u64 pte, bool write, bool execute)
 }
 #endif /* CONFIG_PPC_MEM_KEYS */
 
-static inline bool pte_user(pte_t pte)
-{
-   return !(pte_raw(pte) & cpu_to_be64(_PAGE_PRIVILEGED));
-}
-
 #define pte_access_permitted pte_access_permitted
 static inline bool pte_access_permitted(pte_t pte, bool write)
 {
-   /*
-* _PAGE_READ is needed for any access and will be
-* cleared for PROT_NONE
-*/
-   if (!pte_present(pte) || !pte_user(pte) || !pte_read(pte))
+
+   if (!pte_present(pte))
+   return false;
+
+   if (!(pte_read(pte) || pte_exec(pte)))
return false;
 
if (write && !pte_write(pte))
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index ad2afa08e62e..b2eda22195f0 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -310,9 +310,26 @@ unsigned long htab_convert_pte_flags(unsigned long pteflags, unsigned long flags
else
rflags |= 0x3;
}
+   WARN_ON(!(pteflags & _PAGE_RWX));
} else {
if (pteflags & _PAGE_RWX)
rflags |= 0x2;
+		else {
+			/*
+			 * PAGE_NONE will get mapped to 0b110 (slb key 1 no access)
+			 * We picked 0b110 instead of 0b000 so that slb key 0 will
+			 * get only read only access for the same rflags.
+			 */
+			if (mmu_has_feature(MMU_FTR_KERNEL_RO))
+				rflags |= (HPTE_R_PP0 | 0x2);
+			/*
+			 * rflags = HPTE_R_N
+			 * Without KERNEL_RO feature this will result in slb
+			 * key 0 with read/write. But ISA only supports that.
+			 * There is no key 1 no-access and key 0 read-only
+			 * pp bit support.
+			 */
+		}
if (!((pteflags & _PAGE_WRITE) && (pteflags & _PAGE_DIRTY)))
rflags |= 0x1;
}
-- 
2.41.0



Re: [PATCH v2] powerpc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes

2023-11-02 Thread Michael Ellerman
Matthew Wilcox  writes:
> On Tue, Oct 24, 2023 at 08:06:04PM +0530, Aneesh Kumar K.V wrote:
>>  ptep++;
>> -pte = __pte(pte_val(pte) + (1UL << PTE_RPN_SHIFT));
>>  addr += PAGE_SIZE;
>> +/*
>> + * increment the pfn.
>> + */
>> +pte = pfn_pte(pte_pfn(pte) + 1, pte_pgprot((pte)));
>
> when i looked at this, it generated shit code.  did you check?

I didn't look ...



It's not super clear cut. There's some difference because pfn_pte()
contains two extra VM_BUG_ONs.

But with DEBUG_VM *off* the version using pfn_pte() generates *better*
code, or at least less code, ~160 instructions vs ~200.

For some reason the version using PTE_RPN_SHIFT seems to be byte
swapping the pte an extra two times, each of which generates ~8
instructions. But I can't see why.

I tried a few other things and couldn't come up with anything that
generated better code. But I'll keep poking at it tomorrow.

cheers


Re: [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events

2023-11-02 Thread Paolo Bonzini

On 11/2/23 06:59, Binbin Wu wrote:



Add flags to "struct kvm_gfn_range" to let notifier events target
only shared and only private mappings, and write up the existing
mmu_notifier events to be shared-only (private memory is never
associated with a userspace virtual address, i.e. can't be reached
via mmu_notifiers).

Add two flags so that KVM can handle the three possibilities
(shared, private, and shared+private) without needing something
like a tri-state enum.


I see the two flags are set/cleared in __kvm_handle_hva_range() in
this patch and kvm_handle_gfn_range() from the later patch 13/35, but
I didn't see they are used/read in this patch series if I didn't miss
anything.  How are they supposed to be used in KVM?


They are going to be used by SNP/TDX patches.

Paolo



Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace

2023-11-02 Thread Paolo Bonzini

On 11/2/23 10:35, Huang, Kai wrote:

> IIUC KVM can already handle the case of poisoned
> page by sending signal to user app:
>
> 	static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu,
> 			struct kvm_page_fault *fault)
> 	{
> 		...
>
> 		if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
> 			kvm_send_hwpoison_signal(fault->slot, fault->gfn);
> 			return RET_PF_RETRY;
> 		}
> 	}


EHWPOISON is not implemented by this series, so it should be left out of 
the documentation.




> Currently as mentioned above when
> vepc fault handler cannot allocate EPC page KVM returns -EFAULT to Qemu, and
> Qemu prints ...
>
> 	...: Bad address
>
> ... which is nonsense.
>
> If we can use memory_fault.flags (or is 'fault_reason' a better name?) to carry
> a specific value for EPC to let Qemu know and Qemu can then do more reasonable
> things.


Yes, that's a good idea that can be implemented on top.

Paolo



Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace

2023-11-02 Thread Paolo Bonzini

On 11/1/23 18:36, Sean Christopherson wrote:

> A good example is KVM_RUN with -EINTR; if KVM were to return something other
> than -EINTR on a pending signal or vcpu->run->immediate_exit, userspace would
> fall over.

And dually if KVM were to return KVM_EXIT_INTR together with something
other than -EINTR.

> And purging exit_reason super early is subtly tricky because KVM's
> (again, poorly documented) ABI is that *some* exit reasons are preserved
> across KVM_RUN with vcpu->run->immediate_exit (or with a pending
> signal). https://lore.kernel.org/all/zffbwoxz5ui%2fg...@google.com

vcpu->run->immediate_exit preserves all exit reasons, but it's not a
good idea that immediate_exit behaves differently from a pending signal on
entry to KVM_RUN (remember that immediate_exit was meant to be a better
performing alternative to KVM_SET_SIGNAL_MASK).

In principle, vcpu->run->immediate_exit could return KVM_EXIT_INTR
(perhaps even _should_, except that breaks selftests so at this point it
is ABI).


Paolo
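
As a concrete illustration of the current contract (a userspace sketch, not
code from the series):

	#include <assert.h>
	#include <errno.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	static void immediate_exit_demo(int vcpu_fd, struct kvm_run *run)
	{
		__u32 prev_exit_reason = run->exit_reason;
		int ret;

		run->immediate_exit = 1;
		ret = ioctl(vcpu_fd, KVM_RUN, 0);
		run->immediate_exit = 0;

		/* KVM_RUN bails with -EINTR without entering the guest... */
		assert(ret == -1 && errno == EINTR);

		/*
		 * ...and, per the current ABI, exit_reason is preserved rather
		 * than being overwritten with KVM_EXIT_INTR.
		 */
		assert(run->exit_reason == prev_exit_reason);
	}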



Re: [PATCH v13 13/35] KVM: Introduce per-page memory attributes

2023-11-02 Thread Huang, Kai
On Thu, 2023-11-02 at 11:32 +0100, Paolo Bonzini wrote:
> > > +#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> > > +static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, 
> > > gfn_t gfn)
> > > +{
> > > + return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
> > > +}
> > 
> > Only call xa_to_value() when xa_load() returns !NULL?
> 
> This xarray does not store a pointer, therefore xa_load() actually 
> returns an integer that is tagged with 1 in the low bit:
> 
> static inline unsigned long xa_to_value(const void *entry)
> {
>  return (unsigned long)entry >> 1;
> }
> 
> Returning zero for an empty entry is okay, so the result of xa_load() 
> can be used directly.

Thanks for explaining.  I was thinking perhaps it's better to do:

void *entry = xa_load(...);

return xa_is_value(entry) ? xa_to_value(entry) : 0;

But "NULL (0) >> 1" is still 0, so yes we can use directly.


Re: [PATCH v13 13/35] KVM: Introduce per-page memory attributes

2023-11-02 Thread Paolo Bonzini

On 11/2/23 04:01, Huang, Kai wrote:

> On Fri, 2023-10-27 at 11:21 -0700, Sean Christopherson wrote:
> > From: Chao Peng 
> > 
> > In confidential computing usages, whether a page is private or shared is
> > necessary information for KVM to perform operations like page fault
> > handling, page zapping etc. There are other potential use cases for
> > per-page memory attributes, e.g. to make memory read-only (or no-exec,
> > or exec-only, etc.) without having to modify memslots.
> > 
> > Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to allow
> > userspace to operate on the per-page memory attributes.
> >   - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
> >     a guest memory range.
> >   - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
> >     memory attributes.
> > 
> > Use an xarray to store the per-page attributes internally, with a naive,
> > not fully optimized implementation, i.e. prioritize correctness over
> > performance for the initial implementation.
> > 
> > Use bit 3 for the PRIVATE attribute so that KVM can use bits 0-2 for RWX
> > attributes/protections in the future, e.g. to give userspace fine-grained
> > control over read, write, and execute protections for guest memory.
> > 
> > Provide arch hooks for handling attribute changes before and after common
> > code sets the new attributes, e.g. x86 will use the "pre" hook to zap all
> > relevant mappings, and the "post" hook to track whether or not hugepages
> > can be used to map the range.
> > 
> > To simplify the implementation wrap the entire sequence with
> > kvm_mmu_invalidate_{begin,end}() even though the operation isn't strictly
> > guaranteed to be an invalidation.  For the initial use case, x86 *will*
> > always invalidate memory, and preventing arch code from creating new
> > mappings while the attributes are in flux makes it much easier to reason
> > about the correctness of consuming attributes.
> > 
> > It's possible that future usages may not require an invalidation, e.g.
> > if KVM ends up supporting RWX protections and userspace grants _more_
> > protections, but again opt for simplicity and punt optimizations to
> > if/when they are needed.
> > 
> > Suggested-by: Sean Christopherson 
> > Link: https://lore.kernel.org/all/y2wb48kd0j4vg...@google.com
> > Cc: Fuad Tabba 
> > Cc: Xu Yilun 
> > Cc: Mickaël Salaün 
> > Signed-off-by: Chao Peng 
> > Co-developed-by: Sean Christopherson 
> > Signed-off-by: Sean Christopherson 
> 
> [...]
> 
> > +Note, there is no "get" API.  Userspace is responsible for explicitly tracking
> > +the state of a gfn/page as needed.
> > +
> 
> [...]
> 
> > +#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> > +static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
> > +{
> > +   return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
> > +}
> 
> Only call xa_to_value() when xa_load() returns !NULL?


This xarray does not store a pointer, therefore xa_load() actually 
returns an integer that is tagged with 1 in the low bit:


static inline unsigned long xa_to_value(const void *entry)
{
return (unsigned long)entry >> 1;
}

Returning zero for an empty entry is okay, so the result of xa_load() 
can be used directly.




> > +
> > +bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
> > +unsigned long attrs);
> 
> Seems it's not immediately clear why this function is needed in this patch,
> especially when you said there is no "get" API above.  Add some material to
> changelog?


It's used by later patches; even without a "get" API, it's a pretty 
fundamental functionality.



> > +bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
> > +   struct kvm_gfn_range *range);
> > +bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
> > +struct kvm_gfn_range *range);
> 
> Looks like if this Kconfig is on, the above two arch hooks won't have an
> implementation.
> 
> Is it better to have two __weak empty versions here in this patch?
> 
> Anyway, from the changelog it seems it's not mandatory for some ARCH to provide
> the above two if one wants to turn this on, i.e., the two hooks can be empty and
> the ARCH can just use the __weak version.


I think this can be added by the first arch that needs memory attributes 
and also doesn't need one of these hooks.  Or perhaps the x86 
kvm_arch_pre_set_memory_attributes() could be made generic and thus that 
would be the __weak version.  It's too early to tell, so it's better to 
leave the implementation to the architectures for now.


Paolo



Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace

2023-11-02 Thread Huang, Kai
On Thu, 2023-11-02 at 03:17 +, Huang, Kai wrote:
> On Wed, 2023-11-01 at 10:36 -0700, Sean Christopherson wrote:
> > On Wed, Nov 01, 2023, Kai Huang wrote:
> > > 
> > > > +7.34 KVM_CAP_MEMORY_FAULT_INFO
> > > > +------------------------------
> > > > +
> > > > +:Architectures: x86
> > > > +:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
> > > > +
> > > > +The presence of this capability indicates that KVM_RUN will fill
> > > > +kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
> > > > +there is a valid memslot but no backing VMA for the corresponding host virtual
> > > > +address.
> > > > +
> > > > +The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
> > > > +an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
> > > > +to KVM_EXIT_MEMORY_FAULT.
> > > 
> > > IIUC, returning -EFAULT or whatever -errno is sort of a KVM internal
> > > implementation detail.
> > 
> > The errno that is returned to userspace is ABI.  In KVM, it's a _very_ poorly
> > defined ABI for the vast majority of ioctls(), but it's still technically ABI.
> > KVM gets away with being cavalier with errno because the vast majority of errors
> > are considered fatal by userspace, i.e. in most cases, userspace simply doesn't
> > care about the exact errno.
> > 
> > A good example is KVM_RUN with -EINTR; if KVM were to return something other than
> > -EINTR on a pending signal or vcpu->run->immediate_exit, userspace would fall over.
> > 
> > > Is it better to relax the validity of kvm_run.memory_fault when
> > > KVM_RUN returns any -errno?
> > 
> > Not unless there's a need to do so, and if there is then we can update the
> > documentation accordingly.  If KVM's ABI is that kvm_run.memory_fault is valid
> > for any errno, then KVM would need to purge kvm_run.exit_reason super early in
> > KVM_RUN, e.g. to prevent an -EINTR return due to immediate_exit from being
> > misinterpreted as KVM_EXIT_MEMORY_FAULT.  And purging exit_reason super early is
> > subtly tricky because KVM's (again, poorly documented) ABI is that *some* exit
> > reasons are preserved across KVM_RUN with vcpu->run->immediate_exit (or with a
> > pending signal).
> > 
> > https://lore.kernel.org/all/zffbwoxz5ui%2fg...@google.com
> > 
> > 
> 
> Agreed on not relaxing this to any errno.  However, using -EFAULT as part of
> the ABI definition seems a little dangerous, e.g. someone could accidentally
> or mistakenly return -EFAULT in KVM_RUN early on and/or in a completely
> different code path, etc.  -EINTR has a well-defined meaning, but -EFAULT
> (which is "Bad address") seemingly doesn't, though I am not sure either. :-)
> 
> One example is, for a backing VMA with VM_IO | VM_PFNMAP, hva_to_pfn() returns
> KVM_PFN_ERR_FAULT when the kernel cannot get a valid PFN (e.g. when the SGX
> vepc fault handler fails to allocate EPC) and kvm_handle_error_pfn() will just
> return -EFAULT.  If kvm_run.exit_reason isn't purged early, is it possible
> to have some issue here?
> 

Also, regardless of whether -EFAULT is too ambiguous to be part of the ABI,
could you elaborate on the EHWPOISON part?  IIUC KVM can already handle the
case of a poisoned page by sending a signal to the user app:

static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu,
				struct kvm_page_fault *fault)
{
	...

	if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
		kvm_send_hwpoison_signal(fault->slot, fault->gfn);
		return RET_PF_RETRY;
	}
}

And (sorry to hijack) I am wondering whether the "SGX vepc unable to allocate
EPC" case can also use this memory_fault mechanism.  Currently, as mentioned
above, when the vepc fault handler cannot allocate an EPC page, KVM returns
-EFAULT to Qemu, and Qemu prints ...

...: Bad address

... which is nonsense.

If we could use memory_fault.flags (or is 'fault_reason' a better name?) to
carry a specific value for EPC, Qemu would know what actually happened and
could do something more reasonable.
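
A hypothetical userspace-side sketch of what that could enable, assuming the
documented contract above; KVM_MEMORY_FAULT_FLAG_SGX_EPC and
handle_memory_fault() are invented placeholders, not real API:

static void vcpu_run_once(int vcpu_fd, struct kvm_run *run)
{
	int ret = ioctl(vcpu_fd, KVM_RUN, 0);

	if (ret < 0 && (errno == EFAULT || errno == EHWPOISON) &&
	    run->exit_reason == KVM_EXIT_MEMORY_FAULT) {
		if (run->memory_fault.flags & KVM_MEMORY_FAULT_FLAG_SGX_EPC)
			fprintf(stderr, "vcpu: out of SGX EPC\n"); /* not "Bad address" */
		else
			handle_memory_fault(run->memory_fault.gpa,
					    run->memory_fault.size);
	}
}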



Re: Does anyone use Appletalk?

2023-11-02 Thread John Paul Adrian Glaubitz
On Thu, 2023-11-02 at 13:13 +1100, Finn Thain wrote:
> So I can't object to the removal of the localtalk code. But I do object to 
> the underhand manner in which it is done.

I agree. I have the impression that the actual users of the affected code are
never asked. It's usually a question posed on a mailing list where 99% of the
affected users don't hang around.

Naturally, there won't be any objections to the removal because affected users
didn't receive a heads-up in the first place.

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-GPG: 62FF 8A75 84E0 2956 9546  0006 7426 3B37 F5B5 F913


Re: [PATCH 1/2] kexec: fix KEXEC_FILE dependencies

2023-11-02 Thread Baoquan He
Hi Arnd,

On 10/24/23 at 03:17pm, Arnd Bergmann wrote:
> On Tue, Oct 24, 2023, at 14:44, Baoquan He wrote:
> > Just adding people and the mailing list to CC since I didn't find this mail
> > in my inbox; I just pulled it in via 'b4 am'.
> >
> > On 10/23/23 at 01:01pm, Arnd Bergmann wrote:
> > ..
> 
> >> diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
> >> index 7aff28ded2f48..bfc636d64ff2b 100644
> >> --- a/kernel/Kconfig.kexec
> >> +++ b/kernel/Kconfig.kexec
> >> @@ -36,6 +36,7 @@ config KEXEC
> >>  config KEXEC_FILE
> >>bool "Enable kexec file based system call"
> >>depends on ARCH_SUPPORTS_KEXEC_FILE
> >> +  depends on CRYPTO_SHA256=y || !ARCH_SUPPORTS_KEXEC_PURGATORY
> >
> > I am not sure the logic is correct. In theory, the kexec_file code
> > utilizes the purgatory to verify the checksum digested during kernel
> > loading when trying to jump to the kernel. That means kexec_file depends
> > on purgatory, not the other way around?
> 
> The expression I wrote is a bit confusing, but I think this just
> keeps the existing behavior:
> 
> - on architectures that select ARCH_SUPPORTS_KEXEC_PURGATORY
>   (powerpc, riscv, s390 and x86), we also require CRYPTO_SHA256
>   to be built-in.
> - on architectures that do not have ARCH_SUPPORTS_KEXEC_PURGATORY
>   (arm64 and parisc), CRYPTO_SHA256 is not used and can be disabled
>   or =m.
> 
> Since ARCH_SUPPORTS_KEXEC_PURGATORY is a 'bool' symbol, it could
> be written as
> 
> depends on (ARCH_SUPPORTS_KEXEC_PURGATORY && CRYPTO_SHA256=y) \
>|| !ARCH_SUPPORTS_KEXEC_PURGATORY
> 
> if you find that clearer. I see that the second patch
> actually gets this wrong; it should instead do
> 
> select CRYPTO if ARCH_SUPPORTS_KEXEC_PURGATORY
> select CRYPTO_SHA256 if ARCH_SUPPORTS_KEXEC_PURGATORY
> 
> > With these changes, we can achieve the goal of avoiding the build issue,
> > but the code logic becomes confusing. E.g. people could disable
> > CONFIG_KEXEC_FILE but still get the purgatory code built, which is totally
> > useless.
> >
> > Not sure if I'm overthinking this.
> 
> I see your point here, and I would suggest changing the
> CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY symbol to just indicate
> the availability of the purgatory code for the arch, rather
> than actually controlling the code itself. I already mentioned
> this for s390, but riscv would need the same thing on top.
> 
> I think the change below should address your concern.

Since there are no new comments, do you mind spinning a v2 to wrap all this up?

Thanks
Baoquan

> 
>  Arnd
> 
> diff --git a/arch/riscv/kernel/elf_kexec.c b/arch/riscv/kernel/elf_kexec.c
> index e60fbd8660c4..3ac341d296db 100644
> --- a/arch/riscv/kernel/elf_kexec.c
> +++ b/arch/riscv/kernel/elf_kexec.c
> @@ -266,7 +266,7 @@ static void *elf_kexec_load(struct kimage *image, char *kernel_buf,
> cmdline = modified_cmdline;
> }
>  
> -#ifdef CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY
> +#ifdef CONFIG_KEXEC_FILE
> /* Add purgatory to the image */
> kbuf.top_down = true;
> kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
> @@ -280,7 +280,7 @@ static void *elf_kexec_load(struct kimage *image, char *kernel_buf,
>  sizeof(kernel_start), 0);
> if (ret)
> pr_err("Error update purgatory ret=%d\n", ret);
> -#endif /* CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY */
> +#endif /* CONFIG_KEXEC_FILE */
>  
> /* Add the initrd to the image */
> if (initrd != NULL) {
> diff --git a/arch/riscv/Kbuild b/arch/riscv/Kbuild
> index d25ad1c19f88..ab181d187c23 100644
> --- a/arch/riscv/Kbuild
> +++ b/arch/riscv/Kbuild
> @@ -5,7 +5,7 @@ obj-$(CONFIG_BUILTIN_DTB) += boot/dts/
>  obj-y += errata/
>  obj-$(CONFIG_KVM) += kvm/
>  
> -obj-$(CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY) += purgatory/
> +obj-$(CONFIG_KEXEC_FILE) += purgatory/
>  
>  # for cleaning
>  subdir- += boot
> diff --git a/arch/s390/Kbuild b/arch/s390/Kbuild
> index a5d3503b353c..361aa01dbd49 100644
> --- a/arch/s390/Kbuild
> +++ b/arch/s390/Kbuild
> obj-$(CONFIG_S390_HYPFS)	+= hypfs/
>  obj-$(CONFIG_APPLDATA_BASE)+= appldata/
>  obj-y  += net/
>  obj-$(CONFIG_PCI)  += pci/
> -obj-$(CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY) += purgatory/
> +obj-$(CONFIG_KEXEC_FILE)   += purgatory/
>  
>  # for cleaning
>  subdir- += boot tools
>