[GIT PULL] Please pull powerpc/linux.git powerpc-6.7-1 tag
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Hi Linus,

Please pull powerpc updates for 6.7.

There's one conflict, where we added Documentation/powerpc/kvm-nested.rst
while upstream moved Documentation/powerpc to Documentation/arch/powerpc.
Resolution is just to add it in the new location.

cheers

The following changes since commit ce9ecca0238b140b88f43859b211c9fdfd8e5b70:

  Linux 6.6-rc2 (2023-09-17 14:40:24 -0700)

are available in the git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git tags/powerpc-6.7-1

for you to fetch changes up to 303d77a6e1707498f09c9d8ee91b1dc07ca315a5:

  Merge branch 'topic/ppc-kvm' into next (2023-10-27 20:58:03 +1100)

- ----------------------------------------------------------------
powerpc updates for 6.7

 - Add support for KVM running as a nested hypervisor under development
   versions of PowerVM, using the new PAPR nested virtualisation API.

 - Add support for the BPF prog pack allocator.

 - A rework of the non-server MMU handling to support execute-only on all
   platforms.

 - Some optimisations & cleanups for the powerpc qspinlock code.

 - Various other small features and fixes.

Thanks to: Aboorva Devarajan, Aditya Gupta, Amit Machhiwal, Benjamin Gray,
Christophe Leroy, Dr. David Alan Gilbert, Gaurav Batra, Gautam Menghani,
Geert Uytterhoeven, Haren Myneni, Hari Bathini, Joel Stanley, Jordan Niethe,
Julia Lawall, Kautuk Consul, Kuan-Wei Chiu, Michael Neuling, Minjie Du,
Muhammad Muzammil, Naveen N Rao, Nicholas Piggin, Nick Child, Nysal Jan K.A,
Peter Lafreniere, Rob Herring, Sachin Sant, Sebastian Andrzej Siewior,
Shrikanth Hegde, Srikar Dronamraju, Stanislav Kinsburskii, Vaibhav Jain,
Wang Yufen, Yang Yingliang, Yuan Tan.
- ----------------------------------------------------------------
Aditya Gupta (2):
      powerpc: add `cur_cpu_spec` symbol to vmcoreinfo
      powerpc/vmcore: Add MMU information to vmcoreinfo

Benjamin Gray (12):
      powerpc/configs: Set more PPC debug configs
      powerpc/xive: Fix endian conversion size
      powerpc: Explicitly reverse bytes when checking for byte reversal
      powerpc: Use NULL instead of 0 for null pointers
      powerpc: Remove extern from function implementations
      powerpc: Annotate endianness of various variables and functions
      powerpc/kvm: Force cast endianness of KVM shared regs
      powerpc/opal: Annotate out param endianness
      powerpc/uaccess: Cast away __user annotation after verification
      powerpc: Cast away __iomem in low level IO routines
      powerpc/eeh: Remove unnecessary cast
      powerpc/fadump: Annotate endianness cast with __force

Christophe Leroy (38):
      powerpc: Only define __parse_fpscr() when required
      powerpc/40x: Remove stale PTE_ATOMIC_UPDATES macro
      powerpc: Remove pte_ERROR()
      powerpc: Deduplicate prototypes of ptep_set_access_flags() and phys_mem_access_prot()
      powerpc: Refactor update_mmu_cache_range()
      powerpc: Untangle fixmap.h and pgtable.h and mmu.h
      powerpc/nohash: Remove {pte/pmd}_protnone()
      powerpc/nohash: Refactor declaration of {map/unmap}_kernel_page()
      powerpc/nohash: Move 8xx version of pte_update() into pte-8xx.h
      powerpc/nohash: Replace #ifdef CONFIG_44x by IS_ENABLED(CONFIG_44x) in pgtable.h
      powerpc/nohash: Refactor pte_update()
      powerpc/nohash: Refactor checking of no-change in pte_update()
      powerpc/nohash: Deduplicate _PAGE_CHG_MASK
      powerpc/nohash: Deduplicate pte helpers
      powerpc/nohash: Refactor ptep_test_and_clear_young()
      powerpc/nohash: Deduplicate ptep_set_wrprotect() and ptep_get_and_clear()
      powerpc/nohash: Refactor pte_clear()
      powerpc/nohash: Refactor __ptep_set_access_flags()
      powerpc/e500: Simplify pte_mkexec()
      powerpc: Implement and use pgprot_nx()
      powerpc: Fail ioremap() instead of silently ignoring flags when PAGE_USER is set
      powerpc: Remove pte_mkuser() and pte_mkpriviledged()
      powerpc: Rely on address instead of pte_user()
      powerpc: Refactor permission masks used for __P/__S table and kernel memory flags
      powerpc/8xx: Use generic permission masks
      powerpc/64s: Use generic permission masks
      powerpc/nohash: Add _PAGE_WRITE to supplement _PAGE_RW
      powerpc/nohash: Replace pte_user() by pte_read()
      powerpc/e500: Introduce _PAGE_READ and remove _PAGE_USER
      powerpc/44x: Introduce _PAGE_READ and remove _PAGE_USER
      powerpc/40x: Introduce _PAGE_READ and remove _PAGE_USER
      powerpc/32s: Add _PAGE_WRITE to supplement _PAGE_RW
      powerpc/32s: Introduce _PAGE_READ and remove _PAGE_USER
      powerpc/ptdump: Display _PAGE_READ and _PAGE_WRITE
      powerpc: Finally remove _PAGE_USER
      powerpc: Support execute-only on all powerpc
      powerpc: Remove cpm_dp...() macros
      powerpc/code-patching: Perform hwsync in __patch_instruction() in case of failure

Dr. David Alan Gilbert (1):
      powerpc: Use shared font data

Gaura
Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
On Fri, Oct 27, 2023 at 11:21:51AM -0700, Sean Christopherson wrote: > --- a/Documentation/virt/kvm/api.rst > +++ b/Documentation/virt/kvm/api.rst > @@ -6723,6 +6723,26 @@ array field represents return values. The userspace > should update the return > values of SBI call before resuming the VCPU. For more details on RISC-V SBI > spec refer, https://github.com/riscv/riscv-sbi-doc. > > +:: > + > + /* KVM_EXIT_MEMORY_FAULT */ > + struct { > + __u64 flags; > + __u64 gpa; > + __u64 size; > + } memory; ^ Should update to "memory_fault" to align with other places. [...] > @@ -520,6 +521,12 @@ struct kvm_run { > #define KVM_NOTIFY_CONTEXT_INVALID (1 << 0) > __u32 flags; > } notify; > + /* KVM_EXIT_MEMORY_FAULT */ > + struct { > + __u64 flags; > + __u64 gpa; > + __u64 size; > + } memory_fault; > /* Fix the size of the union. */ > char padding[256]; > }; Thanks, Yilun >
Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
On Thu, 2023-11-02 at 08:44 -0700, Sean Christopherson wrote: > On Thu, Nov 02, 2023, Paolo Bonzini wrote: > > On 11/2/23 10:35, Huang, Kai wrote: > > > IIUC KVM can already handle the case of poisoned > > > page by sending signal to user app: > > > > > > static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, > > > struct > > > kvm_page_fault *fault) > > > { > > > ... > > > > > > if (fault->pfn == KVM_PFN_ERR_HWPOISON) { > > > kvm_send_hwpoison_signal(fault->slot, > > > fault->gfn); > > No, this doesn't work, because that signals the host virtual address Ah, right :-)
Re: [PATCH v3 4/4] PCI: layerscape: Add suspend/resume for ls1043a
On Thu, Nov 02, 2023 at 11:09:00PM +0530, Manivannan Sadhasivam wrote: > On Tue, Oct 17, 2023 at 03:31:45PM -0400, Frank Li wrote: > > ls1043a add suspend/resume support. > > Implement ls1043a_pcie_send_turnoff_msg() to send PME_Turn_Off message. > > Implement ls1043a_pcie_exit_from_l2() to exit from L2 state. > > > > Please use the suggestion I gave in patch 2/4. > > > Signed-off-by: Frank Li > > --- > > > > Notes: > > Change from v2 to v3 > > - Remove ls_pcie_lut_readl(writel) function > > > > Change from v1 to v2 > > - Update subject 'a' to 'A' > > > > drivers/pci/controller/dwc/pci-layerscape.c | 86 - > > 1 file changed, 85 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/pci/controller/dwc/pci-layerscape.c > > b/drivers/pci/controller/dwc/pci-layerscape.c > > index 4b663b20d8612..9656224960b0c 100644 > > --- a/drivers/pci/controller/dwc/pci-layerscape.c > > +++ b/drivers/pci/controller/dwc/pci-layerscape.c > > @@ -41,6 +41,15 @@ > > #define SCFG_PEXSFTRSTCR 0x190 > > #define PEXSR(idx) BIT(idx) > > > > +/* LS1043A PEX PME control register */ > > +#define SCFG_PEXPMECR 0x144 > > +#define PEXPME(idx)BIT(31 - (idx) * 4) > > + > > +/* LS1043A PEX LUT debug register */ > > +#define LS_PCIE_LDBG 0x7fc > > +#define LDBG_SRBIT(30) > > +#define LDBG_WEBIT(31) > > + > > #define PCIE_IATU_NUM 6 > > > > #define LS_PCIE_DRV_SCFG BIT(0) > > @@ -227,6 +236,68 @@ static int ls1021a_pcie_exit_from_l2(struct dw_pcie_rp > > *pp) > > return 0; > > } > > > > +static void ls1043a_pcie_send_turnoff_msg(struct dw_pcie_rp *pp) > > +{ > > + struct dw_pcie *pci = to_dw_pcie_from_pp(pp); > > + struct ls_pcie *pcie = to_ls_pcie(pci); > > + u32 val; > > + > > + if (!pcie->scfg) { > > + dev_dbg(pcie->pci->dev, "SYSCFG is NULL\n"); > > + return; > > + } > > Why scfg is optional for this SoC and not for the other one added in patch > 2/4? No, it is not optional for this SoC. This check can be removed as your previous comments about 2/4. 
> > > + > > + /* Send Turn_off message */ > > + regmap_read(pcie->scfg, SCFG_PEXPMECR, &val); > > + val |= PEXPME(pcie->index); > > + regmap_write(pcie->scfg, SCFG_PEXPMECR, val); > > + > > In my previous review, I asked you to use a common function and just pass the > offsets, as the sequence is same for both the SoCs. But you ignored it :/ > Sorry, I will fix it in the next version. > > + /* > > +* There is no specific register to check for PME_To_Ack from endpoint. > > +* So on the safe side, wait for PCIE_PME_TO_L2_TIMEOUT_US. > > +*/ > > + mdelay(PCIE_PME_TO_L2_TIMEOUT_US/1000); > > + > > + /* > > +* Layerscape hardware reference manual recommends clearing the > > PMXMTTURNOFF bit > > +* to complete the PME_Turn_Off handshake. > > +*/ > > + regmap_read(pcie->scfg, SCFG_PEXPMECR, &val); > > + val &= ~PEXPME(pcie->index); > > + regmap_write(pcie->scfg, SCFG_PEXPMECR, val); > > +} > > + > > +static int ls1043a_pcie_exit_from_l2(struct dw_pcie_rp *pp) > > +{ > > + struct dw_pcie *pci = to_dw_pcie_from_pp(pp); > > + struct ls_pcie *pcie = to_ls_pcie(pci); > > + u32 val; > > + > > + /* > > +* Only way let PEX module exit L2 is do a software reset. Same comment applies as patch 2/4. > > - Mani > > -- > மணிவண்ணன் சதாசிவம்
Re: [PATCH v3 3/4] PCI: layerscape: Rename pf_* as pf_lut_*
On Thu, Nov 02, 2023 at 11:03:14PM +0530, Manivannan Sadhasivam wrote: > On Tue, Oct 17, 2023 at 03:31:44PM -0400, Frank Li wrote: > > 'pf' and 'lut' is just difference name in difference chips, but basic it is > > a MMIO base address plus an offset. > > > > Rename it to avoid duplicate pf_* and lut_* in driver. > > > > "pci-layerscape-ep.c" uses "ls_lut_" prefix and now you are using "pf_lut_". > May > I know the difference between these two? Can we just use a common name? Some chips use the name 'lut', some use the name 'pf'. I think ls_pcie_pf_lut_*() is a better name than 'ls_lut_' in pci-layerscape-ep.c, to align with the spec. If needed, I can rename "ls_lut_" in "pci-layerscape-ep.c" later. Frank > > - Mani > > > Signed-off-by: Frank Li > > --- > > > > Notes: > > change from v1 to v3 > > - new patch at v3 > > > > drivers/pci/controller/dwc/pci-layerscape.c | 34 ++--- > > 1 file changed, 17 insertions(+), 17 deletions(-) > > > > diff --git a/drivers/pci/controller/dwc/pci-layerscape.c > > b/drivers/pci/controller/dwc/pci-layerscape.c > > index 6f47cfe146c44..4b663b20d8612 100644 > > --- a/drivers/pci/controller/dwc/pci-layerscape.c > > +++ b/drivers/pci/controller/dwc/pci-layerscape.c > > @@ -46,7 +46,7 @@ > > #define LS_PCIE_DRV_SCFG BIT(0) > > > > struct ls_pcie_drvdata { > > - const u32 pf_off; > > + const u32 pf_lut_off; > > const struct dw_pcie_host_ops *ops; > > int (*exit_from_l2)(struct dw_pcie_rp *pp); > > int flags; > > @@ -56,13 +56,13 @@ struct ls_pcie_drvdata { > > struct ls_pcie { > > struct dw_pcie *pci; > > const struct ls_pcie_drvdata *drvdata; > > - void __iomem *pf_base; > > + void __iomem *pf_lut_base; > > struct regmap *scfg; > > int index; > > bool big_endian; > > }; > > > > -#define ls_pcie_pf_readl_addr(addr) ls_pcie_pf_readl(pcie, addr) > > +#define ls_pcie_pf_lut_readl_addr(addr) ls_pcie_pf_lut_readl(pcie, addr) > > #define to_ls_pcie(x) dev_get_drvdata((x)->dev) > > > > static bool ls_pcie_is_bridge(struct ls_pcie *pcie) > > @@ -103,20 +103,20 @@ 
static void ls_pcie_fix_error_response(struct ls_pcie > > *pcie) > > iowrite32(PCIE_ABSERR_SETTING, pci->dbi_base + PCIE_ABSERR); > > } > > > > -static u32 ls_pcie_pf_readl(struct ls_pcie *pcie, u32 off) > > +static u32 ls_pcie_pf_lut_readl(struct ls_pcie *pcie, u32 off) > > { > > if (pcie->big_endian) > > - return ioread32be(pcie->pf_base + off); > > + return ioread32be(pcie->pf_lut_base + off); > > > > - return ioread32(pcie->pf_base + off); > > + return ioread32(pcie->pf_lut_base + off); > > } > > > > -static void ls_pcie_pf_writel(struct ls_pcie *pcie, u32 off, u32 val) > > +static void ls_pcie_pf_lut_writel(struct ls_pcie *pcie, u32 off, u32 val) > > { > > if (pcie->big_endian) > > - iowrite32be(val, pcie->pf_base + off); > > + iowrite32be(val, pcie->pf_lut_base + off); > > else > > - iowrite32(val, pcie->pf_base + off); > > + iowrite32(val, pcie->pf_lut_base + off); > > } > > > > static void ls_pcie_send_turnoff_msg(struct dw_pcie_rp *pp) > > @@ -126,11 +126,11 @@ static void ls_pcie_send_turnoff_msg(struct > > dw_pcie_rp *pp) > > u32 val; > > int ret; > > > > - val = ls_pcie_pf_readl(pcie, LS_PCIE_PF_MCR); > > + val = ls_pcie_pf_lut_readl(pcie, LS_PCIE_PF_MCR); > > val |= PF_MCR_PTOMR; > > - ls_pcie_pf_writel(pcie, LS_PCIE_PF_MCR, val); > > + ls_pcie_pf_lut_writel(pcie, LS_PCIE_PF_MCR, val); > > > > - ret = readx_poll_timeout(ls_pcie_pf_readl_addr, LS_PCIE_PF_MCR, > > + ret = readx_poll_timeout(ls_pcie_pf_lut_readl_addr, LS_PCIE_PF_MCR, > > val, !(val & PF_MCR_PTOMR), > > PCIE_PME_TO_L2_TIMEOUT_US/10, > > PCIE_PME_TO_L2_TIMEOUT_US); > > @@ -149,15 +149,15 @@ static int ls_pcie_exit_from_l2(struct dw_pcie_rp *pp) > > * Set PF_MCR_EXL2S bit in LS_PCIE_PF_MCR register for the link > > * to exit L2 state. 
> > */ > > - val = ls_pcie_pf_readl(pcie, LS_PCIE_PF_MCR); > > + val = ls_pcie_pf_lut_readl(pcie, LS_PCIE_PF_MCR); > > val |= PF_MCR_EXL2S; > > - ls_pcie_pf_writel(pcie, LS_PCIE_PF_MCR, val); > > + ls_pcie_pf_lut_writel(pcie, LS_PCIE_PF_MCR, val); > > > > /* > > * L2 exit timeout of 10ms is not defined in the specifications, > > * it was chosen based on empirical observations. > > */ > > - ret = readx_poll_timeout(ls_pcie_pf_readl_addr, LS_PCIE_PF_MCR, > > + ret = readx_poll_timeout(ls_pcie_pf_lut_readl_addr, LS_PCIE_PF_MCR, > > val, !(val & PF_MCR_EXL2S), > > 1000, > > 1); > > @@ -245,7 +245,7 @@ static const struct ls_pcie_drvdata ls1021a_drvdata = { > > }; > > > > static const struct ls_pcie_drvdata layerscape_drvdata = { > > - .pf_off = 0xc, > > + .pf_lut_off = 0xc, >
Re: [PATCH v3 2/4] PCI: layerscape: Add suspend/resume for ls1021a
On Thu, Nov 02, 2023 at 10:58:09PM +0530, Manivannan Sadhasivam wrote: > On Tue, Oct 17, 2023 at 03:31:43PM -0400, Frank Li wrote: > > ls1021a add suspend/resume support. > > > > Implement callback ls1021a_pcie_send_turnoff_msg(), which write scfg's > > SCFG_PEXPMWRCR to issue PME_Turn_off message. > > > > Implement ls1021a_pcie_exit_from_l2() to let controller exit L2 state. > > > > I'd like to reword it to better reflect what the patch does: > > "In the suspend path, PME_Turn_Off message is sent to the endpoint to > transition > the link to L2/L3_Ready state. In this SoC, there is no way to check if the > controller has received the PME_To_Ack from the endpoint or not. So to be on > the > safer side, the driver just waits for PCIE_PME_TO_L2_TIMEOUT_US before > asserting > the SoC specific PMXMTTURNOFF bit to complete the PME_Turn_Off handshake. This > link would then enter L2/L3 state depending on the VAUX supply. > > In the resume path, the link is brought back from L2 to L0 by doing a software > reset." > > Although I do have questions on the resume behavior below. 
> > > Signed-off-by: Frank Li > > --- > > > > Notes: > > Change from v2 to v3 > > - update according to mani's feedback > > change from v1 to v2 > > - change subject 'a' to 'A' > > > > drivers/pci/controller/dwc/pci-layerscape.c | 86 - > > 1 file changed, 85 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/pci/controller/dwc/pci-layerscape.c > > b/drivers/pci/controller/dwc/pci-layerscape.c > > index aea89926bcc4f..6f47cfe146c44 100644 > > --- a/drivers/pci/controller/dwc/pci-layerscape.c > > +++ b/drivers/pci/controller/dwc/pci-layerscape.c > > @@ -35,11 +35,21 @@ > > #define PF_MCR_PTOMR BIT(0) > > #define PF_MCR_EXL2S BIT(1) > > > > +/* LS1021A PEXn PM Write Control Register */ > > +#define SCFG_PEXPMWRCR(idx) (0x5c + (idx) * 0x64) > > +#define PMXMTTURNOFF BIT(31) > > +#define SCFG_PEXSFTRSTCR 0x190 > > +#define PEXSR(idx) BIT(idx) > > + > > #define PCIE_IATU_NUM 6 > > + > > +#define LS_PCIE_DRV_SCFG BIT(0) > > + > > struct ls_pcie_drvdata { > > const u32 pf_off; > > + const struct dw_pcie_host_ops *ops; > > int (*exit_from_l2)(struct dw_pcie_rp *pp); > > + int flags; > > Why not "bool scfg_support"? It will be easy to add new flags if needed in the future. 
> > > bool pm_support; > > }; > > > > @@ -47,6 +57,8 @@ struct ls_pcie { > > struct dw_pcie *pci; > > const struct ls_pcie_drvdata *drvdata; > > void __iomem *pf_base; > > + struct regmap *scfg; > > + int index; > > bool big_endian; > > }; > > > > @@ -171,13 +183,65 @@ static int ls_pcie_host_init(struct dw_pcie_rp *pp) > > return 0; > > } > > > > +static void ls1021a_pcie_send_turnoff_msg(struct dw_pcie_rp *pp) > > +{ > > + struct dw_pcie *pci = to_dw_pcie_from_pp(pp); > > + struct ls_pcie *pcie = to_ls_pcie(pci); > > + u32 val; > > + > > + /* Send PME_Turn_Off message */ > > + regmap_read(pcie->scfg, SCFG_PEXPMWRCR(pcie->index), &val); > > + val |= PMXMTTURNOFF; > > + regmap_write(pcie->scfg, SCFG_PEXPMWRCR(pcie->index), val); > > + > > + /* > > +* There is no specific register to check for PME_To_Ack from endpoint. > > +* So on the safe side, wait for PCIE_PME_TO_L2_TIMEOUT_US. > > +*/ > > + mdelay(PCIE_PME_TO_L2_TIMEOUT_US/1000); > > + > > + /* > > +* Layerscape hardware reference manual recommends clearing the > > PMXMTTURNOFF bit > > +* to complete the PME_Turn_Off handshake. > > +*/ > > + regmap_read(pcie->scfg, SCFG_PEXPMWRCR(pcie->index), &val); > > + val &= ~PMXMTTURNOFF; > > + regmap_write(pcie->scfg, SCFG_PEXPMWRCR(pcie->index), val); > > +} > > + > > +static int ls1021a_pcie_exit_from_l2(struct dw_pcie_rp *pp) > > +{ > > + struct dw_pcie *pci = to_dw_pcie_from_pp(pp); > > + struct ls_pcie *pcie = to_ls_pcie(pci); > > + u32 val; > > + > > + /* Only way exit from l2 is that do software reset */ > > So, what does exactly "software reset" mean? Are you resetting the endpoint or > some specific registers/blocks in the controller? No, it is a PCIe controller reset. It does not touch the endpoint. > > Also, what if the link goes to L3 in the case of no VAUX? I am not exactly sure; it should be related to the board design. I suppose there is not a big difference. We can improve it when we meet it. 
> > > + regmap_read(pcie->scfg, SCFG_PEXSFTRSTCR, &val); > > + val |= PEXSR(pcie->index); > > + regmap_write(pcie->scfg, SCFG_PEXSFTRSTCR, val); > > + > > + regmap_read(pcie->scfg, SCFG_PEXSFTRSTCR, &val); > > + val &= ~PEXSR(pcie->index); > > + regmap_write(pcie->scfg, SCFG_PEXSFTRSTCR, val); > > + > > + return 0; > > +} > > + > > static const struct dw_pcie_host_ops ls_pcie_host_ops = { > > .host_init = ls_pcie_host_init, > > .pme_turn_off = ls_pcie_send_turnoff_msg, > > }; > > > > +static const struct dw_pcie_host_ops ls1021a_pcie_h
Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
On Thu, Nov 02, 2023, David Matlack wrote: > On Thu, Nov 2, 2023 at 9:03 AM Sean Christopherson wrote: > > > > On Thu, Nov 02, 2023, Paolo Bonzini wrote: > > > On 10/31/23 23:39, David Matlack wrote: > > > > > > Maybe can you sketch out how you see this proposal being extensible > > > > > > to > > > > > > using guest_memfd for shared mappings? > > > > > For in-place conversions, e.g. pKVM, no additional guest_memfd is > > > > > needed. What's > > > > > missing there is the ability to (safely) mmap() guest_memfd, e.g. KVM > > > > > needs to > > > > > ensure there are no outstanding references when converting back to > > > > > private. > > > > > > > > > > For TDX/SNP, assuming we don't find a performant and robust way to do > > > > > in-place > > > > > conversions, a second fd+offset pair would be needed. > > > > Is there a way to support non-in-place conversions within a single > > > > guest_memfd? > > > > > > For TDX/SNP, you could have a hook from KVM_SET_MEMORY_ATTRIBUTES to guest > > > memory. The hook would invalidate now-private parts if they have a VMA, > > > causing a SIGSEGV/EFAULT if the host touches them. > > > > > > It would forbid mappings from multiple gfns to a single offset of the > > > guest_memfd, because then the shared vs. private attribute would be tied > > > to > > > the offset. This should not be a problem; for example, in the case of > > > SNP, > > > the RMP already requires a single mapping from host physical address to > > > guest physical address. > > > > I don't see how this can work. It's not a M:1 scenario (where M is > > multiple gfns), > > it's a 1:N scenario (wheren N is multiple offsets). The *gfn* doesn't > > change on > > a conversion, what needs to change to do non-in-place conversion is the > > pfn, which > > is effectively the guest_memfd+offset pair. 
> > > > So yes, we *could* support non-in-place conversions within a single > > guest_memfd, > > but it would require a second offset, > > Why can't KVM free the existing page at guest_memfd+offset and > allocate a new one when doing non-in-place conversions? Oh, I see what you're suggesting. Eww. It's certainly possible, but it would largely defeat the purpose of why we are adding guest_memfd in the first place. For TDX and SNP, the goal is to provide a simple, robust mechanism for isolating guest private memory so that it's all but impossible for the host to access private memory. As things stand, memory for a given guest_memfd is either private or shared (assuming we support a second guest_memfd per memslot). I.e. there's no need to track whether a given page/folio in the guest_memfd is private vs. shared. We could use memory attributes, but that further complicates things when intrahost migration (and potentially other multi-user scenarios) comes along, i.e. when KVM supports linking multiple guest_memfd files to a single inode. We'd have to ensure that all "struct kvm" instances have identical PRIVATE attributes for a given *offset* in the inode. I'm not even sure how feasible that is for intrahost migration, and that's the *easy* case, because IIRC it's already a hard requirement that the source and destination have identical gfn=>guest_memfd bindings, i.e. KVM can somewhat easily reason about gfn attributes. But even then, that only helps with the actual migration of the VM, e.g. we'd still have to figure out how to deal with .mmap() and other shared vs. private actions when linking a new guest_memfd file against an existing inode. I haven't seen the pKVM patches for supporting .mmap(), so maybe this is already a solved problem, but I'd honestly be quite surprised if it all works correctly if/when KVM supports multiple files per inode. And I don't see what value non-in-place conversions would add. 
The value added by in-place conversions, aside from the obvious preservation of data, which isn't relevant to TDX/SNP, is that it doesn't require freeing and reallocating memory to avoid double-allocating for private vs. shared. That's especially quite nice when hugepages are being used because reconstituting a hugepage "only" requires zapping SPTEs. But if KVM is freeing the private page, it's the same as punching a hole, probably quite literally, when mapping the gfn as shared. In every way I can think of, it's worse. E.g. it's more complex for KVM, and the PUNCH_HOLE => allocation operations must be serialized. Regarding double-allocating, I really, really think we should solve that in the guest. I.e. teach Linux-as-a-guest to aggressively convert at 2MiB granularity and avoid 4KiB conversions. 4KiB conversions aren't just a memory utilization problem, they're also a performance problem, e.g. they shatter hugepages (which KVM doesn't yet support recovering) and increase TLB pressure for both stage-1 and stage-2 mappings.
Re: [PATCH v3 2/4] PCI: layerscape: Add suspend/resume for ls1021a
On Tue, Oct 17, 2023 at 03:31:43PM -0400, Frank Li wrote: > ls1021a add suspend/resume support. > > Implement callback ls1021a_pcie_send_turnoff_msg(), which write scfg's > SCFG_PEXPMWRCR to issue PME_Turn_off message. > > Implement ls1021a_pcie_exit_from_l2() to let controller exit L2 state. > I'd like to reword it to better reflect what the patch does: "In the suspend path, PME_Turn_Off message is sent to the endpoint to transition the link to L2/L3_Ready state. In this SoC, there is no way to check if the controller has received the PME_To_Ack from the endpoint or not. So to be on the safer side, the driver just waits for PCIE_PME_TO_L2_TIMEOUT_US before asserting the SoC specific PMXMTTURNOFF bit to complete the PME_Turn_Off handshake. This link would then enter L2/L3 state depending on the VAUX supply. In the resume path, the link is brought back from L2 to L0 by doing a software reset." Although I do have questions on the resume behavior below. > Signed-off-by: Frank Li > --- > > Notes: > Change from v2 to v3 > - update according to mani's feedback > change from v1 to v2 > - change subject 'a' to 'A' > > drivers/pci/controller/dwc/pci-layerscape.c | 86 - > 1 file changed, 85 insertions(+), 1 deletion(-) > > diff --git a/drivers/pci/controller/dwc/pci-layerscape.c > b/drivers/pci/controller/dwc/pci-layerscape.c > index aea89926bcc4f..6f47cfe146c44 100644 > --- a/drivers/pci/controller/dwc/pci-layerscape.c > +++ b/drivers/pci/controller/dwc/pci-layerscape.c > @@ -35,11 +35,21 @@ > #define PF_MCR_PTOMR BIT(0) > #define PF_MCR_EXL2S BIT(1) > > +/* LS1021A PEXn PM Write Control Register */ > +#define SCFG_PEXPMWRCR(idx) (0x5c + (idx) * 0x64) > +#define PMXMTTURNOFF BIT(31) > +#define SCFG_PEXSFTRSTCR 0x190 > +#define PEXSR(idx) BIT(idx) > + > #define PCIE_IATU_NUM6 > > +#define LS_PCIE_DRV_SCFG BIT(0) > + > struct ls_pcie_drvdata { > const u32 pf_off; > + const struct dw_pcie_host_ops *ops; > int (*exit_from_l2)(struct dw_pcie_rp *pp); > + int flags; Why not "bool 
scfg_support"? > bool pm_support; > }; > > @@ -47,6 +57,8 @@ struct ls_pcie { > struct dw_pcie *pci; > const struct ls_pcie_drvdata *drvdata; > void __iomem *pf_base; > + struct regmap *scfg; > + int index; > bool big_endian; > }; > > @@ -171,13 +183,65 @@ static int ls_pcie_host_init(struct dw_pcie_rp *pp) > return 0; > } > > +static void ls1021a_pcie_send_turnoff_msg(struct dw_pcie_rp *pp) > +{ > + struct dw_pcie *pci = to_dw_pcie_from_pp(pp); > + struct ls_pcie *pcie = to_ls_pcie(pci); > + u32 val; > + > + /* Send PME_Turn_Off message */ > + regmap_read(pcie->scfg, SCFG_PEXPMWRCR(pcie->index), &val); > + val |= PMXMTTURNOFF; > + regmap_write(pcie->scfg, SCFG_PEXPMWRCR(pcie->index), val); > + > + /* > + * There is no specific register to check for PME_To_Ack from endpoint. > + * So on the safe side, wait for PCIE_PME_TO_L2_TIMEOUT_US. > + */ > + mdelay(PCIE_PME_TO_L2_TIMEOUT_US/1000); > + > + /* > + * Layerscape hardware reference manual recommends clearing the > PMXMTTURNOFF bit > + * to complete the PME_Turn_Off handshake. > + */ > + regmap_read(pcie->scfg, SCFG_PEXPMWRCR(pcie->index), &val); > + val &= ~PMXMTTURNOFF; > + regmap_write(pcie->scfg, SCFG_PEXPMWRCR(pcie->index), val); > +} > + > +static int ls1021a_pcie_exit_from_l2(struct dw_pcie_rp *pp) > +{ > + struct dw_pcie *pci = to_dw_pcie_from_pp(pp); > + struct ls_pcie *pcie = to_ls_pcie(pci); > + u32 val; > + > + /* Only way exit from l2 is that do software reset */ So, what does exactly "software reset" mean? Are you resetting the endpoint or some specific registers/blocks in the controller? Also, what if the link goes to L3 in the case of no VAUX? 
> + regmap_read(pcie->scfg, SCFG_PEXSFTRSTCR, &val); > + val |= PEXSR(pcie->index); > + regmap_write(pcie->scfg, SCFG_PEXSFTRSTCR, val); > + > + regmap_read(pcie->scfg, SCFG_PEXSFTRSTCR, &val); > + val &= ~PEXSR(pcie->index); > + regmap_write(pcie->scfg, SCFG_PEXSFTRSTCR, val); > + > + return 0; > +} > + > static const struct dw_pcie_host_ops ls_pcie_host_ops = { > .host_init = ls_pcie_host_init, > .pme_turn_off = ls_pcie_send_turnoff_msg, > }; > > +static const struct dw_pcie_host_ops ls1021a_pcie_host_ops = { > + .host_init = ls_pcie_host_init, > + .pme_turn_off = ls1021a_pcie_send_turnoff_msg, > +}; > + > static const struct ls_pcie_drvdata ls1021a_drvdata = { > - .pm_support = false, > + .pm_support = true, > + .ops = &ls1021a_pcie_host_ops, > + .exit_from_l2 = ls1021a_pcie_exit_from_l2, > + .flags = LS_PCIE_DRV_SCFG, > }; > > static const struct ls_pcie_drvdata layerscape_drvdata = { > @@ -205,6 +269,8 @@ static int ls_pcie_probe(struct platform_device *pdev) > str
Re: [PATCH v3 1/4] PCI: layerscape: Add function pointer for exit_from_l2()
On Thu, Nov 02, 2023 at 10:28:08PM +0530, Manivannan Sadhasivam wrote: > On Tue, Oct 17, 2023 at 03:31:42PM -0400, Frank Li wrote: > > Since difference SoCs require different sequence for exiting L2, let's add > > a separate "exit_from_l2()" callback. This callback can be used to execute > > SoC specific sequence. > > > > I missed the fact that this patch honors the return value of the callback (which > was ignored previously). So this should be added to the description as well. How about adding the below? "Change the ls_pcie_exit_from_l2() return value from void to int. Return an error when exit_from_l2() fails, to exit the suspend flow." Frank > > > Signed-off-by: Frank Li > > With that, > > Reviewed-by: Manivannan Sadhasivam > > - Mani > > > --- > > > > Notes: > > Change from v2 to v3 > > - fixed according to mani's feedback > > 1. update commit message > > 2. move dw_pcie_host_ops to next patch > > 3. check return value from exit_from_l2() > > Change from v1 to v2 > > - change subject 'a' to 'A' > > > > Change from v1 to v2 > > - change subject 'a' to 'A' > > > > drivers/pci/controller/dwc/pci-layerscape.c | 11 +-- > > 1 file changed, 9 insertions(+), 2 deletions(-) > > > > diff --git a/drivers/pci/controller/dwc/pci-layerscape.c > > b/drivers/pci/controller/dwc/pci-layerscape.c > > index 37956e09c65bd..aea89926bcc4f 100644 > > --- a/drivers/pci/controller/dwc/pci-layerscape.c > > +++ b/drivers/pci/controller/dwc/pci-layerscape.c > > @@ -39,6 +39,7 @@ > > > > struct ls_pcie_drvdata { > > const u32 pf_off; > > + int (*exit_from_l2)(struct dw_pcie_rp *pp); > > bool pm_support; > > }; > > > > @@ -125,7 +126,7 @@ static void ls_pcie_send_turnoff_msg(struct dw_pcie_rp > > *pp) > > dev_err(pcie->pci->dev, "PME_Turn_off timeout\n"); > > } > > > > -static void ls_pcie_exit_from_l2(struct dw_pcie_rp *pp) > > +static int ls_pcie_exit_from_l2(struct dw_pcie_rp *pp) > > { > > struct dw_pcie *pci = to_dw_pcie_from_pp(pp); > > struct ls_pcie *pcie = to_ls_pcie(pci); > > @@ -150,6 +151,8 @@ 
static void ls_pcie_exit_from_l2(struct dw_pcie_rp *pp) > > 1); > > if (ret) > > dev_err(pcie->pci->dev, "L2 exit timeout\n"); > > + > > + return ret; > > } > > > > static int ls_pcie_host_init(struct dw_pcie_rp *pp) > > @@ -180,6 +183,7 @@ static const struct ls_pcie_drvdata ls1021a_drvdata = { > > static const struct ls_pcie_drvdata layerscape_drvdata = { > > .pf_off = 0xc, > > .pm_support = true, > > + .exit_from_l2 = ls_pcie_exit_from_l2, > > }; > > > > static const struct of_device_id ls_pcie_of_match[] = { > > @@ -247,11 +251,14 @@ static int ls_pcie_suspend_noirq(struct device *dev) > > static int ls_pcie_resume_noirq(struct device *dev) > > { > > struct ls_pcie *pcie = dev_get_drvdata(dev); > > + int ret; > > > > if (!pcie->drvdata->pm_support) > > return 0; > > > > - ls_pcie_exit_from_l2(&pcie->pci->pp); > > + ret = pcie->drvdata->exit_from_l2(&pcie->pci->pp); > > + if (ret) > > + return ret; > > > > return dw_pcie_resume_noirq(pcie->pci); > > } > > -- > > 2.34.1 > > > > -- > மணிவண்ணன் சதாசிவம்
Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
On Thu, Nov 02, 2023, Paolo Bonzini wrote: > On 10/31/23 23:39, David Matlack wrote: > > > > Maybe can you sketch out how you see this proposal being extensible to > > > > using guest_memfd for shared mappings? > > > For in-place conversions, e.g. pKVM, no additional guest_memfd is needed. > > > What's > > > missing there is the ability to (safely) mmap() guest_memfd, e.g. KVM > > > needs to > > > ensure there are no outstanding references when converting back to > > > private. > > > > > > For TDX/SNP, assuming we don't find a performant and robust way to do > > > in-place > > > conversions, a second fd+offset pair would be needed. > > Is there a way to support non-in-place conversions within a single > > guest_memfd? > > For TDX/SNP, you could have a hook from KVM_SET_MEMORY_ATTRIBUTES to guest > memory. The hook would invalidate now-private parts if they have a VMA, > causing a SIGSEGV/EFAULT if the host touches them. > > It would forbid mappings from multiple gfns to a single offset of the > guest_memfd, because then the shared vs. private attribute would be tied to > the offset. This should not be a problem; for example, in the case of SNP, > the RMP already requires a single mapping from host physical address to > guest physical address. I don't see how this can work. It's not an M:1 scenario (where M is multiple gfns), it's a 1:N scenario (where N is multiple offsets). The *gfn* doesn't change on a conversion; what needs to change to do non-in-place conversion is the pfn, which is effectively the guest_memfd+offset pair. So yes, we *could* support non-in-place conversions within a single guest_memfd, but it would require a second offset, at which point it makes sense to add a second file descriptor as well. Userspace could still use a single guest_memfd instance, i.e. pass in the same file descriptor but different offsets.
Re: [PATCH v3 3/4] PCI: layerscape: Rename pf_* as pf_lut_*
On Tue, Oct 17, 2023 at 03:31:44PM -0400, Frank Li wrote: > 'pf' and 'lut' is just difference name in difference chips, but basic it is > a MMIO base address plus an offset. > > Rename it to avoid duplicate pf_* and lut_* in driver. > "pci-layerscape-ep.c" uses "ls_lut_" prefix and now you are using "pf_lut_". May I know the difference between these two? Can we just use a common name? - Mani > Signed-off-by: Frank Li > --- > > Notes: > change from v1 to v3 > - new patch at v3 > > drivers/pci/controller/dwc/pci-layerscape.c | 34 ++--- > 1 file changed, 17 insertions(+), 17 deletions(-) > > diff --git a/drivers/pci/controller/dwc/pci-layerscape.c > b/drivers/pci/controller/dwc/pci-layerscape.c > index 6f47cfe146c44..4b663b20d8612 100644 > --- a/drivers/pci/controller/dwc/pci-layerscape.c > +++ b/drivers/pci/controller/dwc/pci-layerscape.c > @@ -46,7 +46,7 @@ > #define LS_PCIE_DRV_SCFG BIT(0) > > struct ls_pcie_drvdata { > - const u32 pf_off; > + const u32 pf_lut_off; > const struct dw_pcie_host_ops *ops; > int (*exit_from_l2)(struct dw_pcie_rp *pp); > int flags; > @@ -56,13 +56,13 @@ struct ls_pcie_drvdata { > struct ls_pcie { > struct dw_pcie *pci; > const struct ls_pcie_drvdata *drvdata; > - void __iomem *pf_base; > + void __iomem *pf_lut_base; > struct regmap *scfg; > int index; > bool big_endian; > }; > > -#define ls_pcie_pf_readl_addr(addr) ls_pcie_pf_readl(pcie, addr) > +#define ls_pcie_pf_lut_readl_addr(addr) ls_pcie_pf_lut_readl(pcie, addr) > #define to_ls_pcie(x)dev_get_drvdata((x)->dev) > > static bool ls_pcie_is_bridge(struct ls_pcie *pcie) > @@ -103,20 +103,20 @@ static void ls_pcie_fix_error_response(struct ls_pcie > *pcie) > iowrite32(PCIE_ABSERR_SETTING, pci->dbi_base + PCIE_ABSERR); > } > > -static u32 ls_pcie_pf_readl(struct ls_pcie *pcie, u32 off) > +static u32 ls_pcie_pf_lut_readl(struct ls_pcie *pcie, u32 off) > { > if (pcie->big_endian) > - return ioread32be(pcie->pf_base + off); > + return ioread32be(pcie->pf_lut_base + off); > > - return 
ioread32(pcie->pf_base + off); > + return ioread32(pcie->pf_lut_base + off); > } > > -static void ls_pcie_pf_writel(struct ls_pcie *pcie, u32 off, u32 val) > +static void ls_pcie_pf_lut_writel(struct ls_pcie *pcie, u32 off, u32 val) > { > if (pcie->big_endian) > - iowrite32be(val, pcie->pf_base + off); > + iowrite32be(val, pcie->pf_lut_base + off); > else > - iowrite32(val, pcie->pf_base + off); > + iowrite32(val, pcie->pf_lut_base + off); > } > > static void ls_pcie_send_turnoff_msg(struct dw_pcie_rp *pp) > @@ -126,11 +126,11 @@ static void ls_pcie_send_turnoff_msg(struct dw_pcie_rp > *pp) > u32 val; > int ret; > > - val = ls_pcie_pf_readl(pcie, LS_PCIE_PF_MCR); > + val = ls_pcie_pf_lut_readl(pcie, LS_PCIE_PF_MCR); > val |= PF_MCR_PTOMR; > - ls_pcie_pf_writel(pcie, LS_PCIE_PF_MCR, val); > + ls_pcie_pf_lut_writel(pcie, LS_PCIE_PF_MCR, val); > > - ret = readx_poll_timeout(ls_pcie_pf_readl_addr, LS_PCIE_PF_MCR, > + ret = readx_poll_timeout(ls_pcie_pf_lut_readl_addr, LS_PCIE_PF_MCR, >val, !(val & PF_MCR_PTOMR), >PCIE_PME_TO_L2_TIMEOUT_US/10, >PCIE_PME_TO_L2_TIMEOUT_US); > @@ -149,15 +149,15 @@ static int ls_pcie_exit_from_l2(struct dw_pcie_rp *pp) >* Set PF_MCR_EXL2S bit in LS_PCIE_PF_MCR register for the link >* to exit L2 state. >*/ > - val = ls_pcie_pf_readl(pcie, LS_PCIE_PF_MCR); > + val = ls_pcie_pf_lut_readl(pcie, LS_PCIE_PF_MCR); > val |= PF_MCR_EXL2S; > - ls_pcie_pf_writel(pcie, LS_PCIE_PF_MCR, val); > + ls_pcie_pf_lut_writel(pcie, LS_PCIE_PF_MCR, val); > > /* >* L2 exit timeout of 10ms is not defined in the specifications, >* it was chosen based on empirical observations. 
>*/ > - ret = readx_poll_timeout(ls_pcie_pf_readl_addr, LS_PCIE_PF_MCR, > + ret = readx_poll_timeout(ls_pcie_pf_lut_readl_addr, LS_PCIE_PF_MCR, >val, !(val & PF_MCR_EXL2S), >1000, >1); > @@ -245,7 +245,7 @@ static const struct ls_pcie_drvdata ls1021a_drvdata = { > }; > > static const struct ls_pcie_drvdata layerscape_drvdata = { > - .pf_off = 0xc, > + .pf_lut_off = 0xc, > .pm_support = true, > .exit_from_l2 = ls_pcie_exit_from_l2, > }; > @@ -295,7 +295,7 @@ static int ls_pcie_probe(struct platform_device *pdev) > > pcie->big_endian = of_property_read_bool(dev->of_node, "big-endian"); > > - pcie->pf_base = pci->dbi_base + pcie->drvdata->pf_off; > + pcie->pf_lut_base = pci->dbi_base + pcie->drvdata->pf_lut_off; > > if (pcie->drvdata->flags & LS_PCIE_DRV_SC
Re: [PATCH v3 4/4] PCI: layerscape: Add suspend/resume for ls1043a
On Tue, Oct 17, 2023 at 03:31:45PM -0400, Frank Li wrote: > ls1043a add suspend/resume support. > Implement ls1043a_pcie_send_turnoff_msg() to send PME_Turn_Off message. > Implement ls1043a_pcie_exit_from_l2() to exit from L2 state. > Please use the suggestion I gave in patch 2/4. > Signed-off-by: Frank Li > --- > > Notes: > Change from v2 to v3 > - Remove ls_pcie_lut_readl(writel) function > > Change from v1 to v2 > - Update subject 'a' to 'A' > > drivers/pci/controller/dwc/pci-layerscape.c | 86 - > 1 file changed, 85 insertions(+), 1 deletion(-) > > diff --git a/drivers/pci/controller/dwc/pci-layerscape.c > b/drivers/pci/controller/dwc/pci-layerscape.c > index 4b663b20d8612..9656224960b0c 100644 > --- a/drivers/pci/controller/dwc/pci-layerscape.c > +++ b/drivers/pci/controller/dwc/pci-layerscape.c > @@ -41,6 +41,15 @@ > #define SCFG_PEXSFTRSTCR 0x190 > #define PEXSR(idx) BIT(idx) > > +/* LS1043A PEX PME control register */ > +#define SCFG_PEXPMECR0x144 > +#define PEXPME(idx) BIT(31 - (idx) * 4) > + > +/* LS1043A PEX LUT debug register */ > +#define LS_PCIE_LDBG 0x7fc > +#define LDBG_SR BIT(30) > +#define LDBG_WE BIT(31) > + > #define PCIE_IATU_NUM6 > > #define LS_PCIE_DRV_SCFG BIT(0) > @@ -227,6 +236,68 @@ static int ls1021a_pcie_exit_from_l2(struct dw_pcie_rp > *pp) > return 0; > } > > +static void ls1043a_pcie_send_turnoff_msg(struct dw_pcie_rp *pp) > +{ > + struct dw_pcie *pci = to_dw_pcie_from_pp(pp); > + struct ls_pcie *pcie = to_ls_pcie(pci); > + u32 val; > + > + if (!pcie->scfg) { > + dev_dbg(pcie->pci->dev, "SYSCFG is NULL\n"); > + return; > + } Why scfg is optional for this SoC and not for the other one added in patch 2/4? > + > + /* Send Turn_off message */ > + regmap_read(pcie->scfg, SCFG_PEXPMECR, &val); > + val |= PEXPME(pcie->index); > + regmap_write(pcie->scfg, SCFG_PEXPMECR, val); > + In my previous review, I asked you to use a common function and just pass the offsets, as the sequence is same for both the SoCs. 
But you ignored it :/ > + /* > + * There is no specific register to check for PME_To_Ack from endpoint. > + * So on the safe side, wait for PCIE_PME_TO_L2_TIMEOUT_US. > + */ > + mdelay(PCIE_PME_TO_L2_TIMEOUT_US/1000); > + > + /* > + * Layerscape hardware reference manual recommends clearing the > PMXMTTURNOFF bit > + * to complete the PME_Turn_Off handshake. > + */ > + regmap_read(pcie->scfg, SCFG_PEXPMECR, &val); > + val &= ~PEXPME(pcie->index); > + regmap_write(pcie->scfg, SCFG_PEXPMECR, val); > +} > + > +static int ls1043a_pcie_exit_from_l2(struct dw_pcie_rp *pp) > +{ > + struct dw_pcie *pci = to_dw_pcie_from_pp(pp); > + struct ls_pcie *pcie = to_ls_pcie(pci); > + u32 val; > + > + /* > + * Only way let PEX module exit L2 is do a software reset. Same comment applies as patch 2/4. - Mani -- மணிவண்ணன் சதாசிவம்
Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory
On Thu, Nov 2, 2023 at 4:38 PM Sean Christopherson wrote: > Actually, looking at this again, there's not actually a hard dependency on > THP. > A THP-enabled kernel _probably_ gives a higher probability of using > hugepages, > but mostly because THP selects COMPACTION, and I suppose because using THP for > other allocations reduces overall fragmentation. Yes, that's why I didn't even bother enabling it unless THP is enabled, but it makes even more sense to just try. > So rather than honor KVM_GUEST_MEMFD_ALLOW_HUGEPAGE iff THP is enabled, I > think > we should do the below (I verified KVM can create hugepages with THP=n). > We'll > need another capability, but (a) we probably should have that anyways and (b) > it > provides a cleaner path to adding PUD-sized hugepage support in the future. I wonder if we need KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE though. This should be a generic kernel API, and in fact the sizes are available in a not-so-friendly format in /sys/kernel/mm/hugepages. We should just add /sys/kernel/mm/hugepages/sizes that contains "2097152 1073741824" on x86 (only the former if 1G pages are not supported). Plus: is this the best API if we need something else for 1G pages? Let's drop *this* patch and proceed incrementally. (Again, this is what I want to do with this final review: identify places that are still sticky, and don't let them block the rest.) Coincidentally we have an open spot next week at Plumbers. Let's extend Fuad's section to cover more guestmem work.
Paolo > diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c > b/tools/testing/selftests/kvm/guest_memfd_test.c > index c15de9852316..c9f449718fce 100644 > --- a/tools/testing/selftests/kvm/guest_memfd_test.c > +++ b/tools/testing/selftests/kvm/guest_memfd_test.c > @@ -201,6 +201,10 @@ int main(int argc, char *argv[]) > > TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_MEMFD)); > > + if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE) && > thp_configured()) > + TEST_ASSERT_EQ(get_trans_hugepagesz(), > + > kvm_check_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE)); > + > page_size = getpagesize(); > total_size = page_size * 4; > > diff --git > a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c > b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c > index be311944e90a..245901587ed2 100644 > --- a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c > +++ b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c > @@ -396,7 +396,7 @@ static void test_mem_conversions(enum > vm_mem_backing_src_type src_type, uint32_t > > vm_enable_cap(vm, KVM_CAP_EXIT_HYPERCALL, (1 << > KVM_HC_MAP_GPA_RANGE)); > > - if (backing_src_can_be_huge(src_type)) > + if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE)) > memfd_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE; > else > memfd_flags = 0; > > -- > From: Sean Christopherson > Date: Wed, 25 Oct 2023 16:26:41 -0700 > Subject: [PATCH] KVM: Add best-effort hugepage support for dedicated guest > memory > > Extend guest_memfd to allow backing guest memory with hugepages. For now, > make hugepage utilization best-effort, i.e. fall back to non-huge mappings > if a hugepage can't be allocated. Guaranteeing hugepages would require a > dedicated memory pool and significantly more complexity and churn.. > > Require userspace to opt-in via a flag even though it's unlikely real use > cases will ever want to use order-0 pages, e.g. 
to give userspace a safety > valve in case hugepage support is buggy, and to allow for easier testing > of both paths. > > Do not take a dependency on CONFIG_TRANSPARENT_HUGEPAGE, as THP enabling > primarily deals with userspace page tables, which are explicitly not in > play for guest_memfd. Selecting THP does make obtaining hugepages more > likely, but only because THP selects CONFIG_COMPACTION. Don't select > CONFIG_COMPACTION either, because again it's not a hard dependency. > > For simplicity, require the guest_memfd size to be a multiple of the > hugepage size, e.g. so that KVM doesn't need to do bounds checking when > deciding whether or not to allocate a huge folio. > > When reporting the max order when KVM gets a pfn from guest_memfd, force > order-0 pages if the hugepage is not fully contained by the memslot > binding, e.g. if userspace requested hugepages but punches a hole in the > memslot bindings in order to emulate x86's VGA hole. > > Signed-off-by: Sean Christopherson > --- > Documentation/virt/kvm/api.rst | 17 + > include/uapi/linux/kvm.h | 3 ++ > virt/kvm/guest_memfd.c | 69 +- > virt/kvm/kvm_main.c| 4 ++ > 4 files changed, 84 insertions(+), 9 deletions(-) > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst > index e82c69d5e755..ccdd5413920d 100644 > --- a/Documen
Re: [PATCH v3 1/4] PCI: layerscape: Add function pointer for exit_from_l2()
On Tue, Oct 17, 2023 at 03:31:42PM -0400, Frank Li wrote: > Since difference SoCs require different sequence for exiting L2, let's add > a separate "exit_from_l2()" callback. This callback can be used to execute > SoC specific sequence. > I missed the fact that this patch honors the return value of the callback (which was ignored previously). So this should be added to the description as well. > Signed-off-by: Frank Li With that, Reviewed-by: Manivannan Sadhasivam - Mani > --- > > Notes: > Change from v2 to v3 > - fixed according to mani's feedback > 1. update commit message > 2. move dw_pcie_host_ops to next patch > 3. check return value from exit_from_l2() > Change from v1 to v2 > - change subject 'a' to 'A' > > Change from v1 to v2 > - change subject 'a' to 'A' > > drivers/pci/controller/dwc/pci-layerscape.c | 11 +-- > 1 file changed, 9 insertions(+), 2 deletions(-) > > diff --git a/drivers/pci/controller/dwc/pci-layerscape.c > b/drivers/pci/controller/dwc/pci-layerscape.c > index 37956e09c65bd..aea89926bcc4f 100644 > --- a/drivers/pci/controller/dwc/pci-layerscape.c > +++ b/drivers/pci/controller/dwc/pci-layerscape.c > @@ -39,6 +39,7 @@ > > struct ls_pcie_drvdata { > const u32 pf_off; > + int (*exit_from_l2)(struct dw_pcie_rp *pp); > bool pm_support; > }; > > @@ -125,7 +126,7 @@ static void ls_pcie_send_turnoff_msg(struct dw_pcie_rp > *pp) > dev_err(pcie->pci->dev, "PME_Turn_off timeout\n"); > } > > -static void ls_pcie_exit_from_l2(struct dw_pcie_rp *pp) > +static int ls_pcie_exit_from_l2(struct dw_pcie_rp *pp) > { > struct dw_pcie *pci = to_dw_pcie_from_pp(pp); > struct ls_pcie *pcie = to_ls_pcie(pci); > @@ -150,6 +151,8 @@ static void ls_pcie_exit_from_l2(struct dw_pcie_rp *pp) >1); > if (ret) > dev_err(pcie->pci->dev, "L2 exit timeout\n"); > + > + return ret; > } > > static int ls_pcie_host_init(struct dw_pcie_rp *pp) > @@ -180,6 +183,7 @@ static const struct ls_pcie_drvdata ls1021a_drvdata = { > static const struct ls_pcie_drvdata layerscape_drvdata = { > 
.pf_off = 0xc, > .pm_support = true, > + .exit_from_l2 = ls_pcie_exit_from_l2, > }; > > static const struct of_device_id ls_pcie_of_match[] = { > @@ -247,11 +251,14 @@ static int ls_pcie_suspend_noirq(struct device *dev) > static int ls_pcie_resume_noirq(struct device *dev) > { > struct ls_pcie *pcie = dev_get_drvdata(dev); > + int ret; > > if (!pcie->drvdata->pm_support) > return 0; > > - ls_pcie_exit_from_l2(&pcie->pci->pp); > + ret = pcie->drvdata->exit_from_l2(&pcie->pci->pp); > + if (ret) > + return ret; > > return dw_pcie_resume_noirq(pcie->pci); > } > -- > 2.34.1 > -- மணிவண்ணன் சதாசிவம்
Re: [PATCH v13 15/35] fs: Export anon_inode_getfile_secure() for use by KVM
On Fri, Oct 27, 2023 at 11:21:57AM -0700, Sean Christopherson wrote: > Export anon_inode_getfile_secure() so that it can be used by KVM to create > and manage file-based guest memory without needing a full-blown filesystem. > The "standard" anon_inode_getfd() doesn't work for KVM's use case as KVM > needs a unique inode for each file, e.g. to be able to independently > manage the size and lifecycle of a given file. > > Note, KVM doesn't need a "secure" version, just unique inodes, i.e. ignore > the name. > > Signed-off-by: Sean Christopherson > --- Before we enshrine this misleading name, let's rename this to: create_anon_inode_getfile() I don't claim it's a great name but it's better than *_secure() which is very confusing. So just: struct file *create_anon_inode_getfile(const char *name, const struct file_operations *fops, void *priv, int flags) May also just remove that context_inode argument from the exported function. The only other caller is io_uring. And neither it nor this patchset needs the context_inode thing afaict. Merge conflict risk is extremely low, so carrying that as part of this patchset is fine and shouldn't cause huge issues for you.
Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
On Thu, Nov 2, 2023 at 9:03 AM Sean Christopherson wrote: > > On Thu, Nov 02, 2023, Paolo Bonzini wrote: > > On 10/31/23 23:39, David Matlack wrote: > > > > > Maybe can you sketch out how you see this proposal being extensible to > > > > > using guest_memfd for shared mappings? > > > > For in-place conversions, e.g. pKVM, no additional guest_memfd is > > > > needed. What's > > > > missing there is the ability to (safely) mmap() guest_memfd, e.g. KVM > > > > needs to > > > > ensure there are no outstanding references when converting back to > > > > private. > > > > > > > > For TDX/SNP, assuming we don't find a performant and robust way to do > > > > in-place > > > > conversions, a second fd+offset pair would be needed. > > > Is there a way to support non-in-place conversions within a single > > > guest_memfd? > > > > For TDX/SNP, you could have a hook from KVM_SET_MEMORY_ATTRIBUTES to guest > > memory. The hook would invalidate now-private parts if they have a VMA, > > causing a SIGSEGV/EFAULT if the host touches them. > > > > It would forbid mappings from multiple gfns to a single offset of the > > guest_memfd, because then the shared vs. private attribute would be tied to > > the offset. This should not be a problem; for example, in the case of SNP, > > the RMP already requires a single mapping from host physical address to > > guest physical address. > > I don't see how this can work. It's not a M:1 scenario (where M is multiple > gfns), > it's a 1:N scenario (wheren N is multiple offsets). The *gfn* doesn't change > on > a conversion, what needs to change to do non-in-place conversion is the pfn, > which > is effectively the guest_memfd+offset pair. > > So yes, we *could* support non-in-place conversions within a single > guest_memfd, > but it would require a second offset, Why can't KVM free the existing page at guest_memfd+offset and allocate a new one when doing non-in-place conversions? 
> at which point it makes sense to add a > second file descriptor as well. Userspace could still use a single > guest_memfd > instance, i.e. pass in the same file descriptor but different offsets.
Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
On Thu, Nov 02, 2023, Paolo Bonzini wrote: > On 11/2/23 10:35, Huang, Kai wrote: > > IIUC KVM can already handle the case of poisoned > > page by sending signal to user app: > > > > static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, > > struct > > kvm_page_fault *fault) > > { > > ... > > > > if (fault->pfn == KVM_PFN_ERR_HWPOISON) { > > kvm_send_hwpoison_signal(fault->slot, > > fault->gfn); No, this doesn't work, because that signals the host virtual address unsigned long hva = gfn_to_hva_memslot(slot, gfn); send_sig_mceerr(BUS_MCEERR_AR, (void __user *)hva, PAGE_SHIFT, current); which is the *shared* page. > > return RET_PF_RETRY; > > } > > } > > EHWPOISON is not implemented by this series, so it should be left out of the > documentation. EHWPOISON *is* implemented. kvm_gmem_get_pfn() returns -EWPOISON as appropriate, and kvm_faultin_pfn() returns that directly without going through kvm_handle_error_pfn(). kvm_faultin_pfn_private() | |-> kvm_gmem_get_pfn() | |-> if (folio_test_hwpoison(folio)) { r = -EHWPOISON; goto out_unlock; } | |-> r = kvm_gmem_get_pfn(vcpu->kvm, fault->slot, fault->gfn, &fault->pfn, &max_order); if (r) { kvm_mmu_prepare_memory_fault_exit(vcpu, fault); return r; } | |-> ret = __kvm_faultin_pfn(vcpu, fault); if (ret != RET_PF_CONTINUE) return ret; if (unlikely(is_error_pfn(fault->pfn))) return kvm_handle_error_pfn(vcpu, fault);
Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
On Thu, Nov 02, 2023, Xiaoyao Li wrote: > On 11/2/2023 1:36 AM, Sean Christopherson wrote: > > > KVM_CAP_MEMORY_FAULT_INFO is x86 only, is it better to put this function > > > to > > > ? > > I'd prefer to keep it in generic code, as it's highly likely to end up there > > sooner than later. There's a known use case for ARM (exit to userspace on > > missing > > userspace mapping[*]), and I'm guessing pKVM (also ARM) will also utilize > > this API. > > > > [*]https://lore.kernel.org/all/20230908222905.1321305-8-amoor...@google.com > > I wonder how this CAP is supposed to be checked in userspace, for guest > memfd case? It's basically useless for guest_memfd. > if (!kvm_check_extension(s, KVM_CAP_MEMORY_FAULT_INFO) && > run->exit_reason == KVM_EXIT_MEMORY_FAULT) > abort("unexpected KVM_EXIT_MEMORY_FAULT"); > > In my implementation of QEMU patches, I find it's unnecessary. When > userspace gets an exit with KVM_EXIT_MEMORY_FAULT, it implies > "KVM_CAP_MEMORY_FAULT_INFO". > > So I don't see how it is necessary in this series. Whether it's necessary or > not for [*], I don't have the answer but we can leave the discussion to that > patch series. It's not strictly necessary there either. However, Oliver felt (and presumably still feels) quite strongly, and I agree, that neither reporting extra information shouldn't be tightly coupled to KVM_CAP_EXIT_ON_MISSING or KVM_CAP_GUEST_MEMFD. E.g. if userspace develops a "standalone" use case for KVM_CAP_MEMORY_FAULT_INFO, userspace should be able to check for support without having to take a dependency on KVM_CAP_GUEST_MEMFD, especially since because KVM_CAP_GUEST_MEMFD may not be supported, i.e. userspace should be able to do: if (!kvm_check_extension(s, KVM_CAP_MEMORY_FAULT_INFO)) abort("KVM_CAP_MEMORY_FAULT_INFO required for fancy feature XYZ");
Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
On 10/31/23 23:39, David Matlack wrote: Maybe can you sketch out how you see this proposal being extensible to using guest_memfd for shared mappings? For in-place conversions, e.g. pKVM, no additional guest_memfd is needed. What's missing there is the ability to (safely) mmap() guest_memfd, e.g. KVM needs to ensure there are no outstanding references when converting back to private. For TDX/SNP, assuming we don't find a performant and robust way to do in-place conversions, a second fd+offset pair would be needed. Is there a way to support non-in-place conversions within a single guest_memfd? For TDX/SNP, you could have a hook from KVM_SET_MEMORY_ATTRIBUTES to guest memory. The hook would invalidate now-private parts if they have a VMA, causing a SIGSEGV/EFAULT if the host touches them. It would forbid mappings from multiple gfns to a single offset of the guest_memfd, because then the shared vs. private attribute would be tied to the offset. This should not be a problem; for example, in the case of SNP, the RMP already requires a single mapping from host physical address to guest physical address. Paolo
Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
On Thu, Nov 02, 2023, Kai Huang wrote: > On Wed, 2023-11-01 at 10:36 -0700, Sean Christopherson wrote: > > On Wed, Nov 01, 2023, Kai Huang wrote: > > > > > > > +7.34 KVM_CAP_MEMORY_FAULT_INFO > > > > +-- > > > > + > > > > +:Architectures: x86 > > > > +:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP. > > > > + > > > > +The presence of this capability indicates that KVM_RUN will fill > > > > +kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, > > > > e.g. if > > > > +there is a valid memslot but no backing VMA for the corresponding host > > > > virtual > > > > +address. > > > > + > > > > +The information in kvm_run.memory_fault is valid if and only if > > > > KVM_RUN returns > > > > +an error with errno=EFAULT or errno=EHWPOISON *and* > > > > kvm_run.exit_reason is set > > > > +to KVM_EXIT_MEMORY_FAULT. > > > > > > IIUC returning -EFAULT or whatever -errno is sort of KVM internal > > > implementation. > > > > The errno that is returned to userspace is ABI. In KVM, it's a _very_ > > poorly > > defined ABI for the vast majority of ioctls(), but it's still technically > > ABI. > > KVM gets away with being cavalier with errno because the vast majority of > > errors > > are considered fatal by userespace, i.e. in most cases, userspace simply > > doesn't > > care about the exact errno. > > > > A good example is KVM_RUN with -EINTR; if KVM were to return something > > other than > > -EINTR on a pending signal or vcpu->run->immediate_exit, userspace would > > fall over. > > > > > Is it better to relax the validity of kvm_run.memory_fault when > > > KVM_RUN returns any -errno? > > > > Not unless there's a need to do so, and if there is then we can update the > > documentation accordingly. If KVM's ABI is that kvm_run.memory_fault is > > valid > > for any errno, then KVM would need to purge kvm_run.exit_reason super early > > in > > KVM_RUN, e.g. 
to prevent an -EINTR return due to immediate_exit from being > > misinterpreted as KVM_EXIT_MEMORY_FAULT. And purging exit_reason super > > early is > > subtly tricky because KVM's (again, poorly documented) ABI is that *some* > > exit > > reasons are preserved across KVM_RUN with vcpu->run->immediate_exit (or > > with a > > pending signal). > > > > https://lore.kernel.org/all/zffbwoxz5ui%2fg...@google.com > > > > > > Agreed with not to relax to any errno. However using -EFAULT as part of ABI > definition seems a little bit dangerous, e.g., someone could accidentally or > mistakenly return -EFAULT in KVM_RUN at early time and/or in a completely > different code path, etc. -EINTR has well defined meaning, but -EFAULT (which > is "Bad address") seems doesn't but I am not sure either. :-) KVM has returned -EFAULT since forever, i.e. it's effectively already part of the ABI. I doubt there's a userspace that relies precisely on -EFAULT, but userspace definitely will be confused if KVM returns '0' where KVM used to return -EFAULT. And so if we want to return '0', it needs to be opt-in, which means forcing userspace to enable a capability *and* requires code in KVM to conditionally return '0' instead of -EFAULT/-EHWPOISON. > One example is, for backing VMA with VM_IO | VM_PFNMAP, hva_to_pfn() returns > KVM_PFN_ERR_FAULT when the kernel cannot get a valid PFN (e.g. when SGX vepc > fault handler failed to allocate EPC) and kvm_handle_error_pfn() will just > return -EFAULT. If kvm_run.exit_reason isn't purged early then is it possible > to have some issue here? Well, yeah, but that's exactly why this series has a patch to reset exit_reason. The solution to "if KVM is buggy then bad things happen" is to not have KVM bugs :-)
Re: [PATCH v13 17/35] KVM: Add transparent hugepage support for dedicated guest memory
On Thu, Nov 02, 2023, Paolo Bonzini wrote: > On Wed, Nov 1, 2023 at 11:35 PM Sean Christopherson wrote: > > > > On Wed, Nov 01, 2023, Paolo Bonzini wrote: > > > On 11/1/23 17:36, Sean Christopherson wrote: > > > > Can you post a fixup patch? It's not clear to me exactly what behavior > > > > you intend > > > > to end up with. > > > > > > Sure, just this: > > > > > > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c > > > index 7d1a33c2ad42..34fd070e03d9 100644 > > > --- a/virt/kvm/guest_memfd.c > > > +++ b/virt/kvm/guest_memfd.c > > > @@ -430,10 +430,7 @@ int kvm_gmem_create(struct kvm *kvm, struct > > > kvm_create_guest_memfd *args) > > > { > > > loff_t size = args->size; > > > u64 flags = args->flags; > > > - u64 valid_flags = 0; > > > - > > > - if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)) > > > - valid_flags |= KVM_GUEST_MEMFD_ALLOW_HUGEPAGE; > > > + u64 valid_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE; > > > if (flags & ~valid_flags) > > > return -EINVAL; > > > @@ -441,11 +438,9 @@ int kvm_gmem_create(struct kvm *kvm, struct > > > kvm_create_guest_memfd *args) > > > if (size < 0 || !PAGE_ALIGNED(size)) > > > return -EINVAL; > > > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE > > > if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) && > > > !IS_ALIGNED(size, HPAGE_PMD_SIZE)) > > > return -EINVAL; > > > -#endif > > > > That won't work, HPAGE_PMD_SIZE is valid only for > > CONFIG_TRANSPARENT_HUGEPAGE=y. > > > > #else /* CONFIG_TRANSPARENT_HUGEPAGE */ > > #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; }) > > #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; }) > > #define HPAGE_PMD_SIZE ({ BUILD_BUG(); 0; }) > > Would have caught it when actually testing it, I guess. :) It has to > be PMD_SIZE, possibly with > > #ifdef CONFIG_TRANSPARENT_HUGEPAGE > BUILD_BUG_ON(HPAGE_PMD_SIZE != PMD_SIZE); > #endif Yeah, that works for me. Actually, looking that this again, there's not actually a hard dependency on THP. 
A THP-enabled kernel _probably_ gives a higher probability of using hugepages, but mostly because THP selects COMPACTION, and I suppose because using THP for other allocations reduces overall fragmentation. So rather than honor KVM_GUEST_MEMFD_ALLOW_HUGEPAGE iff THP is enabled, I think we should do the below (I verified KVM can create hugepages with THP=n). We'll need another capability, but (a) we probably should have that anyways and (b) it provides a cleaner path to adding PUD-sized hugepage support in the future. And then adjust the tests like so: diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c index c15de9852316..c9f449718fce 100644 --- a/tools/testing/selftests/kvm/guest_memfd_test.c +++ b/tools/testing/selftests/kvm/guest_memfd_test.c @@ -201,6 +201,10 @@ int main(int argc, char *argv[]) TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_MEMFD)); + if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE) && thp_configured()) + TEST_ASSERT_EQ(get_trans_hugepagesz(), + kvm_check_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE)); + page_size = getpagesize(); total_size = page_size * 4; diff --git a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c index be311944e90a..245901587ed2 100644 --- a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c +++ b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c @@ -396,7 +396,7 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t vm_enable_cap(vm, KVM_CAP_EXIT_HYPERCALL, (1 << KVM_HC_MAP_GPA_RANGE)); - if (backing_src_can_be_huge(src_type)) + if (kvm_has_cap(KVM_CAP_GUEST_MEMFD_HUGEPAGE_PMD_SIZE)) memfd_flags = KVM_GUEST_MEMFD_ALLOW_HUGEPAGE; else memfd_flags = 0; -- From: Sean Christopherson Date: Wed, 25 Oct 2023 16:26:41 -0700 Subject: [PATCH] KVM: Add best-effort hugepage support for dedicated guest memory Extend guest_memfd to allow 
backing guest memory with hugepages. For now, make hugepage utilization best-effort, i.e. fall back to non-huge mappings if a hugepage can't be allocated. Guaranteeing hugepages would require a dedicated memory pool and significantly more complexity and churn.. Require userspace to opt-in via a flag even though it's unlikely real use cases will ever want to use order-0 pages, e.g. to give userspace a safety valve in case hugepage support is buggy, and to allow for easier testing of both paths. Do not take a dependency on CONFIG_TRANSPARENT_HUGEPAGE, as THP enabling primarily deals with userspace page tables, which are explicitly not in play for guest_memfd. Selecting THP does make obtaining hugepages more likely, but only because THP selects CONFIG_COMPACTION. Don't select CONFIG_COMPACTION eith
Re: [PATCH v13 22/35] KVM: Allow arch code to track number of memslot address spaces per VM
On Fri, Oct 27, 2023 at 7:23 PM Sean Christopherson wrote: > > Let x86 track the number of address spaces on a per-VM basis so that KVM > can disallow SMM memslots for confidential VMs. Confidentials VMs are > fundamentally incompatible with emulating SMM, which as the name suggests > requires being able to read and write guest memory and register state. > > Disallowing SMM will simplify support for guest private memory, as KVM > will not need to worry about tracking memory attributes for multiple > address spaces (SMM is the only "non-default" address space across all > architectures). > > Signed-off-by: Sean Christopherson > --- Reviewed-by: Fuad Tabba Tested-by: Fuad Tabba Cheers, /fuad > arch/powerpc/kvm/book3s_hv.c| 2 +- > arch/x86/include/asm/kvm_host.h | 8 +++- > arch/x86/kvm/debugfs.c | 2 +- > arch/x86/kvm/mmu/mmu.c | 6 +++--- > arch/x86/kvm/x86.c | 2 +- > include/linux/kvm_host.h| 17 +++-- > virt/kvm/dirty_ring.c | 2 +- > virt/kvm/kvm_main.c | 26 ++ > 8 files changed, 39 insertions(+), 26 deletions(-) > > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c > index 130bafdb1430..9b0eaa17275a 100644 > --- a/arch/powerpc/kvm/book3s_hv.c > +++ b/arch/powerpc/kvm/book3s_hv.c > @@ -6084,7 +6084,7 @@ static int kvmhv_svm_off(struct kvm *kvm) > } > > srcu_idx = srcu_read_lock(&kvm->srcu); > - for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) { > + for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) { > struct kvm_memory_slot *memslot; > struct kvm_memslots *slots = __kvm_memslots(kvm, i); > int bkt; > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index 6702f795c862..f9e8d5642069 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -2124,9 +2124,15 @@ enum { > #define HF_SMM_MASK(1 << 1) > #define HF_SMM_INSIDE_NMI_MASK (1 << 2) > > -# define KVM_ADDRESS_SPACE_NUM 2 > +# define KVM_MAX_NR_ADDRESS_SPACES 2 > # define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & 
HF_SMM_MASK > ? 1 : 0) > # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, > (role).smm) > + > +static inline int kvm_arch_nr_memslot_as_ids(struct kvm *kvm) > +{ > + return KVM_MAX_NR_ADDRESS_SPACES; > +} > + > #else > # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, 0) > #endif > diff --git a/arch/x86/kvm/debugfs.c b/arch/x86/kvm/debugfs.c > index ee8c4c3496ed..42026b3f3ff3 100644 > --- a/arch/x86/kvm/debugfs.c > +++ b/arch/x86/kvm/debugfs.c > @@ -111,7 +111,7 @@ static int kvm_mmu_rmaps_stat_show(struct seq_file *m, > void *v) > mutex_lock(&kvm->slots_lock); > write_lock(&kvm->mmu_lock); > > - for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) { > + for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) { > int bkt; > > slots = __kvm_memslots(kvm, i); > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index c4e758f0aebb..baeba8fc1c38 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -3755,7 +3755,7 @@ static int mmu_first_shadow_root_alloc(struct kvm *kvm) > kvm_page_track_write_tracking_enabled(kvm)) > goto out_success; > > - for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) { > + for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) { > slots = __kvm_memslots(kvm, i); > kvm_for_each_memslot(slot, bkt, slots) { > /* > @@ -6294,7 +6294,7 @@ static bool kvm_rmap_zap_gfn_range(struct kvm *kvm, > gfn_t gfn_start, gfn_t gfn_e > if (!kvm_memslots_have_rmaps(kvm)) > return flush; > > - for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) { > + for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) { > slots = __kvm_memslots(kvm, i); > > kvm_for_each_memslot_in_gfn_range(&iter, slots, gfn_start, > gfn_end) { > @@ -6791,7 +6791,7 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 > gen) > * modifier prior to checking for a wrap of the MMIO generation so > * that a wrap in any address space is detected. 
> */ > - gen &= ~((u64)KVM_ADDRESS_SPACE_NUM - 1); > + gen &= ~((u64)kvm_arch_nr_memslot_as_ids(kvm) - 1); > > /* > * The very rare case: if the MMIO generation number has wrapped, > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 824b58b44382..c4d17727b199 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -12456,7 +12456,7 @@ void __user * __x86_set_memory_region(struct kvm > *kvm, int id, gpa_t gpa, > hva = slot->userspace_addr; > } > > - for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) { > + for (i = 0; i < kvm_arch_nr_memslot_as_
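The per-VM hook introduced by the patch above can be sketched in isolation. Note this is a minimal stand-alone sketch, not the kernel code: `has_protected_state` is a hypothetical field standing in for however the architecture marks a confidential VM, and the point is only that such VMs report a single memslot address space (no SMM) while ordinary x86 VMs report two.

```c
#include <assert.h>
#include <stdbool.h>

#define KVM_MAX_NR_ADDRESS_SPACES 2

/* Hypothetical stand-in for struct kvm: only the field this sketch needs. */
struct kvm { bool has_protected_state; };

/*
 * Per-VM memslot address-space count: a confidential VM gets a single
 * address space, because SMM (the only "non-default" space across all
 * architectures) cannot be emulated for it.
 */
static int kvm_arch_nr_memslot_as_ids(struct kvm *kvm)
{
	return kvm->has_protected_state ? 1 : KVM_MAX_NR_ADDRESS_SPACES;
}
```

Loops that previously iterated `KVM_ADDRESS_SPACE_NUM` times then simply iterate `kvm_arch_nr_memslot_as_ids(kvm)` times instead.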
Re: [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events
On Thu, Nov 2, 2023 at 2:41 PM Sean Christopherson wrote: > > On Thu, Nov 02, 2023, Fuad Tabba wrote: > > Hi, > > > > On Fri, Oct 27, 2023 at 7:22 PM Sean Christopherson > > wrote: > > > > > > Add flags to "struct kvm_gfn_range" to let notifier events target only > > > shared and only private mappings, and write up the existing mmu_notifier > > > events to be shared-only (private memory is never associated with a > > > userspace virtual address, i.e. can't be reached via mmu_notifiers). > > > > > > Add two flags so that KVM can handle the three possibilities (shared, > > > private, and shared+private) without needing something like a tri-state > > > enum. > > > > > > Link: https://lore.kernel.org/all/zjx0hk+kpqp0k...@google.com > > > Signed-off-by: Sean Christopherson > > > --- > > > include/linux/kvm_host.h | 2 ++ > > > virt/kvm/kvm_main.c | 7 +++ > > > 2 files changed, 9 insertions(+) > > > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > > index 96aa930536b1..89c1a991a3b8 100644 > > > --- a/include/linux/kvm_host.h > > > +++ b/include/linux/kvm_host.h > > > @@ -263,6 +263,8 @@ struct kvm_gfn_range { > > > gfn_t start; > > > gfn_t end; > > > union kvm_mmu_notifier_arg arg; > > > + bool only_private; > > > + bool only_shared; > > > > If these flags aren't used in this patch series, should this patch be > > moved to the other series? > > If *both* TDX and SNP need this patch, then I think it's probably worth > applying > it now to make their lives easier. But if only one needs the support, then I > completely agree this should be punted to whichever series needs it (this also > came up in v11, but we didn't force the issue). > > Mike, Isaku? > > > Also, if shared+private is a possibility, doesn't the prefix "only_" > > confuse things a bit? I.e., what is shared+private, is it when both > > are 0 or when both are 1? I assume it's the former (both are 0), but > > it might be clearer. 
> > Heh, I was hoping that "only_private && only_shared" would be obviously > nonsensical. > > The only alternative I can think would be to add an enum, e.g. > > enum { > PROCESS_PRIVATE_AND_SHARED, > PROCESS_ONLY_PRIVATE, > PROCESS_ONLY_SHARED, > }; > > because every other way of expressing the flags either results in more > confusion > or an unsafe default. I.e. I want zapping only private or only shared to > require > the caller to explicitly set a non-zero value, which is how I ended up with > "only_{private,shared}" as opposed to "process_{private,shared}". I don't have a strong opinion about this. Having an enum looks good to me. Cheers, /fuad
Fbdev issue after the drm updates 'drm-next-2023-10-31-1'
Hello, There is an fbdev issue with the virtio-gpu-pci and virtio-vga. (The penguins are not displayed at boot time.) Error message: [ 0.889302] virtio-pci :00:02.0: [drm] *ERROR* fbdev: Failed to setup generic emulation (ret=-2) The kernel 6.6 final doesn't have this issue. Please check the fbdev changes in the drm updates 'drm-next-2023-10-31-1'. Thanks, Christian
Re: [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events
On Thu, Nov 02, 2023, Fuad Tabba wrote: > Hi, > > On Fri, Oct 27, 2023 at 7:22 PM Sean Christopherson wrote: > > > > Add flags to "struct kvm_gfn_range" to let notifier events target only > > shared and only private mappings, and write up the existing mmu_notifier > > events to be shared-only (private memory is never associated with a > > userspace virtual address, i.e. can't be reached via mmu_notifiers). > > > > Add two flags so that KVM can handle the three possibilities (shared, > > private, and shared+private) without needing something like a tri-state > > enum. > > > > Link: https://lore.kernel.org/all/zjx0hk+kpqp0k...@google.com > > Signed-off-by: Sean Christopherson > > --- > > include/linux/kvm_host.h | 2 ++ > > virt/kvm/kvm_main.c | 7 +++ > > 2 files changed, 9 insertions(+) > > > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > > index 96aa930536b1..89c1a991a3b8 100644 > > --- a/include/linux/kvm_host.h > > +++ b/include/linux/kvm_host.h > > @@ -263,6 +263,8 @@ struct kvm_gfn_range { > > gfn_t start; > > gfn_t end; > > union kvm_mmu_notifier_arg arg; > > + bool only_private; > > + bool only_shared; > > If these flags aren't used in this patch series, should this patch be > moved to the other series? If *both* TDX and SNP need this patch, then I think it's probably worth applying it now to make their lives easier. But if only one needs the support, then I completely agree this should be punted to whichever series needs it (this also came up in v11, but we didn't force the issue). Mike, Isaku? > Also, if shared+private is a possibility, doesn't the prefix "only_" > confuse things a bit? I.e., what is shared+private, is it when both > are 0 or when both are 1? I assume it's the former (both are 0), but > it might be clearer. Heh, I was hoping that "only_private && only_shared" would be obviously nonsensical. The only alternative I can think would be to add an enum, e.g. 
enum { PROCESS_PRIVATE_AND_SHARED, PROCESS_ONLY_PRIVATE, PROCESS_ONLY_SHARED, }; because every other way of expressing the flags either results in more confusion or an unsafe default. I.e. I want zapping only private or only shared to require the caller to explicitly set a non-zero value, which is how I ended up with "only_{private,shared}" as opposed to "process_{private,shared}".
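The three valid states discussed above (and the one nonsensical combination) can be made concrete with a small sketch. This is illustrative only, assuming the flag and enum names from the thread; `scope_of()` is a hypothetical helper showing how the two booleans map onto the tri-state enum, with the zero-initialised default meaning "process both".

```c
#include <assert.h>
#include <stdbool.h>

/* The two flags from the patch; both set at once is nonsensical. */
struct gfn_range_flags {
	bool only_private;
	bool only_shared;
};

/* The enum alternative floated in the thread. */
enum process_scope {
	PROCESS_PRIVATE_AND_SHARED,
	PROCESS_ONLY_PRIVATE,
	PROCESS_ONLY_SHARED,
};

static enum process_scope scope_of(struct gfn_range_flags f)
{
	assert(!(f.only_private && f.only_shared)); /* invalid combination */
	if (f.only_private)
		return PROCESS_ONLY_PRIVATE;
	if (f.only_shared)
		return PROCESS_ONLY_SHARED;
	return PROCESS_PRIVATE_AND_SHARED; /* all-zero default is the safe one */
}
```

The design point either way: a zeroed structure must mean the conservative "process everything", so restricting a zap to only-private or only-shared always requires an explicit non-zero value from the caller.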
Re: [PATCH v13 21/35] KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro
On Fri, Oct 27, 2023 at 7:23 PM Sean Christopherson wrote: > > Drop __KVM_VCPU_MULTIPLE_ADDRESS_SPACE and instead check the value of > KVM_ADDRESS_SPACE_NUM. > > No functional change intended. > > Reviewed-by: Paolo Bonzini > Signed-off-by: Sean Christopherson > --- Reviewed-by: Fuad Tabba Tested-by: Fuad Tabba Cheers, /fuad > arch/x86/include/asm/kvm_host.h | 1 - > include/linux/kvm_host.h| 2 +- > 2 files changed, 1 insertion(+), 2 deletions(-) > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index 8d60e4745e8b..6702f795c862 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -2124,7 +2124,6 @@ enum { > #define HF_SMM_MASK(1 << 1) > #define HF_SMM_INSIDE_NMI_MASK (1 << 2) > > -# define __KVM_VCPU_MULTIPLE_ADDRESS_SPACE > # define KVM_ADDRESS_SPACE_NUM 2 > # define kvm_arch_vcpu_memslots_id(vcpu) ((vcpu)->arch.hflags & HF_SMM_MASK > ? 1 : 0) > # define kvm_memslots_for_spte_role(kvm, role) __kvm_memslots(kvm, > (role).smm) > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index e3223cafd7db..c3cfe08b1300 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -692,7 +692,7 @@ bool kvm_arch_irqchip_in_kernel(struct kvm *kvm); > #define KVM_MEM_SLOTS_NUM SHRT_MAX > #define KVM_USER_MEM_SLOTS (KVM_MEM_SLOTS_NUM - KVM_INTERNAL_MEM_SLOTS) > > -#ifndef __KVM_VCPU_MULTIPLE_ADDRESS_SPACE > +#if KVM_ADDRESS_SPACE_NUM == 1 > static inline int kvm_arch_vcpu_memslots_id(struct kvm_vcpu *vcpu) > { > return 0; > -- > 2.42.0.820.g83a721a137-goog >
Re: [PATCH v13 20/35] KVM: x86/mmu: Handle page fault for private memory
On Fri, Oct 27, 2023 at 7:23 PM Sean Christopherson wrote: > > From: Chao Peng > > Add support for resolving page faults on guest private memory for VMs > that differentiate between "shared" and "private" memory. For such VMs, > KVM_MEM_PRIVATE memslots can include both fd-based private memory and > hva-based shared memory, and KVM needs to map in the "correct" variant, > i.e. KVM needs to map the gfn shared/private as appropriate based on the > current state of the gfn's KVM_MEMORY_ATTRIBUTE_PRIVATE flag. > > For AMD's SEV-SNP and Intel's TDX, the guest effectively gets to request > shared vs. private via a bit in the guest page tables, i.e. what the guest > wants may conflict with the current memory attributes. To support such > "implicit" conversion requests, exit to user with KVM_EXIT_MEMORY_FAULT > to forward the request to userspace. Add a new flag for memory faults, > KVM_MEMORY_EXIT_FLAG_PRIVATE, to communicate whether the guest wants to > map memory as shared vs. private. > > Like KVM_MEMORY_ATTRIBUTE_PRIVATE, use bit 3 for flagging private memory > so that KVM can use bits 0-2 for capturing RWX behavior if/when userspace > needs such information, e.g. a likely user of KVM_EXIT_MEMORY_FAULT is to > exit on missing mappings when handling guest page fault VM-Exits. In > that case, userspace will want to know RWX information in order to > correctly/precisely resolve the fault. > > Note, private memory *must* be backed by guest_memfd, i.e. shared mappings > always come from the host userspace page tables, and private mappings > always come from a guest_memfd instance. > > Co-developed-by: Yu Zhang > Signed-off-by: Yu Zhang > Signed-off-by: Chao Peng > Co-developed-by: Sean Christopherson > Signed-off-by: Sean Christopherson > --- With my limited understanding of kvm-x86 mmu code: Reviewed-by: Fuad Tabba Tested the x86 code (as part of this patch series) on qemu. 
The x86 fault handling code was used as a guide to how it should be done for pKVM/arm64 (with similar code added there): Tested-by: Fuad Tabba Cheers, /fuad > Documentation/virt/kvm/api.rst | 8 ++- > arch/x86/kvm/mmu/mmu.c | 101 ++-- > arch/x86/kvm/mmu/mmu_internal.h | 1 + > include/linux/kvm_host.h| 8 ++- > include/uapi/linux/kvm.h| 1 + > 5 files changed, 110 insertions(+), 9 deletions(-) > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst > index 7f00c310c24a..38dc1fda4f45 100644 > --- a/Documentation/virt/kvm/api.rst > +++ b/Documentation/virt/kvm/api.rst > @@ -6837,6 +6837,7 @@ spec refer, https://github.com/riscv/riscv-sbi-doc. > > /* KVM_EXIT_MEMORY_FAULT */ > struct { > + #define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 3) > __u64 flags; > __u64 gpa; > __u64 size; > @@ -6845,8 +6846,11 @@ spec refer, https://github.com/riscv/riscv-sbi-doc. > KVM_EXIT_MEMORY_FAULT indicates the vCPU has encountered a memory fault that > could not be resolved by KVM. The 'gpa' and 'size' (in bytes) describe the > guest physical address range [gpa, gpa + size) of the fault. The 'flags' > field > -describes properties of the faulting access that are likely pertinent. > -Currently, no flags are defined. > +describes properties of the faulting access that are likely pertinent: > + > + - KVM_MEMORY_EXIT_FLAG_PRIVATE - When set, indicates the memory fault > occurred > + on a private memory access. When clear, indicates the fault occurred on a > + shared access. > > Note! KVM_EXIT_MEMORY_FAULT is unique among all KVM exit reasons in that it > accompanies a return code of '-1', not '0'! 
errno will always be set to > EFAULT > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 4167d557c577..c4e758f0aebb 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -3147,9 +3147,9 @@ static int host_pfn_mapping_level(struct kvm *kvm, > gfn_t gfn, > return level; > } > > -int kvm_mmu_max_mapping_level(struct kvm *kvm, > - const struct kvm_memory_slot *slot, gfn_t gfn, > - int max_level) > +static int __kvm_mmu_max_mapping_level(struct kvm *kvm, > + const struct kvm_memory_slot *slot, > + gfn_t gfn, int max_level, bool > is_private) > { > struct kvm_lpage_info *linfo; > int host_level; > @@ -3161,6 +3161,9 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm, > break; > } > > + if (is_private) > + return max_level; > + > if (max_level == PG_LEVEL_4K) > return PG_LEVEL_4K; > > @@ -3168,6 +3171,16 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm, > return min(host_level, max_level); > } > > +int kvm_mmu_max_mapping_level(struct kvm *kvm, > + const st
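The flag layout described in the commit message above is easy to sketch: bit 3 is chosen to line up with `KVM_MEMORY_ATTRIBUTE_PRIVATE`, leaving bits 0-2 free for possible future R/W/X reporting on `KVM_EXIT_MEMORY_FAULT`. The `fault_is_private()` helper below is hypothetical, just to show how userspace would decode the flag.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bit 3, mirroring KVM_MEMORY_ATTRIBUTE_PRIVATE; bits 0-2 are reserved
 * for potential R/W/X information about the faulting access.
 */
#define KVM_MEMORY_EXIT_FLAG_PRIVATE (1ULL << 3)

/* Hypothetical userspace helper: was the fault on a private access? */
static int fault_is_private(uint64_t flags)
{
	return (flags & KVM_MEMORY_EXIT_FLAG_PRIVATE) != 0;
}
```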
Re: [PATCH v13 18/35] KVM: x86: "Reset" vcpu->run->exit_reason early in KVM_RUN
On Fri, Oct 27, 2023 at 7:23 PM Sean Christopherson wrote: > > Initialize run->exit_reason to KVM_EXIT_UNKNOWN early in KVM_RUN to reduce > the probability of exiting to userspace with a stale run->exit_reason that > *appears* to be valid. > > To support fd-based guest memory (guest memory without a corresponding > userspace virtual address), KVM will exit to userspace for various memory > related errors, which userspace *may* be able to resolve, instead of using > e.g. BUS_MCEERR_AR. And in the more distant future, KVM will also likely > utilize the same functionality to let userspace "intercept" and handle > memory faults when the userspace mapping is missing, i.e. when fast gup() > fails. > > Because many of KVM's internal APIs related to guest memory use '0' to > indicate "success, continue on" and not "exit to userspace", reporting > memory faults/errors to userspace will set run->exit_reason and > corresponding fields in the run structure fields in conjunction with a > a non-zero, negative return code, e.g. -EFAULT or -EHWPOISON. And because > KVM already returns -EFAULT in many paths, there's a relatively high > probability that KVM could return -EFAULT without setting run->exit_reason, > in which case reporting KVM_EXIT_UNKNOWN is much better than reporting > whatever exit reason happened to be in the run structure. > > Note, KVM must wait until after run->immediate_exit is serviced to > sanitize run->exit_reason as KVM's ABI is that run->exit_reason is > preserved across KVM_RUN when run->immediate_exit is true. 
> > Link: https://lore.kernel.org/all/20230908222905.1321305-1-amoor...@google.com > Link: https://lore.kernel.org/all/zffbwoxz5ui%2fg...@google.com > Signed-off-by: Sean Christopherson > --- Reviewed-by: Fuad Tabba Tested-by: Fuad Tabba Cheers, /fuad > arch/x86/kvm/x86.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index ee3cd8c3c0ef..f41dbb1465a0 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -10963,6 +10963,7 @@ static int vcpu_run(struct kvm_vcpu *vcpu) > { > int r; > > + vcpu->run->exit_reason = KVM_EXIT_UNKNOWN; > vcpu->arch.l1tf_flush_l1d = true; > > for (;;) { > -- > 2.42.0.820.g83a721a137-goog >
Re: [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events
Hi, On Fri, Oct 27, 2023 at 7:22 PM Sean Christopherson wrote: > > Add flags to "struct kvm_gfn_range" to let notifier events target only > shared and only private mappings, and write up the existing mmu_notifier > events to be shared-only (private memory is never associated with a > userspace virtual address, i.e. can't be reached via mmu_notifiers). > > Add two flags so that KVM can handle the three possibilities (shared, > private, and shared+private) without needing something like a tri-state > enum. > > Link: https://lore.kernel.org/all/zjx0hk+kpqp0k...@google.com > Signed-off-by: Sean Christopherson > --- > include/linux/kvm_host.h | 2 ++ > virt/kvm/kvm_main.c | 7 +++ > 2 files changed, 9 insertions(+) > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h > index 96aa930536b1..89c1a991a3b8 100644 > --- a/include/linux/kvm_host.h > +++ b/include/linux/kvm_host.h > @@ -263,6 +263,8 @@ struct kvm_gfn_range { > gfn_t start; > gfn_t end; > union kvm_mmu_notifier_arg arg; > + bool only_private; > + bool only_shared; If these flags aren't used in this patch series, should this patch be moved to the other series? Also, if shared+private is a possibility, doesn't the prefix "only_" confuse things a bit? I.e., what is shared+private, is it when both are 0 or when both are 1? I assume it's the former (both are 0), but it might be clearer. Cheers, /fuad > bool may_block; > }; > bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range); > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index cb9376833c18..302ccb87b4c1 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -635,6 +635,13 @@ static __always_inline kvm_mn_ret_t > __kvm_handle_hva_range(struct kvm *kvm, > * the second or later invocation of the handler). > */ > gfn_range.arg = range->arg; > + > + /* > +* HVA-based notifications aren't relevant to private > +* mappings as they don't have a userspace mapping. 
> +*/ > + gfn_range.only_private = false; > + gfn_range.only_shared = true; > gfn_range.may_block = range->may_block; > > /* > -- > 2.42.0.820.g83a721a137-goog >
Re: [PATCH v13 11/35] KVM: Drop .on_unlock() mmu_notifier hook
On Fri, Oct 27, 2023 at 7:22 PM Sean Christopherson wrote: > > Drop the .on_unlock() mmu_notifer hook now that it's no longer used for > notifying arch code that memory has been reclaimed. Adding .on_unlock() > and invoking it *after* dropping mmu_lock was a terrible idea, as doing so > resulted in .on_lock() and .on_unlock() having divergent and asymmetric > behavior, and set future developers up for failure, i.e. all but asked for > bugs where KVM relied on using .on_unlock() to try to run a callback while > holding mmu_lock. > > Opportunistically add a lockdep assertion in kvm_mmu_invalidate_end() to > guard against future bugs of this nature. > > Reported-by: Isaku Yamahata > Link: > https://lore.kernel.org/all/20230802203119.gb2021...@ls.amr.corp.intel.com > Signed-off-by: Sean Christopherson > --- Reviewed-by: Fuad Tabba Tested-by: Fuad Tabba Cheers, /fuad > virt/kvm/kvm_main.c | 13 +++-- > 1 file changed, 3 insertions(+), 10 deletions(-) > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 2bc04c8ae1f4..cb9376833c18 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -544,7 +544,6 @@ static inline struct kvm *mmu_notifier_to_kvm(struct > mmu_notifier *mn) > typedef bool (*gfn_handler_t)(struct kvm *kvm, struct kvm_gfn_range *range); > > typedef void (*on_lock_fn_t)(struct kvm *kvm); > -typedef void (*on_unlock_fn_t)(struct kvm *kvm); > > struct kvm_mmu_notifier_range { > /* > @@ -556,7 +555,6 @@ struct kvm_mmu_notifier_range { > union kvm_mmu_notifier_arg arg; > gfn_handler_t handler; > on_lock_fn_t on_lock; > - on_unlock_fn_t on_unlock; > bool flush_on_ret; > bool may_block; > }; > @@ -663,11 +661,8 @@ static __always_inline kvm_mn_ret_t > __kvm_handle_hva_range(struct kvm *kvm, > if (range->flush_on_ret && r.ret) > kvm_flush_remote_tlbs(kvm); > > - if (r.found_memslot) { > + if (r.found_memslot) > KVM_MMU_UNLOCK(kvm); > - if (!IS_KVM_NULL_FN(range->on_unlock)) > - range->on_unlock(kvm); > - } > > srcu_read_unlock(&kvm->srcu, 
idx); > > @@ -687,7 +682,6 @@ static __always_inline int kvm_handle_hva_range(struct > mmu_notifier *mn, > .arg= arg, > .handler= handler, > .on_lock= (void *)kvm_null_fn, > - .on_unlock = (void *)kvm_null_fn, > .flush_on_ret = true, > .may_block = false, > }; > @@ -706,7 +700,6 @@ static __always_inline int > kvm_handle_hva_range_no_flush(struct mmu_notifier *mn > .end= end, > .handler= handler, > .on_lock= (void *)kvm_null_fn, > - .on_unlock = (void *)kvm_null_fn, > .flush_on_ret = false, > .may_block = false, > }; > @@ -813,7 +806,6 @@ static int kvm_mmu_notifier_invalidate_range_start(struct > mmu_notifier *mn, > .end= range->end, > .handler= kvm_mmu_unmap_gfn_range, > .on_lock= kvm_mmu_invalidate_begin, > - .on_unlock = (void *)kvm_null_fn, > .flush_on_ret = true, > .may_block = mmu_notifier_range_blockable(range), > }; > @@ -858,6 +850,8 @@ static int kvm_mmu_notifier_invalidate_range_start(struct > mmu_notifier *mn, > > void kvm_mmu_invalidate_end(struct kvm *kvm) > { > + lockdep_assert_held_write(&kvm->mmu_lock); > + > /* > * This sequence increase will notify the kvm page fault that > * the page that is going to be mapped in the spte could have > @@ -889,7 +883,6 @@ static void kvm_mmu_notifier_invalidate_range_end(struct > mmu_notifier *mn, > .end= range->end, > .handler= (void *)kvm_null_fn, > .on_lock= kvm_mmu_invalidate_end, > - .on_unlock = (void *)kvm_null_fn, > .flush_on_ret = false, > .may_block = mmu_notifier_range_blockable(range), > }; > -- > 2.42.0.820.g83a721a137-goog >
Re: [PATCH v13 10/35] KVM: Add a dedicated mmu_notifier flag for reclaiming freed memory
On Fri, Oct 27, 2023 at 7:22 PM Sean Christopherson wrote: > > Handle AMD SEV's kvm_arch_guest_memory_reclaimed() hook by having > __kvm_handle_hva_range() return whether or not an overlapping memslot > was found, i.e. mmu_lock was acquired. Using the .on_unlock() hook > works, but kvm_arch_guest_memory_reclaimed() needs to run after dropping > mmu_lock, which makes .on_lock() and .on_unlock() asymmetrical. > > Use a small struct to return the tuple of the notifier-specific return, > plus whether or not overlap was found. Because the iteration helpers are > __always_inlined, practically speaking, the struct will never actually be > returned from a function call (not to mention the size of the struct will > be two bytes in practice). > > Signed-off-by: Sean Christopherson > --- Reviewed-by: Fuad Tabba Tested-by: Fuad Tabba Cheers, /fuad > virt/kvm/kvm_main.c | 53 +++-- > 1 file changed, 37 insertions(+), 16 deletions(-) > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 3f5b7c2c5327..2bc04c8ae1f4 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -561,6 +561,19 @@ struct kvm_mmu_notifier_range { > bool may_block; > }; > > +/* > + * The inner-most helper returns a tuple containing the return value from the > + * arch- and action-specific handler, plus a flag indicating whether or not > at > + * least one memslot was found, i.e. if the handler found guest memory. > + * > + * Note, most notifiers are averse to booleans, so even though KVM tracks the > + * return from arch code as a bool, outer helpers will cast it to an int. :-( > + */ > +typedef struct kvm_mmu_notifier_return { > + bool ret; > + bool found_memslot; > +} kvm_mn_ret_t; > + > /* > * Use a dedicated stub instead of NULL to indicate that there is no callback > * function/handler. 
The compiler technically can't guarantee that a real > @@ -582,22 +595,25 @@ static const union kvm_mmu_notifier_arg > KVM_MMU_NOTIFIER_NO_ARG; > node; \ > node = interval_tree_iter_next(node, start, last)) \ > > -static __always_inline int __kvm_handle_hva_range(struct kvm *kvm, > - const struct > kvm_mmu_notifier_range *range) > +static __always_inline kvm_mn_ret_t __kvm_handle_hva_range(struct kvm *kvm, > + const struct > kvm_mmu_notifier_range *range) > { > - bool ret = false, locked = false; > + struct kvm_mmu_notifier_return r = { > + .ret = false, > + .found_memslot = false, > + }; > struct kvm_gfn_range gfn_range; > struct kvm_memory_slot *slot; > struct kvm_memslots *slots; > int i, idx; > > if (WARN_ON_ONCE(range->end <= range->start)) > - return 0; > + return r; > > /* A null handler is allowed if and only if on_lock() is provided. */ > if (WARN_ON_ONCE(IS_KVM_NULL_FN(range->on_lock) && > IS_KVM_NULL_FN(range->handler))) > - return 0; > + return r; > > idx = srcu_read_lock(&kvm->srcu); > > @@ -631,8 +647,8 @@ static __always_inline int __kvm_handle_hva_range(struct > kvm *kvm, > gfn_range.end = hva_to_gfn_memslot(hva_end + > PAGE_SIZE - 1, slot); > gfn_range.slot = slot; > > - if (!locked) { > - locked = true; > + if (!r.found_memslot) { > + r.found_memslot = true; > KVM_MMU_LOCK(kvm); > if (!IS_KVM_NULL_FN(range->on_lock)) > range->on_lock(kvm); > @@ -640,14 +656,14 @@ static __always_inline int > __kvm_handle_hva_range(struct kvm *kvm, > if (IS_KVM_NULL_FN(range->handler)) > break; > } > - ret |= range->handler(kvm, &gfn_range); > + r.ret |= range->handler(kvm, &gfn_range); > } > } > > - if (range->flush_on_ret && ret) > + if (range->flush_on_ret && r.ret) > kvm_flush_remote_tlbs(kvm); > > - if (locked) { > + if (r.found_memslot) { > KVM_MMU_UNLOCK(kvm); > if (!IS_KVM_NULL_FN(range->on_unlock)) > range->on_unlock(kvm); > @@ -655,8 +671,7 @@ static __always_inline int __kvm_handle_hva_range(struct > kvm *kvm, > > srcu_read_unlock(&kvm->srcu, idx); > > - 
/* The notifiers are averse to booleans. :-( */ > - return (int)ret; > + return r; > } > > static __always_inline int kvm_handle_hva_range(struct mmu_notifier *mn, > @@ -677,7 +692
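The "tuple" return described in the commit message above is worth seeing on its own. This is a simplified sketch, not the kernel code: a struct of two booleans is two bytes, and because the real callers are `__always_inline`, the struct never genuinely crosses a function-call boundary in practice.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Mirrors the kvm_mn_ret_t tuple from the patch: the handler's return
 * value plus whether any overlapping memslot was found (i.e. whether
 * mmu_lock was taken).
 */
typedef struct kvm_mmu_notifier_return {
	bool ret;
	bool found_memslot;
} kvm_mn_ret_t;

/* Simplified stand-in for __kvm_handle_hva_range(). */
static inline kvm_mn_ret_t handle_range(bool found, bool handler_ret)
{
	kvm_mn_ret_t r = { .ret = handler_ret, .found_memslot = found };
	return r;
}
```

Returning the pair by value keeps `.on_lock()`/unlock symmetric while still letting the outermost caller know whether guest memory was touched.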
Re: [PATCH v13 16/35] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
On Wed, Nov 1, 2023 at 9:55 PM Sean Christopherson wrote: > > On Wed, Nov 01, 2023, Fuad Tabba wrote: > > > > > @@ -1034,6 +1034,9 @@ static void kvm_destroy_dirty_bitmap(struct > > > > > kvm_memory_slot *memslot) > > > > > /* This does not remove the slot from struct kvm_memslots data > > > > > structures */ > > > > > static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot > > > > > *slot) > > > > > { > > > > > + if (slot->flags & KVM_MEM_PRIVATE) > > > > > + kvm_gmem_unbind(slot); > > > > > + > > > > > > > > Should this be called after kvm_arch_free_memslot()? Arch-specific ode > > > > might need some of the data before the unbinding, something I thought > > > > might be necessary at one point for the pKVM port when deleting a > > > > memslot, but realized later that kvm_invalidate_memslot() -> > > > > kvm_arch_guest_memory_reclaimed() was the more logical place for it. > > > > Also, since that seems to be the pattern for arch-specific handlers in > > > > KVM. > > > > > > Maybe? But only if we can about symmetry between the allocation and free > > > paths > > > I really don't think kvm_arch_free_memslot() should be doing anything > > > beyond a > > > "pure" free. E.g. kvm_arch_free_memslot() is also called after moving a > > > memslot, > > > which hopefully we never actually have to allow for guest_memfd, but any > > > code in > > > kvm_arch_free_memslot() would bring about "what if" questions regarding > > > memslot > > > movement. I.e. the API is intended to be a "free arch metadata > > > associated with > > > the memslot". > > > > > > Out of curiosity, what does pKVM need to do at > > > kvm_arch_guest_memory_reclaimed()? > > > > It's about the host reclaiming ownership of guest memory when tearing > > down a protected guest. In pKVM, we currently teardown the guest and > > reclaim its memory when kvm_arch_destroy_vm() is called. 
The problem > > with guestmem is that kvm_gmem_unbind() could get called before that > > happens, after which the host might try to access the unbound guest > > memory. Since the host hasn't reclaimed ownership of the guest memory > > from hyp, hilarity ensues (it crashes). > > > > Initially, I hooked reclaim guest memory to kvm_free_memslot(), but > > then I needed to move the unbind later in the function. I realized > > later that kvm_arch_guest_memory_reclaimed() gets called earlier (at > > the right time), and is more aptly named. > > Aha! I suspected that might be the case. > > TDX and SNP also need to solve the same problem of "reclaiming" memory before > it > can be safely accessed by the host. The plan is to add an arch hook (or two?) > into guest_memfd that is invoked when memory is freed from guest_memfd. > > Hooking kvm_arch_guest_memory_reclaimed() isn't completely correct as > deleting a > memslot doesn't *guarantee* that guest memory is actually reclaimed (which > reminds > me, we need to figure out a better name for that thing before introducing > kvm_arch_gmem_invalidate()). I see. I'd assumed that that was what you're using. I agree that it's not completely correct, so for the moment, I assume that if that happens we have a misbehaving host, teardown the guest and reclaim its memory. > The effective false positives aren't fatal for the current usage because the > hook > is used only for x86 SEV guests to flush caches. An unnecessary flush can > cause > performance issues, but it doesn't affect correctness. For TDX and SNP, and > IIUC > pKVM, false positives are fatal because KVM could assign memory back to the > host > that is still owned by guest_memfd. Yup. > E.g. a misbehaving userspace could prematurely delete a memslot. 
And the more > fun example is intrahost migration, where the plan is to allow pointing > multiple > guest_memfd files at a single guest_memfd inode: > https://lore.kernel.org/all/cover.1691446946.git.ackerley...@google.com > > There was a lot of discussion for this, but it's scattered all over the place. > The TL;DR is is that the inode will represent physical memory, and a file will > represent a given "struct kvm" instance's view of that memory. And so the > memory > isn't reclaimed until the inode is truncated/punched. > > I _think_ this reflects the most recent plan from the guest_memfd side: > https://lore.kernel.org/all/1233d749211c08d51f9ca5d427938d47f008af1f.1689893403.git.isaku.yamah...@intel.com Thanks for pointing that out. I think this might be the way to go. I'll have a closer look at this and see how to get it to work with pKVM. Cheers, /fuad
[RFC PATCH] powerpc/book3s/hash: Drop _PAGE_PRIVILEGED from PAGE_NONE
There used to be a dependency on _PAGE_PRIVILEGED with pte_savedwrite.
But that got dropped by commit 6a56ccbcf6c6 ("mm/autonuma: use
can_change_(pte|pmd)_writable() to replace savedwrite").

With this change a NUMA fault pte (pte_protnone()) gets mapped as a
regular user pte with RWX cleared (no access). This also removes
pte_user() from book3s/64. pte_access_permitted() now checks for
_PAGE_EXEC because we now support EXECONLY mappings.

Signed-off-by: Aneesh Kumar K.V
---
 arch/powerpc/include/asm/book3s/64/pgtable.h | 23 +---
 arch/powerpc/mm/book3s64/hash_utils.c        | 17 +++
 2 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index cb77eddca54b..7c7de7b56df0 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -17,12 +17,6 @@
 #define _PAGE_EXEC		0x1 /* execute permission */
 #define _PAGE_WRITE		0x2 /* write access allowed */
 #define _PAGE_READ		0x4 /* read access allowed */
-#define _PAGE_NA		_PAGE_PRIVILEGED
-#define _PAGE_NAX		_PAGE_EXEC
-#define _PAGE_RO		_PAGE_READ
-#define _PAGE_ROX		(_PAGE_READ | _PAGE_EXEC)
-#define _PAGE_RW		(_PAGE_READ | _PAGE_WRITE)
-#define _PAGE_RWX		(_PAGE_READ | _PAGE_WRITE | _PAGE_EXEC)
 #define _PAGE_PRIVILEGED	0x8 /* kernel access only */
 #define _PAGE_SAO		0x00010 /* Strong access order */
 #define _PAGE_NON_IDEMPOTENT	0x00020 /* non idempotent memory */
@@ -119,9 +113,9 @@
 /*
  * user access blocked by key
  */
-#define _PAGE_KERNEL_RW		(_PAGE_PRIVILEGED | _PAGE_RW | _PAGE_DIRTY)
 #define _PAGE_KERNEL_RO		(_PAGE_PRIVILEGED | _PAGE_READ)
 #define _PAGE_KERNEL_ROX	(_PAGE_PRIVILEGED | _PAGE_READ | _PAGE_EXEC)
+#define _PAGE_KERNEL_RW		(_PAGE_PRIVILEGED | _PAGE_RW | _PAGE_DIRTY)
 #define _PAGE_KERNEL_RWX	(_PAGE_PRIVILEGED | _PAGE_DIRTY | _PAGE_RW | _PAGE_EXEC)
 /*
  * _PAGE_CHG_MASK masks of bits that are to be preserved across
@@ -523,19 +517,14 @@ static inline bool arch_pte_access_permitted(u64 pte, bool write, bool execute)
 }
 #endif /* CONFIG_PPC_MEM_KEYS */

-static inline bool pte_user(pte_t pte)
-{
-	return !(pte_raw(pte) & cpu_to_be64(_PAGE_PRIVILEGED));
-}
-
 #define pte_access_permitted pte_access_permitted
 static inline bool pte_access_permitted(pte_t pte, bool write)
 {
-	/*
-	 * _PAGE_READ is needed for any access and will be
-	 * cleared for PROT_NONE
-	 */
-	if (!pte_present(pte) || !pte_user(pte) || !pte_read(pte))
+
+	if (!pte_present(pte))
+		return false;
+
+	if (!(pte_read(pte) || pte_exec(pte)))
 		return false;

 	if (write && !pte_write(pte))
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c b/arch/powerpc/mm/book3s64/hash_utils.c
index ad2afa08e62e..b2eda22195f0 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -310,9 +310,26 @@ unsigned long htab_convert_pte_flags(unsigned long pteflags, unsigned long flags
 			else
 				rflags |= 0x3;
 		}
+		WARN_ON(!(pteflags & _PAGE_RWX));
 	} else {
 		if (pteflags & _PAGE_RWX)
 			rflags |= 0x2;
+		else {
+			/*
+			 * PAGE_NONE will get mapped to 0b110 (slb key 1 no access)
+			 * We picked 0b110 instead of 0b000 so that slb key 0 will
+			 * get only read only access for the same rflags.
+			 */
+			if (mmu_has_feature(MMU_FTR_KERNEL_RO))
+				rflags |= (HPTE_R_PP0 | 0x2);
+			/*
+			 * rflags = HPTE_R_N
+			 * Without KERNEL_RO feature this will result in slb
+			 * key 0 with read/write. But ISA only supports that.
+			 * There is no key 1 no-access and key 0 read-only
+			 * pp bit support.
+			 */
+		}
 		if (!((pteflags & _PAGE_WRITE) && (pteflags & _PAGE_DIRTY)))
 			rflags |= 0x1;
 	}
-- 
2.41.0
Re: [PATCH v2] powerpc/mm: Avoid calling arch_enter/leave_lazy_mmu() in set_ptes
Matthew Wilcox writes:
> On Tue, Oct 24, 2023 at 08:06:04PM +0530, Aneesh Kumar K.V wrote:
>>  		ptep++;
>> -		pte = __pte(pte_val(pte) + (1UL << PTE_RPN_SHIFT));
>>  		addr += PAGE_SIZE;
>> +		/*
>> +		 * increment the pfn.
>> +		 */
>> +		pte = pfn_pte(pte_pfn(pte) + 1, pte_pgprot((pte)));
>
> when i looked at this, it generated shit code.  did you check?

I didn't look ...

It's not super clear cut. There's some difference because pfn_pte()
contains two extra VM_BUG_ONs. But with DEBUG_VM *off* the version using
pfn_pte() generates *better* code, or at least less code, ~160
instructions vs ~200.

For some reason the version using PTE_RPN_SHIFT seems to be byte
swapping the pte an extra two times, each of which generates ~8
instructions. But I can't see why.

I tried a few other things and couldn't come up with anything that
generated better code. But I'll keep poking at it tomorrow.

cheers
Re: [PATCH v13 12/35] KVM: Prepare for handling only shared mappings in mmu_notifier events
On 11/2/23 06:59, Binbin Wu wrote:
>> Add flags to "struct kvm_gfn_range" to let notifier events target
>> only shared and only private mappings, and wire up the existing
>> mmu_notifier events to be shared-only (private memory is never
>> associated with a userspace virtual address, i.e. can't be reached
>> via mmu_notifiers).
>>
>> Add two flags so that KVM can handle the three possibilities (shared,
>> private, and shared+private) without needing something like a
>> tri-state enum.
>
> I see the two flags are set/cleared in __kvm_handle_hva_range() in
> this patch and kvm_handle_gfn_range() from the later patch 13/35, but
> I didn't see they are used/read in this patch series if I didn't miss
> anything. How are they supposed to be used in KVM?

They are going to be used by SNP/TDX patches.

Paolo
Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
On 11/2/23 10:35, Huang, Kai wrote:
> IIUC KVM can already handle the case of poisoned page by sending
> signal to user app:
>
> 	static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu,
> 					struct kvm_page_fault *fault)
> 	{
> 		...
> 		if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
> 			kvm_send_hwpoison_signal(fault->slot, fault->gfn);
> 			return RET_PF_RETRY;
> 		}
> 	}

EHWPOISON is not implemented by this series, so it should be left out
of the documentation.

> Currently as mentioned above when vepc fault handler cannot allocate
> EPC page KVM returns -EFAULT to Qemu, and Qemu prints ...
>
> 	...: Bad address ...
>
> which is nonsense. If we can use memory_fault.flags (or is
> 'fault_reason' a better name?) to carry a specific value for EPC to
> let Qemu know and Qemu can then do more reasonable things.

Yes, that's a good idea that can be implemented on top.

Paolo
Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
On 11/1/23 18:36, Sean Christopherson wrote:
> A good example is KVM_RUN with -EINTR; if KVM were to return something
> other than -EINTR on a pending signal or vcpu->run->immediate_exit,
> userspace would fall over.

And dually if KVM were to return KVM_EXIT_INTR together with something
other than -EINTR.

> And purging exit_reason super early is subtly tricky because KVM's
> (again, poorly documented) ABI is that *some* exit reasons are
> preserved across KVM_RUN with vcpu->run->immediate_exit (or with a
> pending signal).
>
> https://lore.kernel.org/all/zffbwoxz5ui%2fg...@google.com

vcpu->run->immediate_exit preserves all exit reasons, but it's not a
good idea that immediate_exit behaves differently from a pending signal
on entry to KVM_RUN (remember that immediate_exit was meant to be a
better performing alternative to KVM_SET_SIGNAL_MASK).

In principle, vcpu->run->immediate_exit could return KVM_EXIT_INTR
(perhaps even _should_, except that breaks selftests, so at this point
it is ABI).

Paolo
Re: [PATCH v13 13/35] KVM: Introduce per-page memory attributes
On Thu, 2023-11-02 at 11:32 +0100, Paolo Bonzini wrote:
> > > +#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
> > > +static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
> > > +{
> > > +	return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
> > > +}
> >
> > Only call xa_to_value() when xa_load() returns !NULL?
>
> This xarray does not store a pointer, therefore xa_load() actually
> returns an integer that is tagged with 1 in the low bit:
>
> 	static inline unsigned long xa_to_value(const void *entry)
> 	{
> 		return (unsigned long)entry >> 1;
> 	}
>
> Returning zero for an empty entry is okay, so the result of xa_load()
> can be used directly.

Thanks for explaining.

I was thinking perhaps it's better to do:

	void *entry = xa_load(...);
	return xa_is_value(entry) ? xa_to_value(entry) : 0;

But "NULL (0) >> 1" is still 0, so yes we can use directly.
Re: [PATCH v13 13/35] KVM: Introduce per-page memory attributes
On 11/2/23 04:01, Huang, Kai wrote:
> On Fri, 2023-10-27 at 11:21 -0700, Sean Christopherson wrote:
>> From: Chao Peng
>>
>> In confidential computing usages, whether a page is private or shared
>> is necessary information for KVM to perform operations like page fault
>> handling, page zapping etc. There are other potential use cases for
>> per-page memory attributes, e.g. to make memory read-only (or no-exec,
>> or exec-only, etc.) without having to modify memslots.
>>
>> Introduce two ioctls (advertised by KVM_CAP_MEMORY_ATTRIBUTES) to
>> allow userspace to operate on the per-page memory attributes.
>>   - KVM_SET_MEMORY_ATTRIBUTES to set the per-page memory attributes to
>>     a guest memory range.
>>   - KVM_GET_SUPPORTED_MEMORY_ATTRIBUTES to return the KVM supported
>>     memory attributes.
>>
>> Use an xarray to store the per-page attributes internally, with a
>> naive, not fully optimized implementation, i.e. prioritize correctness
>> over performance for the initial implementation.
>>
>> Use bit 3 for the PRIVATE attribute so that KVM can use bits 0-2 for
>> RWX attributes/protections in the future, e.g. to give userspace
>> fine-grained control over read, write, and execute protections for
>> guest memory.
>>
>> Provide arch hooks for handling attribute changes before and after
>> common code sets the new attributes, e.g. x86 will use the "pre" hook
>> to zap all relevant mappings, and the "post" hook to track whether or
>> not hugepages can be used to map the range.
>>
>> To simplify the implementation wrap the entire sequence with
>> kvm_mmu_invalidate_{begin,end}() even though the operation isn't
>> strictly guaranteed to be an invalidation. For the initial use case,
>> x86 *will* always invalidate memory, and preventing arch code from
>> creating new mappings while the attributes are in flux makes it much
>> easier to reason about the correctness of consuming attributes.
>>
>> It's possible that future usages may not require an invalidation, e.g.
>> if KVM ends up supporting RWX protections and userspace grants _more_
>> protections, but again opt for simplicity and punt optimizations to
>> if/when they are needed.
>>
>> Suggested-by: Sean Christopherson
>> Link: https://lore.kernel.org/all/y2wb48kd0j4vg...@google.com
>> Cc: Fuad Tabba
>> Cc: Xu Yilun
>> Cc: Mickaël Salaün
>> Signed-off-by: Chao Peng
>> Co-developed-by: Sean Christopherson
>> Signed-off-by: Sean Christopherson
>
> [...]
>
>> +Note, there is no "get" API. Userspace is responsible for explicitly
>> +tracking the state of a gfn/page as needed.
>> +
>
> [...]
>
>> +#ifdef CONFIG_KVM_GENERIC_MEMORY_ATTRIBUTES
>> +static inline unsigned long kvm_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
>> +{
>> +	return xa_to_value(xa_load(&kvm->mem_attr_array, gfn));
>> +}
>
> Only call xa_to_value() when xa_load() returns !NULL?

This xarray does not store a pointer, therefore xa_load() actually
returns an integer that is tagged with 1 in the low bit:

	static inline unsigned long xa_to_value(const void *entry)
	{
		return (unsigned long)entry >> 1;
	}

Returning zero for an empty entry is okay, so the result of xa_load()
can be used directly.

>> +bool kvm_range_has_memory_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
>> +				     unsigned long attrs);
>
> Seems it's not immediately clear why this function is needed in this
> patch, especially when you said there is no "get" API above. Add some
> material to changelog?

It's used by later patches; even without a "get" API, it's a pretty
fundamental functionality.

>> +bool kvm_arch_pre_set_memory_attributes(struct kvm *kvm,
>> +					struct kvm_gfn_range *range);
>> +bool kvm_arch_post_set_memory_attributes(struct kvm *kvm,
>> +					 struct kvm_gfn_range *range);
>
> Looks if this Kconfig is on, the above two arch hooks won't have
> implementation. Is it better to have two __weak empty versions here in
> this patch?
>
> Anyway, from the changelog it seems it's not mandatory for some ARCH
> to provide the above two if one wants to turn this on, i.e., the two
> hooks can be empty and the ARCH can just use the __weak version.

I think this can be added by the first arch that needs memory
attributes and also doesn't need one of these hooks. Or perhaps the x86
kvm_arch_pre_set_memory_attributes() could be made generic and thus
that would be the __weak version. It's too early to tell, so it's
better to leave the implementation to the architectures for now.

Paolo
Re: [PATCH v13 09/35] KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace
On Thu, 2023-11-02 at 03:17 +0000, Huang, Kai wrote:
> On Wed, 2023-11-01 at 10:36 -0700, Sean Christopherson wrote:
> > On Wed, Nov 01, 2023, Kai Huang wrote:
> > >
> > > > +7.34 KVM_CAP_MEMORY_FAULT_INFO
> > > > +--
> > > > +
> > > > +:Architectures: x86
> > > > +:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
> > > > +
> > > > +The presence of this capability indicates that KVM_RUN will fill
> > > > +kvm_run.memory_fault if KVM cannot resolve a guest page fault VM-Exit, e.g. if
> > > > +there is a valid memslot but no backing VMA for the corresponding host virtual
> > > > +address.
> > > > +
> > > > +The information in kvm_run.memory_fault is valid if and only if KVM_RUN returns
> > > > +an error with errno=EFAULT or errno=EHWPOISON *and* kvm_run.exit_reason is set
> > > > +to KVM_EXIT_MEMORY_FAULT.
> > >
> > > IIUC returning -EFAULT or whatever -errno is sort of KVM internal
> > > implementation.
> >
> > The errno that is returned to userspace is ABI.  In KVM, it's a _very_ poorly
> > defined ABI for the vast majority of ioctls(), but it's still technically ABI.
> > KVM gets away with being cavalier with errno because the vast majority of errors
> > are considered fatal by userespace, i.e. in most cases, userspace simply doesn't
> > care about the exact errno.
> >
> > A good example is KVM_RUN with -EINTR; if KVM were to return something other than
> > -EINTR on a pending signal or vcpu->run->immediate_exit, userspace would fall over.
> >
> > > Is it better to relax the validity of kvm_run.memory_fault when
> > > KVM_RUN returns any -errno?
> >
> > Not unless there's a need to do so, and if there is then we can update the
> > documentation accordingly.  If KVM's ABI is that kvm_run.memory_fault is valid
> > for any errno, then KVM would need to purge kvm_run.exit_reason super early in
> > KVM_RUN, e.g. to prevent an -EINTR return due to immediate_exit from being
> > misinterpreted as KVM_EXIT_MEMORY_FAULT.  And purging exit_reason super early is
> > subtly tricky because KVM's (again, poorly documented) ABI is that *some* exit
> > reasons are preserved across KVM_RUN with vcpu->run->immediate_exit (or with a
> > pending signal).
> >
> > https://lore.kernel.org/all/zffbwoxz5ui%2fg...@google.com
>
> Agreed with not to relax to any errno.  However using -EFAULT as part of ABI
> definition seems a little bit dangerous, e.g., someone could accidentally or
> mistakenly return -EFAULT in KVM_RUN at early time and/or in a completely
> different code path, etc.  -EINTR has well defined meaning, but -EFAULT (which
> is "Bad address") seems doesn't but I am not sure either. :-)
>
> One example is, for backing VMA with VM_IO | VM_PFNMAP, hva_to_pfn() returns
> KVM_PFN_ERR_FAULT when the kernel cannot get a valid PFN (e.g. when SGX vepc
> fault handler failed to allocate EPC) and kvm_handle_error_pfn() will just
> return -EFAULT.  If kvm_run.exit_reason isn't purged early then is it possible
> to have some issue here?

Also, regardless whether -EFAULT is too ambiguous to be part of ABI,
could you elaborate the EHWPOISON part?  IIUC KVM can already handle the
case of poisoned page by sending signal to user app:

	static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu,
					struct kvm_page_fault *fault)
	{
		...
		if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
			kvm_send_hwpoison_signal(fault->slot, fault->gfn);
			return RET_PF_RETRY;
		}
	}

And (sorry to hijack) I am thinking whether "SGX vepc unable to allocate
EPC" can also use this memory_fault mechanism.  Currently as mentioned
above when vepc fault handler cannot allocate EPC page KVM returns
-EFAULT to Qemu, and Qemu prints ...

	...: Bad address ...

which is nonsense.  If we can use memory_fault.flags (or is
'fault_reason' a better name?) to carry a specific value for EPC to let
Qemu know and Qemu can then do more reasonable things.
Re: Does anyone use Appletalk?
On Thu, 2023-11-02 at 13:13 +1100, Finn Thain wrote:
> So I can't object to the removal of the localtalk code. But I do object to
> the underhand manner in which it is done.

I agree. I have the impression that the actual users of the affected code
are never asked. It's usually a question posed on a mailing list where 99%
of the affected users don't hang around.

Naturally, there won't be any objections to the removal because affected
users didn't receive a heads-up in the first place.

Adrian

-- 
 .''`.  John Paul Adrian Glaubitz
: :' :  Debian Developer
`. `'   Physicist
  `-    GPG: 62FF 8A75 84E0 2956 9546 0006 7426 3B37 F5B5 F913
Re: [PATCH 1/2] kexec: fix KEXEC_FILE dependencies
Hi Arnd,

On 10/24/23 at 03:17pm, Arnd Bergmann wrote:
> On Tue, Oct 24, 2023, at 14:44, Baoquan He wrote:
> > Just add people and mailing list to CC since I didn't find this mail in
> > my box, just drag it via 'b4 am'.
> >
> > On 10/23/23 at 01:01pm, Arnd Bergmann wrote:
> ..
> >> diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
> >> index 7aff28ded2f48..bfc636d64ff2b 100644
> >> --- a/kernel/Kconfig.kexec
> >> +++ b/kernel/Kconfig.kexec
> >> @@ -36,6 +36,7 @@ config KEXEC
> >>  config KEXEC_FILE
> >>  	bool "Enable kexec file based system call"
> >>  	depends on ARCH_SUPPORTS_KEXEC_FILE
> >> +	depends on CRYPTO_SHA256=y || !ARCH_SUPPORTS_KEXEC_PURGATORY
> >
> > I am not sure if the logic is correct. In theory, kexec_file code
> > utilizes purgatory to verify the checksum digested during kernel loading
> > when try to jump to the kernel. That means kexec_file depends on
> > purgatory, but not contrary?
>
> The expression I wrote is a bit confusing, but I think this just
> keeps the existing behavior:
>
> - on architectures that select ARCH_SUPPORTS_KEXEC_PURGATORY
>   (powerpc, riscv, s390 and x86), we also require CRYPTO_SHA256
>   to be built-in.
> - on architectures that do not have ARCH_SUPPORTS_KEXEC_PURGATORY
>   (arm64 and parisc), CRYPTO_SHA256 is not used and can be disabled
>   or =m.
>
> Since ARCH_SUPPORTS_KEXEC_PURGATORY is a 'bool' symbol, it could
> be written as
>
> 	depends on (ARCH_SUPPORTS_KEXEC_PURGATORY && CRYPTO_SHA256=y) \
> 		   || !ARCH_SUPPORTS_KEXEC_PURGATORY
>
> if you find that clearer. I see that the second patch
> actually gets this wrong, it should actually do
>
> 	select CRYPTO if ARCH_SUPPORTS_KEXEC_PURGATORY
> 	select CRYPTO_SHA256 if ARCH_SUPPORTS_KEXEC_PURGATORY
>
> > With these changes, we can achieve the goal to avoid building issue,
> > whereas the code logic becomes confusing. E.g people could disable
> > CONFIG_KEXEC_FILE, but still get purgatory code built in which is
> > totally useless.
> >
> > Not sure if I think too much over this.
>
> I see your point here, and I would suggest changing the
> CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY symbol to just indicate
> the availability of the purgatory code for the arch, rather
> than actually controlling the code itself. I already mentioned
> this for s390, but riscv would need the same thing on top.
>
> I think the change below should address your concern.

Since no new comment, do you mind spinning v2 to wrap all these up?

Thanks
Baoquan

> Arnd
>
> diff --git a/arch/riscv/kernel/elf_kexec.c b/arch/riscv/kernel/elf_kexec.c
> index e60fbd8660c4..3ac341d296db 100644
> --- a/arch/riscv/kernel/elf_kexec.c
> +++ b/arch/riscv/kernel/elf_kexec.c
> @@ -266,7 +266,7 @@ static void *elf_kexec_load(struct kimage *image, char *kernel_buf,
>  		cmdline = modified_cmdline;
>  	}
>  
> -#ifdef CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY
> +#ifdef CONFIG_KEXEC_FILE
>  	/* Add purgatory to the image */
>  	kbuf.top_down = true;
>  	kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
> @@ -280,7 +280,7 @@ static void *elf_kexec_load(struct kimage *image, char *kernel_buf,
>  					sizeof(kernel_start), 0);
>  	if (ret)
>  		pr_err("Error update purgatory ret=%d\n", ret);
> -#endif /* CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY */
> +#endif /* CONFIG_KEXEC_FILE */
>  
>  	/* Add the initrd to the image */
>  	if (initrd != NULL) {
> diff --git a/arch/riscv/Kbuild b/arch/riscv/Kbuild
> index d25ad1c19f88..ab181d187c23 100644
> --- a/arch/riscv/Kbuild
> +++ b/arch/riscv/Kbuild
> @@ -5,7 +5,7 @@ obj-$(CONFIG_BUILTIN_DTB) += boot/dts/
>  obj-y += errata/
>  obj-$(CONFIG_KVM) += kvm/
>  
> -obj-$(CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY) += purgatory/
> +obj-$(CONFIG_KEXEC_FILE) += purgatory/
>  
>  # for cleaning
>  subdir- += boot
> diff --git a/arch/s390/Kbuild b/arch/s390/Kbuild
> index a5d3503b353c..361aa01dbd49 100644
> --- a/arch/s390/Kbuild
> +++ b/arch/s390/Kbuild
> @@ -7,7 +7,7 @@ obj-$(CONFIG_S390_HYPFS) += hypfs/
>  obj-$(CONFIG_APPLDATA_BASE) += appldata/
>  obj-y += net/
>  obj-$(CONFIG_PCI) += pci/
> -obj-$(CONFIG_ARCH_SUPPORTS_KEXEC_PURGATORY) += purgatory/
> +obj-$(CONFIG_KEXEC_FILE) += purgatory/
>  
>  # for cleaning
>  subdir- += boot tools
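For what it's worth, the two spellings of the dependency quoted above are logically equivalent; both encode "purgatory support implies built-in SHA256". A quick truth-table check confirms this (a sketch with plain booleans standing in for the tristate Kconfig symbols, so `=m` is not modeled):

```python
from itertools import product

def dep_v1(sha256_builtin, purgatory):
    # depends on CRYPTO_SHA256=y || !ARCH_SUPPORTS_KEXEC_PURGATORY
    return sha256_builtin or not purgatory

def dep_v2(sha256_builtin, purgatory):
    # depends on (ARCH_SUPPORTS_KEXEC_PURGATORY && CRYPTO_SHA256=y) \
    #            || !ARCH_SUPPORTS_KEXEC_PURGATORY
    return (purgatory and sha256_builtin) or not purgatory

# Both forms agree on every combination of the two symbols.
for sha, purg in product([False, True], repeat=2):
    assert dep_v1(sha, purg) == dep_v2(sha, purg)

print("equivalent")
```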