Re: [PATCH] powerpc/kvm: Fix typo in the kvm functions

2024-09-20 Thread Amit Machhiwal
On 2024/09/20 05:08 PM, Kajol Jain wrote:
> Fix typo in the following kvm function names from:
> 
> kmvhv_counters_tracepoint_regfunc -> kvmhv_counters_tracepoint_regfunc
> kmvhv_counters_tracepoint_unregfunc -> kvmhv_counters_tracepoint_unregfunc
> 

Nice catch!

Reviewed-by: Amit Machhiwal 

> Fixes: e1f288d2f9c6 ("KVM: PPC: Book3S HV nestedv2: Add support for reading 
> VPA counters for pseries guests")
> Reported-by: Madhavan Srinivasan 
> Signed-off-by: Kajol Jain 
> ---
>  arch/powerpc/include/asm/kvm_book3s_64.h | 4 ++--
>  arch/powerpc/kvm/book3s_hv.c | 4 ++--
>  arch/powerpc/kvm/trace_hv.h  | 2 +-
>  3 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
> b/arch/powerpc/include/asm/kvm_book3s_64.h
> index 2ef9a5f4e5d1..11065313d4c1 100644
> --- a/arch/powerpc/include/asm/kvm_book3s_64.h
> +++ b/arch/powerpc/include/asm/kvm_book3s_64.h
> @@ -684,8 +684,8 @@ int kvmhv_nestedv2_set_ptbl_entry(unsigned long lpid, u64 
> dw0, u64 dw1);
>  int kvmhv_nestedv2_parse_output(struct kvm_vcpu *vcpu);
>  int kvmhv_nestedv2_set_vpa(struct kvm_vcpu *vcpu, unsigned long vpa);
>  
> -int kmvhv_counters_tracepoint_regfunc(void);
> -void kmvhv_counters_tracepoint_unregfunc(void);
> +int kvmhv_counters_tracepoint_regfunc(void);
> +void kvmhv_counters_tracepoint_unregfunc(void);
>  int kvmhv_get_l2_counters_status(void);
>  void kvmhv_set_l2_counters_status(int cpu, bool status);
>  
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index ba0492f9de65..c36d036d7155 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -4154,7 +4154,7 @@ void kvmhv_set_l2_counters_status(int cpu, bool status)
>   lppaca_of(cpu).l2_counters_enable = 0;
>  }
>  
> -int kmvhv_counters_tracepoint_regfunc(void)
> +int kvmhv_counters_tracepoint_regfunc(void)
>  {
>   int cpu;
>  
> @@ -4164,7 +4164,7 @@ int kmvhv_counters_tracepoint_regfunc(void)
>   return 0;
>  }
>  
> -void kmvhv_counters_tracepoint_unregfunc(void)
> +void kvmhv_counters_tracepoint_unregfunc(void)
>  {
>   int cpu;
>  
> diff --git a/arch/powerpc/kvm/trace_hv.h b/arch/powerpc/kvm/trace_hv.h
> index 77ebc724e6cd..35fccaa575cc 100644
> --- a/arch/powerpc/kvm/trace_hv.h
> +++ b/arch/powerpc/kvm/trace_hv.h
> @@ -538,7 +538,7 @@ TRACE_EVENT_FN_COND(kvmppc_vcpu_stats,
>   TP_printk("VCPU %d: l1_to_l2_cs_time=%llu ns l2_to_l1_cs_time=%llu ns 
> l2_runtime=%llu ns",
>   __entry->vcpu_id,  __entry->l1_to_l2_cs,
>   __entry->l2_to_l1_cs, __entry->l2_runtime),
> - kmvhv_counters_tracepoint_regfunc, kmvhv_counters_tracepoint_unregfunc
> + kvmhv_counters_tracepoint_regfunc, kvmhv_counters_tracepoint_unregfunc
>  );
>  #endif
>  #endif /* _TRACE_KVM_HV_H */
> -- 
> 2.43.5
> 
> 



Re: [PATCH v3] PCI: Fix crash during pci_dev hot-unplug on pseries KVM guest

2024-08-19 Thread Amit Machhiwal
Hi Michael,

On 2024/08/17 08:59 AM, Michael Ellerman wrote:
> Amit Machhiwal  writes:
> > On 2024/08/15 01:20 PM, Michael Ellerman wrote:
> >> Bjorn Helgaas  writes:
> >> > On Sat, Aug 03, 2024 at 12:03:25AM +0530, Amit Machhiwal wrote:
> >> >> With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence
> >> >> of a PCI device attached to a PCI-bridge causes following kernel Oops on
> >> >> a pseries KVM guest:
> >> >
> >> > What is unique about pseries here?  There's nothing specific to
> >> > pseries in the patch, so I would expect this to be a generic problem
> >> > on any arch.
> >> >
> >> >>  RTAS: event: 2, Type: Hotplug Event (229), Severity: 1
> >> >>  Kernel attempted to read user page (10ec0048) - exploit attempt? 
> >> >> (uid: 0)
> >> >>  BUG: Unable to handle kernel data access on read at 0x10ec0048
> >> >
> >> > Weird address.  I would expect NULL or something.  Where did this
> >> > non-NULL pointer come from?
> >> 
> >> It originally comes from np->data, which is supposed to be an
> >> of_changeset.
> >> 
> >> The powerpc code also uses np->data for the struct pci_dn pointer, see
> >> pci_add_device_node_info().
> >> 
> >> I wonder if that's why it's non-NULL?
> >
> > I'm also looking into the code to figure out where's that value coming 
> > from. I
> > will update as soon as I get there.
> 
> Thanks.
>  
> >> Amit, do we have exact steps to reproduce this? I poked around a bit but
> >> couldn't get it to trigger.
> >
> > Sure, below are the steps:
> >
> > 1. Set CONFIG_PCI_DYNAMIC_OF_NODES=y in the kernel config and compile 
> > (Fedora
> >has it disabled in it's distro config, Ubuntu has it enabled but will 
> > have it
> >disabled in the next update)
> >
> > 2. If you are using Fedora cloud images, make sure you've these packages
> >installed:
> > $ rpm -qa | grep -e 'ppc64-diag\|powerpc-utils'
> > powerpc-utils-core-1.3.11-6.fc40.ppc64le
> > powerpc-utils-1.3.11-6.fc40.ppc64le
> > ppc64-diag-rtas-2.7.9-6.fc40.ppc64le
> > ppc64-diag-2.7.9-6.fc40.ppc64le
> >
> > 3. Hotplug a pci device as follows:
> > virsh attach-interface  bridge --source virbr0
> 
> I don't use virsh :)

No worries. Fortunately, we do have a way to do it with qemu monitor.

> 
> Any idea how to do it with just qemu monitor commands?
> 

1. Boot the guest with the below included in the qemu cmdline:

-netdev 
bridge,id=,br=virbr0,helper=/usr/libexec/qemu-bridge-helper

2. Once the guest boots, run the below command on qemu monitor to hot-plug a pci
   device:

device_add rtl8139,netdev=,mac=52:54:00:88:31:28,id=

dmesg
=
[  116.968210] pci :00:01.0: [10ec:8139] type 00 class 0x02 
conventional PCI endpoint
[  116.969260] pci :00:01.0: BAR 0 [io  0x1-0x100ff]
[  116.969904] pci :00:01.0: BAR 1 [mem 0x-0x00ff]
[  116.970745] pci :00:01.0: ROM [mem 0x-0x0003 pref]
[  116.971456] pci :00:01.0: No hypervisor support for SR-IOV on this 
device, IOV BARs disabled.
[  116.972583] pci :00:01.0: Adding to iommu group 0
[  116.978466] pci :00:01.0: ROM [mem 0x20008008-0x2000800b 
pref]: assigned
[  116.979347] pci :00:01.0: BAR 0 [io  0x10400-0x104ff]: assigned
[  116.980063] pci :00:01.0: BAR 1 [mem 0x200080001000-0x2000800010ff]: 
assigned
[  117.017187] 8139cp: 8139cp: 10/100 PCI Ethernet driver v1.3 (Mar 22, 
2004)
[  117.018577] 8139cp :00:01.0: enabling device ( -> 0003)
[  117.025414] 8139cp :00:01.0 eth1: RTL-8139C+ at 0xfbf09e59, 
52:54:00:88:31:28, IRQ 26
[  117.051028] 8139too: 8139too Fast Ethernet driver 0.9.28
[  117.076577] 8139cp :00:01.0 eth1: link up, 100Mbps, full-duplex, lpa 
0x05E1

3. Try hot-unplug of the device to recreate the kernel Oops.

device_del 

Thanks,
Amit

> cheers



Re: [PATCH v3] PCI: Fix crash during pci_dev hot-unplug on pseries KVM guest

2024-08-16 Thread Amit Machhiwal
Hi Michael,

On 2024/08/15 01:20 PM, Michael Ellerman wrote:
> Bjorn Helgaas  writes:
> > On Sat, Aug 03, 2024 at 12:03:25AM +0530, Amit Machhiwal wrote:
> >> With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence
> >> of a PCI device attached to a PCI-bridge causes following kernel Oops on
> >> a pseries KVM guest:
> >
> > What is unique about pseries here?  There's nothing specific to
> > pseries in the patch, so I would expect this to be a generic problem
> > on any arch.
> >
> >>  RTAS: event: 2, Type: Hotplug Event (229), Severity: 1
> >>  Kernel attempted to read user page (10ec0048) - exploit attempt? 
> >> (uid: 0)
> >>  BUG: Unable to handle kernel data access on read at 0x10ec0048
> >
> > Weird address.  I would expect NULL or something.  Where did this
> > non-NULL pointer come from?
> 
> It originally comes from np->data, which is supposed to be an
> of_changeset.
> 
> The powerpc code also uses np->data for the struct pci_dn pointer, see
> pci_add_device_node_info().
> 
> I wonder if that's why it's non-NULL?

I'm also looking into the code to figure out where's that value coming from. I
will update as soon as I get there.

> 
> Amit, do we have exact steps to reproduce this? I poked around a bit but
> couldn't get it to trigger.

Sure, below are the steps:

1. Set CONFIG_PCI_DYNAMIC_OF_NODES=y in the kernel config and compile (Fedora
   has it disabled in it's distro config, Ubuntu has it enabled but will have it
   disabled in the next update)

2. If you are using Fedora cloud images, make sure you've these packages
   installed:
$ rpm -qa | grep -e 'ppc64-diag\|powerpc-utils'
powerpc-utils-core-1.3.11-6.fc40.ppc64le
powerpc-utils-1.3.11-6.fc40.ppc64le
ppc64-diag-rtas-2.7.9-6.fc40.ppc64le
ppc64-diag-2.7.9-6.fc40.ppc64le

3. Hotplug a pci device as follows:
virsh attach-interface  bridge --source virbr0

4. Check if the pci device was added by running `ip a s`

5. Try hot-unplug of that device by supplying the MAC, which should trigger the
   Oops
virsh detach-interface  bridge 

Thanks,
Amit

> cheers



[PATCH v3] PCI: Fix crash during pci_dev hot-unplug on pseries KVM guest

2024-08-02 Thread Amit Machhiwal
With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence
of a PCI device attached to a PCI-bridge causes following kernel Oops on
a pseries KVM guest:

 RTAS: event: 2, Type: Hotplug Event (229), Severity: 1
 Kernel attempted to read user page (10ec0048) - exploit attempt? (uid: 0)
 BUG: Unable to handle kernel data access on read at 0x10ec0048
 Faulting instruction address: 0xc12d8728
 Oops: Kernel access of bad area, sig: 11 [#1]
 LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries

 NIP [c12d8728] __of_changeset_entry_invert+0x10/0x1ac
 LR [c12da7f0] __of_changeset_revert_entries+0x98/0x180
 Call Trace:
 [cbcc3970] [c12daa60] of_changeset_revert+0x58/0xd8
 [cbcc39c0] [c0d0ed78] of_pci_remove_node+0x74/0xb0
 [cbcc39f0] [c0cdcfe0] pci_stop_bus_device+0xf4/0x138
 [cbcc3a30] [c0cdd140] 
pci_stop_and_remove_bus_device_locked+0x34/0x64
 [cbcc3a60] [c0cf3780] remove_store+0xf0/0x108
 [cbcc3ab0] [c0e89e04] dev_attr_store+0x34/0x78
 [cbcc3ad0] [c07f8dd4] sysfs_kf_write+0x70/0xa4
 [cbcc3af0] [c07f7248] kernfs_fop_write_iter+0x1d0/0x2e0
 [cbcc3b40] [c06c9b08] vfs_write+0x27c/0x558
 [cbcc3bf0] [c06ca168] ksys_write+0x90/0x170
 [cbcc3c40] [c0033248] system_call_exception+0xf8/0x290
 [cbcc3e50] [c000d05c] system_call_vectored_common+0x15c/0x2ec


A git bisect pointed this regression to be introduced via [1] that added
a mechanism to create device tree nodes for parent PCI bridges when a
PCI device is hot-plugged.

The Oops is caused when `pci_stop_dev()` tries to remove a non-existing
device-tree node associated with the pci_dev that was earlier
hot-plugged and was attached under a pci-bridge. The PCI dev header
`dev->hdr_type` being 0, results a conditional check done with
`pci_is_bridge()` into false. Consequently, a call to
`of_pci_make_dev_node()` to create a device node is never made. When at
a later point in time, in the device node removal path, a memcpy is
attempted in `__of_changeset_entry_invert()`; since the device node was
never created, results in an Oops due to kernel read access to a bad
address.

To fix this issue, the patch introduces a new flag OF_CREATE_WITH_CSET
to keep track of device nodes created via `of_pci_make_dev_node()` and
later attempt to destroy only such device nodes which have this flag
set.

[1] commit 407d1a51921e ("PCI: Create device tree node for bridge")

Fixes: 407d1a51921e ("PCI: Create device tree node for bridge")
Reported-by: Kowshik Jois B S 
Signed-off-by: Lizhi Hou 
Signed-off-by: Amit Machhiwal 
---
Changes since v2:
* Drop v2 changes and introduce a different approach from Lizhi discussed
  over the v2 of this patch
* V2: 
https://lore.kernel.org/all/20240715080726.2496198-1-amach...@linux.ibm.com/
Changes since v1:
* Included Lizhi's suggested changes on V1
* Fixed below two warnings from Lizhi's changes and rearranged the cleanup
  part a bit in `of_pci_make_dev_node`
drivers/pci/of.c:611:6: warning: no previous prototype for 
‘of_pci_free_node’ [-Wmissing-prototypes]
  611 | void of_pci_free_node(struct device_node *np)
  |  ^~~~   
drivers/pci/of.c: In function ‘of_pci_make_dev_node’:
drivers/pci/of.c:696:1: warning: label ‘out_destroy_cset’ defined but 
not used [-Wunused-label]
  696 | out_destroy_cset:   
  | ^~~~  
* V1: 
https://lore.kernel.org/all/20240703141634.2974589-1-amach...@linux.ibm.com/

 drivers/pci/of.c   | 3 ++-
 include/linux/of.h | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/of.c b/drivers/pci/of.c
index dacea3fc5128..bc455370143e 100644
--- a/drivers/pci/of.c
+++ b/drivers/pci/of.c
@@ -653,7 +653,7 @@ void of_pci_remove_node(struct pci_dev *pdev)
struct device_node *np;
 
np = pci_device_to_OF_node(pdev);
-   if (!np || !of_node_check_flag(np, OF_DYNAMIC))
+   if (!np || !of_node_check_flag(np, OF_CREATE_WITH_CSET))
return;
pdev->dev.of_node = NULL;
 
@@ -712,6 +712,7 @@ void of_pci_make_dev_node(struct pci_dev *pdev)
if (ret)
goto out_free_node;
 
+   of_node_set_flag(np, OF_CREATE_WITH_CSET);
np->data = cset;
pdev->dev.of_node = np;
kfree(name);
diff --git a/include/linux/of.h b/include/linux/of.h
index 85b60ac9eec5..5faa5a1198c6 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -153,6 +153,7 @@ extern struct device_node *of_stdout;
 #define OF_POPULATED_BUS   4 /* platform bus created for children */
 #define OF_OVERLAY 5 /* allocated for an overlay */
 #define OF_OVERLAY_FREE_CSET   6 /* in overlay cset being freed */
+#define OF_CREATE_WITH_CSET7 /* Created

Re: [PATCH v2] PCI: Fix crash during pci_dev hot-unplug on pseries KVM guest

2024-08-02 Thread Amit Machhiwal
Hi Stefan,

On 2024/07/29 03:27 PM, Stefan Bader wrote:
> On 26.07.24 13:37, Rob Herring wrote:
> > + Ubuntu kernel list, again
> > 
> > On Thu, Jul 25, 2024 at 11:15:39PM +0530, Amit Machhiwal wrote:
> > > Hi Lizhi, Rob,
> > > 
> > > Sorry for responding late. I got busy with some other things.
> > > 
> > > On 2024/07/23 02:08 PM, Lizhi Hou wrote:
> > > > 
> > > > On 7/23/24 12:54, Rob Herring wrote:
> > > > > On Tue, Jul 23, 2024 at 12:21 PM Lizhi Hou  wrote:
> > > > > > 
> > > > > > On 7/23/24 09:21, Rob Herring wrote:
> > > > > > > On Mon, Jul 15, 2024 at 01:52:30PM -0700, Lizhi Hou wrote:
> > > > > > > > On 7/15/24 11:55, Rob Herring wrote:
> > > > > > > > > On Mon, Jul 15, 2024 at 2:08 AM Amit Machhiwal 
> > > > > > > > >  wrote:
> > > > > > > > > > With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and 
> > > > > > > > > > hot-unplug sequence
> > > > > > > > > > of a PCI device attached to a PCI-bridge causes following 
> > > > > > > > > > kernel Oops on
> > > > > > > > > > a pseries KVM guest:
> > > > > > > > > > 
> > > > > > > > > >  RTAS: event: 2, Type: Hotplug Event (229), Severity: 1
> > > > > > > > > >  Kernel attempted to read user page (10ec0048) - 
> > > > > > > > > > exploit attempt? (uid: 0)
> > > > > > > > > >  BUG: Unable to handle kernel data access on read at 
> > > > > > > > > > 0x10ec0048
> > > > > > > > > >  Faulting instruction address: 0xc12d8728
> > > > > > > > > >  Oops: Kernel access of bad area, sig: 11 [#1]
> > > > > > > > > >  LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA 
> > > > > > > > > > pSeries
> > > > > > > > > > 
> > > > > > > > > >  NIP [c12d8728] 
> > > > > > > > > > __of_changeset_entry_invert+0x10/0x1ac
> > > > > > > > > >  LR [c12da7f0] 
> > > > > > > > > > __of_changeset_revert_entries+0x98/0x180
> > > > > > > > > >  Call Trace:
> > > > > > > > > >  [cbcc3970] [c12daa60] 
> > > > > > > > > > of_changeset_revert+0x58/0xd8
> > > > > > > > > >  [cbcc39c0] [c0d0ed78] 
> > > > > > > > > > of_pci_remove_node+0x74/0xb0
> > > > > > > > > >  [cbcc39f0] [c0cdcfe0] 
> > > > > > > > > > pci_stop_bus_device+0xf4/0x138
> > > > > > > > > >  [cbcc3a30] [c0cdd140] 
> > > > > > > > > > pci_stop_and_remove_bus_device_locked+0x34/0x64
> > > > > > > > > >  [cbcc3a60] [c0cf3780] 
> > > > > > > > > > remove_store+0xf0/0x108
> > > > > > > > > >  [cbcc3ab0] [c0e89e04] 
> > > > > > > > > > dev_attr_store+0x34/0x78
> > > > > > > > > >  [cbcc3ad0] [c07f8dd4] 
> > > > > > > > > > sysfs_kf_write+0x70/0xa4
> > > > > > > > > >  [cbcc3af0] [c07f7248] 
> > > > > > > > > > kernfs_fop_write_iter+0x1d0/0x2e0
> > > > > > > > > >  [cbcc3b40] [c06c9b08] 
> > > > > > > > > > vfs_write+0x27c/0x558
> > > > > > > > > >  [cbcc3bf0] [c06ca168] 
> > > > > > > > > > ksys_write+0x90/0x170
> > > > > > > > > >  [cbcc3c40] [c0033248] 
> > > > > > > > > > system_call_exception+0xf8/0x290
> > > > > > > > > >  [cbcc3e50] [c000d05c] 
> > > > > > > > > > system_call_vectored_common+0x15c/0x2ec
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > A 

Re: [PATCH v2] PCI: Fix crash during pci_dev hot-unplug on pseries KVM guest

2024-07-29 Thread Amit Machhiwal
Hi Lizhi,

On 2024/07/29 09:47 AM, Lizhi Hou wrote:
> Hi Amit
> 
> On 7/29/24 04:13, Amit Machhiwal wrote:
> > Hi Lizhi,
> > 
> > On 2024/07/26 11:45 AM, Lizhi Hou wrote:
> > > On 7/26/24 10:52, Rob Herring wrote:
> > > > On Thu, Jul 25, 2024 at 6:06 PM Lizhi Hou  wrote:
> > > > > Hi Amit,
> > > > > 
> > > > > 
> > > > > I try to follow the option which add a OF flag. If Rob is ok with 
> > > > > this,
> > > > > I would suggest to use it instead of V1 patch
> > > > > 
> > > > > diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
> > > > > index dda6092e6d3a..a401ed0463d9 100644
> > > > > --- a/drivers/of/dynamic.c
> > > > > +++ b/drivers/of/dynamic.c
> > > > > @@ -382,6 +382,11 @@ void of_node_release(struct kobject *kobj)
> > > > >   __func__, node);
> > > > >}
> > > > > 
> > > > > +   if (of_node_check_flag(node, OF_CREATED_WITH_CSET)) {
> > > > > +   of_changeset_revert(node->data);
> > > > > +   of_changeset_destroy(node->data);
> > > > > +   }
> > > > What happens if multiple nodes are created in the changeset?
> > > Ok. multiple nodes will not work.
> > > > > +
> > > > >if (node->child)
> > > > >pr_err("ERROR: %s() unexpected children for 
> > > > > %pOF/%s\n",
> > > > >__func__, node->parent, node->full_name);
> > > > > @@ -507,6 +512,7 @@ struct device_node 
> > > > > *of_changeset_create_node(struct
> > > > > of_changeset *ocs,
> > > > >np = __of_node_dup(NULL, full_name);
> > > > >if (!np)
> > > > >return NULL;
> > > > > +   of_node_set_flag(np, OF_CREATED_WITH_CSET);
> > > > This should be set where the data ptr is set.
> > > Ok. It sounds the fix could be simplified to 3 lines change.
> > Thanks for the patch. The hot-plug and hot-unplug of PCI device seem to work
> > fine as expected. I see this patch would attempt to remove only the nodes 
> > which
> > were created in `of_pci_make_dev_node()` with the help of the newly 
> > introduced
> > flag, which looks good to me.
> > 
> > Also, since a call to `of_pci_make_dev_node()` from `pci_bus_add_device()`, 
> > that
> > creates devices nodes only for bridge devices, is conditional on
> > `pci_is_bridge()`, it only makes sense to retain the logical symmetry and 
> > call
> > `of_pci_remove_node()` conditionally on `pci_is_bridge()` as well in
> > `pci_stop_dev()`. Hence, I would like to propose the below change along 
> > with the
> > above patch:
> > 
> > diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
> > index 910387e5bdbf..c6394bf562cd 100644
> > --- a/drivers/pci/remove.c
> > +++ b/drivers/pci/remove.c
> > @@ -23,7 +23,8 @@ static void pci_stop_dev(struct pci_dev *dev)
> >  device_release_driver(&dev->dev);
> >  pci_proc_detach_device(dev);
> >  pci_remove_sysfs_dev_files(dev);
> > -   of_pci_remove_node(dev);
> > +   if (pci_is_bridge(dev))
> > +   of_pci_remove_node(dev);
> >  pci_dev_assign_added(dev, false);
> >  }
> > 
> > Please let me know of your thoughts on this and based on that I can spin 
> > the v3
> > of this patch.
> 
> As I mentioned, there are endpoints in pci quirks (pci/quirks.c) will also
> create nodes by of_pci_make_dev_node(). So please remove above two lines.

Sorry if I'm misinterpreting something here but as I mentioned,
`of_pci_make_dev_node()` is called only for bridge devices with check performed
via `pci_is_bridge()`, could you please elaborate more on why the same check
can't be put while removing the node via `of_pci_remove_node()`?

Thanks,
Amit

> 
> Thanks,
> 
> Lizhi
> 
> > 
> > In addition to this, can this patch be taken as part of 6.11 as a bug fix?
> > 
> > Thanks,
> > Amit
> > 
> > > 
> > > diff --git a/drivers/pci/of.c b/drivers/pci/of.c
> > > index 51e3dd0ea5ab..0b3ba1e1b18c 100644
> > > --- a/drivers/pci/of.c
> > > +++ b/drivers/pci/of.c
> > > @@ -613,7 +613,7 @@ void of_pci_remove_node(str

Re: [PATCH v2] PCI: Fix crash during pci_dev hot-unplug on pseries KVM guest

2024-07-29 Thread Amit Machhiwal
Hi Lizhi,

On 2024/07/26 11:45 AM, Lizhi Hou wrote:
> 
> On 7/26/24 10:52, Rob Herring wrote:
> > On Thu, Jul 25, 2024 at 6:06 PM Lizhi Hou  wrote:
> > > Hi Amit,
> > > 
> > > 
> > > I try to follow the option which add a OF flag. If Rob is ok with this,
> > > I would suggest to use it instead of V1 patch
> > > 
> > > diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
> > > index dda6092e6d3a..a401ed0463d9 100644
> > > --- a/drivers/of/dynamic.c
> > > +++ b/drivers/of/dynamic.c
> > > @@ -382,6 +382,11 @@ void of_node_release(struct kobject *kobj)
> > >  __func__, node);
> > >   }
> > > 
> > > +   if (of_node_check_flag(node, OF_CREATED_WITH_CSET)) {
> > > +   of_changeset_revert(node->data);
> > > +   of_changeset_destroy(node->data);
> > > +   }
> > What happens if multiple nodes are created in the changeset?
> Ok. multiple nodes will not work.
> > 
> > > +
> > >   if (node->child)
> > >   pr_err("ERROR: %s() unexpected children for %pOF/%s\n",
> > >   __func__, node->parent, node->full_name);
> > > @@ -507,6 +512,7 @@ struct device_node *of_changeset_create_node(struct
> > > of_changeset *ocs,
> > >   np = __of_node_dup(NULL, full_name);
> > >   if (!np)
> > >   return NULL;
> > > +   of_node_set_flag(np, OF_CREATED_WITH_CSET);
> > This should be set where the data ptr is set.
> 
> Ok. It sounds the fix could be simplified to 3 lines change.

Thanks for the patch. The hot-plug and hot-unplug of PCI device seem to work
fine as expected. I see this patch would attempt to remove only the nodes which
were created in `of_pci_make_dev_node()` with the help of the newly introduced
flag, which looks good to me.

Also, since a call to `of_pci_make_dev_node()` from `pci_bus_add_device()`, that
creates devices nodes only for bridge devices, is conditional on
`pci_is_bridge()`, it only makes sense to retain the logical symmetry and call
`of_pci_remove_node()` conditionally on `pci_is_bridge()` as well in
`pci_stop_dev()`. Hence, I would like to propose the below change along with the
above patch:

diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
index 910387e5bdbf..c6394bf562cd 100644
--- a/drivers/pci/remove.c
+++ b/drivers/pci/remove.c
@@ -23,7 +23,8 @@ static void pci_stop_dev(struct pci_dev *dev)
device_release_driver(&dev->dev);
pci_proc_detach_device(dev);
pci_remove_sysfs_dev_files(dev);
-   of_pci_remove_node(dev);
+   if (pci_is_bridge(dev))
+   of_pci_remove_node(dev);
 
pci_dev_assign_added(dev, false);
}

Please let me know of your thoughts on this and based on that I can spin the v3
of this patch.

In addition to this, can this patch be taken as part of 6.11 as a bug fix?

Thanks,
Amit

> 
> 
> diff --git a/drivers/pci/of.c b/drivers/pci/of.c
> index 51e3dd0ea5ab..0b3ba1e1b18c 100644
> --- a/drivers/pci/of.c
> +++ b/drivers/pci/of.c
> @@ -613,7 +613,7 @@ void of_pci_remove_node(struct pci_dev *pdev)
>     struct device_node *np;
> 
>     np = pci_device_to_OF_node(pdev);
> -   if (!np || !of_node_check_flag(np, OF_DYNAMIC))
> +   if (!np || !of_node_check_flag(np, OF_CREATED_WITH_CSET))
>     return;
>     pdev->dev.of_node = NULL;
> 
> @@ -672,6 +672,7 @@ void of_pci_make_dev_node(struct pci_dev *pdev)
>     if (ret)
>     goto out_free_node;
> 
> +   of_node_set_flag(np, OF_CREATED_WITH_CSET);
>     np->data = cset;
>     pdev->dev.of_node = np;
>     kfree(name);
> diff --git a/include/linux/of.h b/include/linux/of.h
> index a0bedd038a05..a46317f6626e 100644
> --- a/include/linux/of.h
> +++ b/include/linux/of.h
> @@ -153,6 +153,7 @@ extern struct device_node *of_stdout;
>  #define OF_POPULATED_BUS   4 /* platform bus created for children */
>  #define OF_OVERLAY 5 /* allocated for an overlay */
>  #define OF_OVERLAY_FREE_CSET   6 /* in overlay cset being freed */
> +#define OF_CREATED_WITH_CSET    7 /* created by of_changeset_create_node */
> 
>  #define OF_BAD_ADDR    ((u64)-1)
> 
> 
> Lizhi
> 
> > 
> > Rob


Re: [PATCH v2] PCI: Fix crash during pci_dev hot-unplug on pseries KVM guest

2024-07-25 Thread Amit Machhiwal
Hi Bjorn,

On 2024/07/25 03:55 PM, Bjorn Helgaas wrote:
> On Thu, Jul 25, 2024 at 11:15:39PM +0530, Amit Machhiwal wrote:
> > ...
> > The crash in question is a critical issue that we would want to have
> > a fix for soon. And while this is still being figured out, is it
> > okay to go with the fix I proposed in the V1 of this patch?
> 
> v6.10 has been released already, and it will be a couple months before
> the v6.11 release.
> 
> It looks like the regression is 407d1a51921e, which appeared in v6.6,
> almost a year ago, so it's fairly old.
> 
> What target are you thinking about for the V1 patch?  I guess if we
> add it as a v6.11 post-merge window fix, it might get backported to
> stable kernels before v6.11?  

Yes, I think we can go ahead with taking V1 patch for v6.11 post-merge window to
fix the current bug and ask Ubuntu to pick it while Lizhi's proposed patch goes
under test and review.

Thanks,
Amit


Re: [PATCH v2] PCI: Fix crash during pci_dev hot-unplug on pseries KVM guest

2024-07-25 Thread Amit Machhiwal
Hi Lizhi, Rob,

Sorry for responding late. I got busy with some other things.

On 2024/07/23 02:08 PM, Lizhi Hou wrote:
> 
> On 7/23/24 12:54, Rob Herring wrote:
> > On Tue, Jul 23, 2024 at 12:21 PM Lizhi Hou  wrote:
> > > 
> > > On 7/23/24 09:21, Rob Herring wrote:
> > > > On Mon, Jul 15, 2024 at 01:52:30PM -0700, Lizhi Hou wrote:
> > > > > On 7/15/24 11:55, Rob Herring wrote:
> > > > > > On Mon, Jul 15, 2024 at 2:08 AM Amit Machhiwal 
> > > > > >  wrote:
> > > > > > > With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug 
> > > > > > > sequence
> > > > > > > of a PCI device attached to a PCI-bridge causes following kernel 
> > > > > > > Oops on
> > > > > > > a pseries KVM guest:
> > > > > > > 
> > > > > > > RTAS: event: 2, Type: Hotplug Event (229), Severity: 1
> > > > > > > Kernel attempted to read user page (10ec0048) - exploit 
> > > > > > > attempt? (uid: 0)
> > > > > > > BUG: Unable to handle kernel data access on read at 
> > > > > > > 0x10ec0048
> > > > > > > Faulting instruction address: 0xc12d8728
> > > > > > > Oops: Kernel access of bad area, sig: 11 [#1]
> > > > > > > LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
> > > > > > > 
> > > > > > > NIP [c12d8728] __of_changeset_entry_invert+0x10/0x1ac
> > > > > > > LR [c12da7f0] __of_changeset_revert_entries+0x98/0x180
> > > > > > > Call Trace:
> > > > > > > [cbcc3970] [c12daa60] 
> > > > > > > of_changeset_revert+0x58/0xd8
> > > > > > > [cbcc39c0] [c0d0ed78] 
> > > > > > > of_pci_remove_node+0x74/0xb0
> > > > > > > [cbcc39f0] [c0cdcfe0] 
> > > > > > > pci_stop_bus_device+0xf4/0x138
> > > > > > > [cbcc3a30] [c0cdd140] 
> > > > > > > pci_stop_and_remove_bus_device_locked+0x34/0x64
> > > > > > > [cbcc3a60] [c0cf3780] remove_store+0xf0/0x108
> > > > > > > [cbcc3ab0] [c0e89e04] dev_attr_store+0x34/0x78
> > > > > > > [cbcc3ad0] [c07f8dd4] sysfs_kf_write+0x70/0xa4
> > > > > > > [cbcc3af0] [c07f7248] 
> > > > > > > kernfs_fop_write_iter+0x1d0/0x2e0
> > > > > > > [cbcc3b40] [c06c9b08] vfs_write+0x27c/0x558
> > > > > > > [cbcc3bf0] [c06ca168] ksys_write+0x90/0x170
> > > > > > > [cbcc3c40] [c0033248] 
> > > > > > > system_call_exception+0xf8/0x290
> > > > > > > [cbcc3e50] [c000d05c] 
> > > > > > > system_call_vectored_common+0x15c/0x2ec
> > > > > > > 
> > > > > > > 
> > > > > > > A git bisect pointed this regression to be introduced via [1] 
> > > > > > > that added
> > > > > > > a mechanism to create device tree nodes for parent PCI bridges 
> > > > > > > when a
> > > > > > > PCI device is hot-plugged.
> > > > > > > 
> > > > > > > The Oops is caused when `pci_stop_dev()` tries to remove a 
> > > > > > > non-existing
> > > > > > > device-tree node associated with the pci_dev that was earlier
> > > > > > > hot-plugged and was attached under a pci-bridge. The PCI dev 
> > > > > > > header
> > > > > > > `dev->hdr_type` being 0, results a conditional check done with
> > > > > > > `pci_is_bridge()` into false. Consequently, a call to
> > > > > > > `of_pci_make_dev_node()` to create a device node is never made. 
> > > > > > > When at
> > > > > > > a later point in time, in the device node removal path, a memcpy 
> > > > > > > is
> > > > > > > attempted in `__of_changeset_entry_invert()`; since the device 
> > > > > > > node was
> > > > > > > never created, results in an Oops due to kernel read access to a 
> > > >

[PATCH v2] PCI: Fix crash during pci_dev hot-unplug on pseries KVM guest

2024-07-15 Thread Amit Machhiwal
With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence
of a PCI device attached to a PCI-bridge causes following kernel Oops on
a pseries KVM guest:

 RTAS: event: 2, Type: Hotplug Event (229), Severity: 1
 Kernel attempted to read user page (10ec0048) - exploit attempt? (uid: 0)
 BUG: Unable to handle kernel data access on read at 0x10ec0048
 Faulting instruction address: 0xc12d8728
 Oops: Kernel access of bad area, sig: 11 [#1]
 LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries

 NIP [c12d8728] __of_changeset_entry_invert+0x10/0x1ac
 LR [c12da7f0] __of_changeset_revert_entries+0x98/0x180
 Call Trace:
 [cbcc3970] [c12daa60] of_changeset_revert+0x58/0xd8
 [cbcc39c0] [c0d0ed78] of_pci_remove_node+0x74/0xb0
 [cbcc39f0] [c0cdcfe0] pci_stop_bus_device+0xf4/0x138
 [cbcc3a30] [c0cdd140] 
pci_stop_and_remove_bus_device_locked+0x34/0x64
 [cbcc3a60] [c0cf3780] remove_store+0xf0/0x108
 [cbcc3ab0] [c0e89e04] dev_attr_store+0x34/0x78
 [cbcc3ad0] [c07f8dd4] sysfs_kf_write+0x70/0xa4
 [cbcc3af0] [c07f7248] kernfs_fop_write_iter+0x1d0/0x2e0
 [cbcc3b40] [c06c9b08] vfs_write+0x27c/0x558
 [cbcc3bf0] [c06ca168] ksys_write+0x90/0x170
 [cbcc3c40] [c0033248] system_call_exception+0xf8/0x290
 [cbcc3e50] [c000d05c] system_call_vectored_common+0x15c/0x2ec


A git bisect pointed this regression to be introduced via [1] that added
a mechanism to create device tree nodes for parent PCI bridges when a
PCI device is hot-plugged.

The Oops is caused when `pci_stop_dev()` tries to remove a non-existing
device-tree node associated with the pci_dev that was earlier
hot-plugged and was attached under a pci-bridge. The PCI dev header
`dev->hdr_type` being 0, results a conditional check done with
`pci_is_bridge()` into false. Consequently, a call to
`of_pci_make_dev_node()` to create a device node is never made. When at
a later point in time, in the device node removal path, a memcpy is
attempted in `__of_changeset_entry_invert()`; since the device node was
never created, results in an Oops due to kernel read access to a bad
address.

To fix this issue, the patch updates `of_changeset_create_node()` to
allocate a new node only when the device node doesn't exist and init it
in case it does already. Also, introduce `of_pci_free_node()` to be
called to only revert and destroy the changeset device node that was
created via a call to `of_changeset_create_node()`.

[1] commit 407d1a51921e ("PCI: Create device tree node for bridge")

Fixes: 407d1a51921e ("PCI: Create device tree node for bridge")
Reported-by: Kowshik Jois B S 
Signed-off-by: Lizhi Hou 
Signed-off-by: Amit Machhiwal 
---
Changes since v1:
* Included Lizhi's suggested changes on V1
* Fixed below two warnings from Lizhi's changes and rearranged the cleanup
  part a bit in `of_pci_make_dev_node`
drivers/pci/of.c:611:6: warning: no previous prototype for 
‘of_pci_free_node’ [-Wmissing-prototypes]
  611 | void of_pci_free_node(struct device_node *np)
  |  ^~~~   
drivers/pci/of.c: In function ‘of_pci_make_dev_node’:
drivers/pci/of.c:696:1: warning: label ‘out_destroy_cset’ defined but 
not used [-Wunused-label]
  696 | out_destroy_cset:   
  | ^~~~  
* V1: 
https://lore.kernel.org/all/20240703141634.2974589-1-amach...@linux.ibm.com/

 drivers/of/dynamic.c  | 16 
 drivers/of/unittest.c |  2 +-
 drivers/pci/bus.c |  3 +--
 drivers/pci/of.c  | 39 ++-
 drivers/pci/pci.h |  2 ++
 include/linux/of.h|  1 +
 6 files changed, 43 insertions(+), 20 deletions(-)

diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
index dda6092e6d3a..9bba5e82a384 100644
--- a/drivers/of/dynamic.c
+++ b/drivers/of/dynamic.c
@@ -492,21 +492,29 @@ struct device_node *__of_node_dup(const struct 
device_node *np,
  * a given changeset.
  *
  * @ocs: Pointer to changeset
+ * @np: Pointer to device node. If null, allocate a new node. If not, init an
+ * existing one.
  * @parent: Pointer to parent device node
  * @full_name: Node full name
  *
  * Return: Pointer to the created device node or NULL in case of an error.
  */
 struct device_node *of_changeset_create_node(struct of_changeset *ocs,
+struct device_node *np,
 struct device_node *parent,
 const char *full_name)
 {
-   struct device_node *np;
int ret;
 
-   np = __of_node_dup(NULL, full_name);
-   if (!np)
-   return NULL;
+   if (!np) {
+   np = __of_node_dup(NULL, full_name);
+   if (!n

Re: [PATCH] PCI: Fix crash during pci_dev hot-unplug on pseries KVM guest

2024-07-12 Thread Amit Machhiwal
Hi Lizhi,

On 2024/07/11 02:18 PM, Lizhi Hou wrote:
> 
> On 7/11/24 11:48, Amit Machhiwal wrote:
> > On 2024/07/10 09:48 PM, Lizhi Hou wrote:
> > > On 7/5/24 12:20, Bjorn Helgaas wrote:
> > > > [+cc Lukas, FYI]
> > > > 
> > > > On Wed, Jul 03, 2024 at 07:46:34PM +0530, Amit Machhiwal wrote:
> > > > > With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug 
> > > > > sequence
> > > > > of a PCI device attached to a PCI-bridge causes following kernel Oops 
> > > > > on
> > > > > a pseries KVM guest:
> > > > > 
> > > > >RTAS: event: 2, Type: Hotplug Event (229), Severity: 1
> > > > >Kernel attempted to read user page (10ec0048) - exploit 
> > > > > attempt? (uid: 0)
> > > > >BUG: Unable to handle kernel data access on read at 0x10ec0048
> > > > >Faulting instruction address: 0xc12d8728
> > > > >Oops: Kernel access of bad area, sig: 11 [#1]
> > > > >LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
> > > > > 
> > > > >NIP [c12d8728] __of_changeset_entry_invert+0x10/0x1ac
> > > > >LR [c12da7f0] __of_changeset_revert_entries+0x98/0x180
> > > > >Call Trace:
> > > > >[cbcc3970] [c12daa60] of_changeset_revert+0x58/0xd8
> > > > >[cbcc39c0] [c0d0ed78] of_pci_remove_node+0x74/0xb0
> > > > >[cbcc39f0] [c0cdcfe0] 
> > > > > pci_stop_bus_device+0xf4/0x138
> > > > >[cbcc3a30] [c0cdd140] 
> > > > > pci_stop_and_remove_bus_device_locked+0x34/0x64
> > > > >[cbcc3a60] [c0cf3780] remove_store+0xf0/0x108
> > > > >[cbcc3ab0] [c0e89e04] dev_attr_store+0x34/0x78
> > > > >[cbcc3ad0] [c07f8dd4] sysfs_kf_write+0x70/0xa4
> > > > >[cbcc3af0] [c07f7248] 
> > > > > kernfs_fop_write_iter+0x1d0/0x2e0
> > > > >[cbcc3b40] [c06c9b08] vfs_write+0x27c/0x558
> > > > >[cbcc3bf0] [c06ca168] ksys_write+0x90/0x170
> > > > >[cbcc3c40] [c0033248] 
> > > > > system_call_exception+0xf8/0x290
> > > > >[cbcc3e50] [c000d05c] 
> > > > > system_call_vectored_common+0x15c/0x2ec
> > > > > 
> > > > > 
> > > > > A git bisect pointed this regression to be introduced via [1] that 
> > > > > added
> > > > > a mechanism to create device tree nodes for parent PCI bridges when a
> > > > > PCI device is hot-plugged.
> > > > > 
> > > > > The Oops is caused when `pci_stop_dev()` tries to remove a 
> > > > > non-existing
> > > > > device-tree node associated with the pci_dev that was earlier
> > > > > hot-plugged and was attached under a pci-bridge. The PCI dev header
> > > > > `dev->hdr_type` being 0, results a conditional check done with
> > > > > `pci_is_bridge()` into false. Consequently, a call to
> > > > > `of_pci_make_dev_node()` to create a device node is never made. When 
> > > > > at
> > > > > a later point in time, in the device node removal path, a memcpy is
> > > > > attempted in `__of_changeset_entry_invert()`; since the device node 
> > > > > was
> > > > > never created, results in an Oops due to kernel read access to a bad
> > > > > address.
> > > > > 
> > > > > To fix this issue the patch updates `pci_stop_dev()` to ensure that a
> > > > > call to `of_pci_remove_node()` is only made for pci-bridge devices.
> > > > > 
> > > > > [1] commit 407d1a51921e ("PCI: Create device tree node for bridge")
> > > > > 
> > > > > Fixes: 407d1a51921e ("PCI: Create device tree node for bridge")
> > > > > Reported-by: Kowshik Jois B S 
> > > > > Tested-by: Kowshik Jois B S 
> > > > > Signed-off-by: Amit Machhiwal 
> > > > Thanks for the patch and testing!  Would like a reviewed-by from
> > > > Lizhi.
> > > of_pci_make_dev_node() will create of nodes for some endpoint devices 
> > > (not a
> > > bridge) as well. And actually this is the main purpose.
> > > 
> &

Re: [PATCH] PCI: Fix crash during pci_dev hot-unplug on pseries KVM guest

2024-07-11 Thread Amit Machhiwal
On 2024/07/10 09:48 PM, Lizhi Hou wrote:
> 
> On 7/5/24 12:20, Bjorn Helgaas wrote:
> > [+cc Lukas, FYI]
> > 
> > On Wed, Jul 03, 2024 at 07:46:34PM +0530, Amit Machhiwal wrote:
> > > With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence
> > > of a PCI device attached to a PCI-bridge causes following kernel Oops on
> > > a pseries KVM guest:
> > > 
> > >   RTAS: event: 2, Type: Hotplug Event (229), Severity: 1
> > >   Kernel attempted to read user page (10ec0048) - exploit attempt? 
> > > (uid: 0)
> > >   BUG: Unable to handle kernel data access on read at 0x10ec0048
> > >   Faulting instruction address: 0xc12d8728
> > >   Oops: Kernel access of bad area, sig: 11 [#1]
> > >   LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries
> > > 
> > >   NIP [c12d8728] __of_changeset_entry_invert+0x10/0x1ac
> > >   LR [c12da7f0] __of_changeset_revert_entries+0x98/0x180
> > >   Call Trace:
> > >   [cbcc3970] [c12daa60] of_changeset_revert+0x58/0xd8
> > >   [cbcc39c0] [c0d0ed78] of_pci_remove_node+0x74/0xb0
> > >   [cbcc39f0] [c0cdcfe0] pci_stop_bus_device+0xf4/0x138
> > >   [cbcc3a30] [c0cdd140] 
> > > pci_stop_and_remove_bus_device_locked+0x34/0x64
> > >   [cbcc3a60] [c0cf3780] remove_store+0xf0/0x108
> > >   [cbcc3ab0] [c0e89e04] dev_attr_store+0x34/0x78
> > >   [cbcc3ad0] [c07f8dd4] sysfs_kf_write+0x70/0xa4
> > >   [cbcc3af0] [c07f7248] kernfs_fop_write_iter+0x1d0/0x2e0
> > >   [cbcc3b40] [c06c9b08] vfs_write+0x27c/0x558
> > >   [cbcc3bf0] [c06ca168] ksys_write+0x90/0x170
> > >   [cbcc3c40] [c0033248] system_call_exception+0xf8/0x290
> > >   [cbcc3e50] [c000d05c] 
> > > system_call_vectored_common+0x15c/0x2ec
> > > 
> > > 
> > > A git bisect pointed this regression to be introduced via [1] that added
> > > a mechanism to create device tree nodes for parent PCI bridges when a
> > > PCI device is hot-plugged.
> > > 
> > > The Oops is caused when `pci_stop_dev()` tries to remove a non-existing
> > > device-tree node associated with the pci_dev that was earlier
> > > hot-plugged and was attached under a pci-bridge. The PCI dev header
> > > `dev->hdr_type` being 0, results a conditional check done with
> > > `pci_is_bridge()` into false. Consequently, a call to
> > > `of_pci_make_dev_node()` to create a device node is never made. When at
> > > a later point in time, in the device node removal path, a memcpy is
> > > attempted in `__of_changeset_entry_invert()`; since the device node was
> > > never created, results in an Oops due to kernel read access to a bad
> > > address.
> > > 
> > > To fix this issue the patch updates `pci_stop_dev()` to ensure that a
> > > call to `of_pci_remove_node()` is only made for pci-bridge devices.
> > > 
> > > [1] commit 407d1a51921e ("PCI: Create device tree node for bridge")
> > > 
> > > Fixes: 407d1a51921e ("PCI: Create device tree node for bridge")
> > > Reported-by: Kowshik Jois B S 
> > > Tested-by: Kowshik Jois B S 
> > > Signed-off-by: Amit Machhiwal 
> > Thanks for the patch and testing!  Would like a reviewed-by from
> > Lizhi.
> 
> of_pci_make_dev_node() will create of nodes for some endpoint devices (not a
> bridge) as well. And actually this is the main purpose.
> 
> Maybe the patch as below would resolve the Oops?

Thanks for the patch, Lizhi! I tried out this patch and don't see the issue with
the same. The hot-plug and hot-unplug of PCI device seem to work fine as
expected.

~ Amit

> 
> diff --git a/drivers/of/dynamic.c b/drivers/of/dynamic.c
> index dda6092e6d3a..3c693b091ecf 100644
> --- a/drivers/of/dynamic.c
> +++ b/drivers/of/dynamic.c
> @@ -492,21 +492,29 @@ struct device_node *__of_node_dup(const struct
> device_node *np,
>   * a given changeset.
>   *
>   * @ocs: Pointer to changeset
> + * @np: Pointer to device node. If it is not null, init it directly instead
> of
> + *  allocate a new node.
>   * @parent: Pointer to parent device node
>   * @full_name: Node full name
>   *
>   * Return: Pointer to the created device node or NULL in case of an error.
>   */
>  struct device_node *of_changeset_create_node(struct of_changeset *ocs,
> +    stru

Re: [PATCH] PCI: Fix crash during pci_dev hot-unplug on pseries KVM guest

2024-07-11 Thread Amit Machhiwal
Hi Rob,

On 2024/07/11 06:20 AM, Rob Herring wrote:
> On Wed, Jul 3, 2024 at 8:17 AM Amit Machhiwal  wrote:
> >
> > With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence
> > of a PCI device attached to a PCI-bridge causes following kernel Oops on
> > a pseries KVM guest:
> 
> Can I ask why you have this option on in the first place? Do you have
> a use for it or it's just a case of distros turn on every kconfig
> option.

Yes, this option is turned on in Ubuntu's distro kernel config where the issue
was originally reported, while Fedora is keeping this turned off.

root@ubuntu:~# cat /boot/config-6.8.0-38-generic | grep PCI_DYN
CONFIG_PCI_DYNAMIC_OF_NODES=y

root@fedora:~# cat /boot/config-6.9.7-200.fc40.ppc64le | grep PCI_DYN
# CONFIG_PCI_DYNAMIC_OF_NODES is not set

Thanks,
Amit

> 
> Rob


[PATCH] PCI: Fix crash during pci_dev hot-unplug on pseries KVM guest

2024-07-03 Thread Amit Machhiwal
With CONFIG_PCI_DYNAMIC_OF_NODES [1], a hot-plug and hot-unplug sequence
of a PCI device attached to a PCI-bridge causes following kernel Oops on
a pseries KVM guest:

 RTAS: event: 2, Type: Hotplug Event (229), Severity: 1
 Kernel attempted to read user page (10ec0048) - exploit attempt? (uid: 0)
 BUG: Unable to handle kernel data access on read at 0x10ec0048
 Faulting instruction address: 0xc12d8728
 Oops: Kernel access of bad area, sig: 11 [#1]
 LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries

 NIP [c12d8728] __of_changeset_entry_invert+0x10/0x1ac
 LR [c12da7f0] __of_changeset_revert_entries+0x98/0x180
 Call Trace:
 [cbcc3970] [c12daa60] of_changeset_revert+0x58/0xd8
 [cbcc39c0] [c0d0ed78] of_pci_remove_node+0x74/0xb0
 [cbcc39f0] [c0cdcfe0] pci_stop_bus_device+0xf4/0x138
 [cbcc3a30] [c0cdd140] 
pci_stop_and_remove_bus_device_locked+0x34/0x64
 [cbcc3a60] [c0cf3780] remove_store+0xf0/0x108
 [cbcc3ab0] [c0e89e04] dev_attr_store+0x34/0x78
 [cbcc3ad0] [c07f8dd4] sysfs_kf_write+0x70/0xa4
 [cbcc3af0] [c07f7248] kernfs_fop_write_iter+0x1d0/0x2e0
 [cbcc3b40] [c06c9b08] vfs_write+0x27c/0x558
 [cbcc3bf0] [c06ca168] ksys_write+0x90/0x170
 [cbcc3c40] [c0033248] system_call_exception+0xf8/0x290
 [cbcc3e50] [c000d05c] system_call_vectored_common+0x15c/0x2ec


A git bisect pointed this regression to be introduced via [1] that added
a mechanism to create device tree nodes for parent PCI bridges when a
PCI device is hot-plugged.

The Oops is caused when `pci_stop_dev()` tries to remove a non-existing
device-tree node associated with the pci_dev that was earlier
hot-plugged and was attached under a pci-bridge. The PCI dev header
`dev->hdr_type` being 0, results a conditional check done with
`pci_is_bridge()` into false. Consequently, a call to
`of_pci_make_dev_node()` to create a device node is never made. When at
a later point in time, in the device node removal path, a memcpy is
attempted in `__of_changeset_entry_invert()`; since the device node was
never created, results in an Oops due to kernel read access to a bad
address.

To fix this issue the patch updates `pci_stop_dev()` to ensure that a
call to `of_pci_remove_node()` is only made for pci-bridge devices.

[1] commit 407d1a51921e ("PCI: Create device tree node for bridge")

Fixes: 407d1a51921e ("PCI: Create device tree node for bridge")
Reported-by: Kowshik Jois B S 
Tested-by: Kowshik Jois B S 
Signed-off-by: Amit Machhiwal 
---
 drivers/pci/remove.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
index d749ea8250d6..4e51c64af416 100644
--- a/drivers/pci/remove.c
+++ b/drivers/pci/remove.c
@@ -22,7 +22,8 @@ static void pci_stop_dev(struct pci_dev *dev)
device_release_driver(&dev->dev);
pci_proc_detach_device(dev);
pci_remove_sysfs_dev_files(dev);
-   of_pci_remove_node(dev);
+   if (pci_is_bridge(dev))
+   of_pci_remove_node(dev);
 
pci_dev_assign_added(dev, false);
}

base-commit: e9d22f7a6655941fc8b2b942ed354ec780936b3e
-- 
2.45.2



[PATCH v4] KVM: PPC: Book3S HV: Fix L2 guest reboot failure due to empty 'arch_compat'

2024-02-06 Thread Amit Machhiwal
Currently, rebooting a pseries nested qemu-kvm guest (L2) results in
below error as L1 qemu sends PVR value 'arch_compat' == 0 via
ppc_set_compat ioctl. This triggers a condition failure in
kvmppc_set_arch_compat() resulting in an EINVAL.

qemu-system-ppc64: Unable to set CPU compatibility mode in KVM: Invalid
argument

Also, a value of 0 for arch_compat generally refers the default
compatibility of the host. But, arch_compat, being a Guest Wide Element
in nested API v2, cannot be set to 0 in GSB as PowerVM (L0) expects a
non-zero value. A value of 0 triggers a kernel trap during a reboot and
consequently causes it to fail:

[   22.106360] reboot: Restarting system
KVM: unknown exit, hardware reason ffea
NIP 0100   LR fe44 CTR  XER 
20040092 CPU#0
MSR 1000 HID0   HF 6c00 iidx 3 didx 3
TB   DECR 0
GPR00   c2a8c300 7fe0
GPR04   1002 82803033
GPR08 0a00  0004 2fff
GPR12  c2e1 000105639200 0004
GPR16  00010563a090  
GPR20 000105639e20 0001056399c8 7fffe54abab0 000105639288
GPR24  0001 0001 
GPR28   c2b30840 
CR   [ -  -  -  -  -  -  -  -  ] RES 000@
 SRR0   SRR1 PVR 00800200 VRSAVE 

SPRG0  SPRG1   SPRG2   SPRG3 

SPRG4  SPRG5   SPRG6   SPRG7 

HSRR0  HSRR1 
 CFAR 
 LPCR 00020400
 PTCR    DAR   DSISR 

 kernel:trap=0xffea | pc=0x100 | msr=0x1000

This patch updates kvmppc_set_arch_compat() to use the host PVR value if
'compat_pvr' == 0 indicating that qemu doesn't want to enforce any
specific PVR compat mode.

The relevant part of the code might need a rework if PowerVM implements
a support for `arch_compat == 0` in nestedv2 API.

Fixes: 19d31c5f1157 ("KVM: PPC: Add support for nestedv2 guests")
Reviewed-by: Aneesh Kumar K.V (IBM) 
Reviewed-by: Vaibhav Jain 
Signed-off-by: Amit Machhiwal 
---

Changes v3 -> v4:
- Moved part of a code comment around implementation of `arch_compat
  == 0` in PowerVM to the patch description based on an off mailing
  list discussion

Changes v2 -> v3:
- Vaibhav: Use a 'break' instead of switch-case fallthrough
- v2: 
https://lore.kernel.org/all/20240205132607.2776637-1-amach...@linux.ibm.com/

Changes v1 -> v2:
- Added descriptive error log in the patch description when
  `arch_compat == 0` passed in GSB
- Added a helper function for PCR to capabilities mapping
- Added relevant comments around the changes being made
- v1: 
https://lore.kernel.org/lkml/20240118095653.2588129-1-amach...@linux.ibm.com/

 arch/powerpc/kvm/book3s_hv.c  | 26 --
 arch/powerpc/kvm/book3s_hv_nestedv2.c | 20 ++--
 2 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 52427fc2a33f..0b921704da45 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -391,6 +391,24 @@ static void kvmppc_set_pvr_hv(struct kvm_vcpu *vcpu, u32 
pvr)
 /* Dummy value used in computing PCR value below */
 #define PCR_ARCH_31(PCR_ARCH_300 << 1)
 
+static inline unsigned long map_pcr_to_cap(unsigned long pcr)
+{
+   unsigned long cap = 0;
+
+   switch (pcr) {
+   case PCR_ARCH_300:
+   cap = H_GUEST_CAP_POWER9;
+   break;
+   case PCR_ARCH_31:
+   cap = H_GUEST_CAP_POWER10;
+   break;
+   default:
+   break;
+   }
+
+   return cap;
+}
+
 static int kvmppc_set_arch_compat(struct kvm_vcpu *vcpu, u32 arch_compat)
 {
unsigned long host_pcr_bit = 0, guest_pcr_bit = 0, cap = 0;
@@ -424,11 +442,9 @@ static int kvmppc_set_arch_compat(struct kvm_vcpu *vcpu, 
u32 arch_compat)
break;
case PVR_ARCH_300:
guest_pcr_bit = PCR_ARCH_300;
-   cap = H_GUEST_CAP_POWER9;
break;
case PVR_ARCH_31:
guest_pcr_bit = PCR_ARCH_31;
-   cap = H_GUEST_CAP_POWER10;
break;
default:
return -EINVAL;
@@ -440,6 +456,12 @@ static int kvmppc_set_arch_compat(struct kvm_vcpu *vcpu, 
u32 arch_compat)
  

Re: Re: [PATCH v3] KVM: PPC: Book3S HV: Fix L2 guest reboot failure due to empty 'arch_compat'

2024-02-06 Thread Amit Machhiwal
Hi Michael,

Thanks for looking into the patch and your comments.

On 2024/02/06 09:09 PM, Michael Ellerman wrote:
> Hi Amit,
> 
> One comment below ...
> 
> Amit Machhiwal  writes:
> > Currently, rebooting a pseries nested qemu-kvm guest (L2) results in
> > below error as L1 qemu sends PVR value 'arch_compat' == 0 via
> > ppc_set_compat ioctl. This triggers a condition failure in
> > kvmppc_set_arch_compat() resulting in an EINVAL.
> ...
> > 
> > diff --git a/arch/powerpc/kvm/book3s_hv_nestedv2.c 
> > b/arch/powerpc/kvm/book3s_hv_nestedv2.c
> > index 5378eb40b162..6042bdc70230 100644
> > --- a/arch/powerpc/kvm/book3s_hv_nestedv2.c
> > +++ b/arch/powerpc/kvm/book3s_hv_nestedv2.c
> > @@ -347,8 +348,26 @@ static int gs_msg_ops_vcpu_fill_info(struct 
> > kvmppc_gs_buff *gsb,
> > break;
> > }
> > case KVMPPC_GSID_LOGICAL_PVR:
> > -   rc = kvmppc_gse_put_u32(gsb, iden,
> > -   vcpu->arch.vcore->arch_compat);
> > +   /*
> > +* Though 'arch_compat == 0' would mean the default
> > +* compatibility, arch_compat, being a Guest Wide
> > +* Element, cannot be filled with a value of 0 in GSB
> > +* as this would result into a kernel trap.
> > +* Hence, when `arch_compat == 0`, arch_compat should
> > +* default to L1's PVR.
> > +*
> > +* Rework this when PowerVM supports a value of 0
> > +* for arch_compat for KVM API v2.
> > +*/
> 
> Is there an actual plan that PowerVM will support this in future?
> 
> If so, how will a future kernel know that it's running on a version of
> PowerVM that does support arch_compat == 0?
> 
> Similarly how will we know when it's OK to drop support for this
> workaround?

I'm sending a v4 based on an off mailing list discussion.

> 
> cheers

~Amit


[PATCH v3] KVM: PPC: Book3S HV: Fix L2 guest reboot failure due to empty 'arch_compat'

2024-02-05 Thread Amit Machhiwal
Currently, rebooting a pseries nested qemu-kvm guest (L2) results in
below error as L1 qemu sends PVR value 'arch_compat' == 0 via
ppc_set_compat ioctl. This triggers a condition failure in
kvmppc_set_arch_compat() resulting in an EINVAL.

qemu-system-ppc64: Unable to set CPU compatibility mode in KVM: Invalid
argument

Also, a value of 0 for arch_compat generally refers the default
compatibility of the host. But, arch_compat, being a Guest Wide Element
in nested API v2, cannot be set to 0 in GSB as PowerVM (L0) expects a
non-zero value. A value of 0 triggers a kernel trap during a reboot and
consequently causes it to fail:

[   22.106360] reboot: Restarting system
KVM: unknown exit, hardware reason ffea
NIP 0100   LR fe44 CTR  XER 
20040092 CPU#0
MSR 1000 HID0   HF 6c00 iidx 3 didx 3
TB   DECR 0
GPR00   c2a8c300 7fe0
GPR04   1002 82803033
GPR08 0a00  0004 2fff
GPR12  c2e1 000105639200 0004
GPR16  00010563a090  
GPR20 000105639e20 0001056399c8 7fffe54abab0 000105639288
GPR24  0001 0001 
GPR28   c2b30840 
CR   [ -  -  -  -  -  -  -  -  ] RES 000@
 SRR0   SRR1 PVR 00800200 VRSAVE 

SPRG0  SPRG1   SPRG2   SPRG3 

SPRG4  SPRG5   SPRG6   SPRG7 

HSRR0  HSRR1 
 CFAR 
 LPCR 00020400
 PTCR    DAR   DSISR 

 kernel:trap=0xffea | pc=0x100 | msr=0x1000

This patch updates kvmppc_set_arch_compat() to use the host PVR value if
'compat_pvr' == 0 indicating that qemu doesn't want to enforce any
specific PVR compat mode.

Fixes: 19d31c5f1157 ("KVM: PPC: Add support for nestedv2 guests")
Reviewed-by: Aneesh Kumar K.V (IBM) 
Reviewed-by: Vaibhav Jain 
Signed-off-by: Amit Machhiwal 
---

Changes v2 -> v3:
- Vaibhav: Use a 'break' instead of switch-case fallthrough
- v2: 
https://lore.kernel.org/all/20240205132607.2776637-1-amach...@linux.ibm.com/

Changes v1 -> v2:
- Added descriptive error log in the patch description when
  `arch_compat == 0` passed in GSB
- Added a helper function for PCR to capabilities mapping
- Added relevant comments around the changes being made
- v1: 
https://lore.kernel.org/lkml/20240118095653.2588129-1-amach...@linux.ibm.com/

 arch/powerpc/kvm/book3s_hv.c  | 26 --
 arch/powerpc/kvm/book3s_hv_nestedv2.c | 23 +--
 2 files changed, 45 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 52427fc2a33f..0b921704da45 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -391,6 +391,24 @@ static void kvmppc_set_pvr_hv(struct kvm_vcpu *vcpu, u32 
pvr)
 /* Dummy value used in computing PCR value below */
 #define PCR_ARCH_31(PCR_ARCH_300 << 1)
 
+static inline unsigned long map_pcr_to_cap(unsigned long pcr)
+{
+   unsigned long cap = 0;
+
+   switch (pcr) {
+   case PCR_ARCH_300:
+   cap = H_GUEST_CAP_POWER9;
+   break;
+   case PCR_ARCH_31:
+   cap = H_GUEST_CAP_POWER10;
+   break;
+   default:
+   break;
+   }
+
+   return cap;
+}
+
 static int kvmppc_set_arch_compat(struct kvm_vcpu *vcpu, u32 arch_compat)
 {
unsigned long host_pcr_bit = 0, guest_pcr_bit = 0, cap = 0;
@@ -424,11 +442,9 @@ static int kvmppc_set_arch_compat(struct kvm_vcpu *vcpu, 
u32 arch_compat)
break;
case PVR_ARCH_300:
guest_pcr_bit = PCR_ARCH_300;
-   cap = H_GUEST_CAP_POWER9;
break;
case PVR_ARCH_31:
guest_pcr_bit = PCR_ARCH_31;
-   cap = H_GUEST_CAP_POWER10;
break;
default:
return -EINVAL;
@@ -440,6 +456,12 @@ static int kvmppc_set_arch_compat(struct kvm_vcpu *vcpu, 
u32 arch_compat)
return -EINVAL;
 
if (kvmhv_on_pseries() && kvmhv_is_nestedv2()) {
+   /*
+* 'arch_compat == 0' would mean the guest should default to
+* L1's compatibility. In this case, the guest woul

Re: Re: [PATCH v2] KVM: PPC: Book3S HV: Fix L2 guest reboot failure due to empty 'arch_compat'

2024-02-05 Thread Amit Machhiwal
Hi Vaibhav,

Thanks for looking into the patch.

On 2024/02/05 11:05 PM, Vaibhav Jain wrote:
> Hi Amit,
> 
> Thanks for the patch. Minor comment on the patch below:
> 
> Amit Machhiwal  writes:
> 
> 
> 
> >  
> > +static inline unsigned long map_pcr_to_cap(unsigned long pcr)
> > +{
> > +   unsigned long cap = 0;
> > +
> > +   switch (pcr) {
> > +   case PCR_ARCH_300:
> > +   cap = H_GUEST_CAP_POWER9;
> > +   break;
> > +   case PCR_ARCH_31:
> > +   cap = H_GUEST_CAP_POWER10;
> Though CONFIG_CC_IMPLICIT_FALLTHROUGH and '-Wimplicit-fallthrough'
> doesnt explicitly flag this usage, please consider using the
> 'fallthrough;' keyword here.
> 
> However you probably dont want this switch-case to fallthrough so please
> use a 'break' instead.

Sure, v3 on the way.

> 
> > +   default:
> > +   break;
> > +   }
> > +
> > +   return cap;
> > +}
> > +
> >
> 
> 
> With the suggested change above
> 
> Reviewed-by: Vaibhav Jain 

Thanks!

> 
> -- 
> Cheers
> ~ Vaibhav

~Amit


[PATCH v2] KVM: PPC: Book3S HV: Fix L2 guest reboot failure due to empty 'arch_compat'

2024-02-05 Thread Amit Machhiwal
Currently, rebooting a pseries nested qemu-kvm guest (L2) results in
below error as L1 qemu sends PVR value 'arch_compat' == 0 via
ppc_set_compat ioctl. This triggers a condition failure in
kvmppc_set_arch_compat() resulting in an EINVAL.

qemu-system-ppc64: Unable to set CPU compatibility mode in KVM: Invalid
argument

Also, a value of 0 for arch_compat generally refers the default
compatibility of the host. But, arch_compat, being a Guest Wide Element
in nested API v2, cannot be set to 0 in GSB as PowerVM (L0) expects a
non-zero value. A value of 0 triggers a kernel trap during a reboot and
consequently causes it to fail:

[   22.106360] reboot: Restarting system
KVM: unknown exit, hardware reason ffea
NIP 0100   LR fe44 CTR  XER 
20040092 CPU#0
MSR 1000 HID0   HF 6c00 iidx 3 didx 3
TB   DECR 0
GPR00   c2a8c300 7fe0
GPR04   1002 82803033
GPR08 0a00  0004 2fff
GPR12  c2e1 000105639200 0004
GPR16  00010563a090  
GPR20 000105639e20 0001056399c8 7fffe54abab0 000105639288
GPR24  0001 0001 
GPR28   c2b30840 
CR   [ -  -  -  -  -  -  -  -  ] RES 000@
 SRR0   SRR1 PVR 00800200 VRSAVE 

SPRG0  SPRG1   SPRG2   SPRG3 

SPRG4  SPRG5   SPRG6   SPRG7 

HSRR0  HSRR1 
 CFAR 
 LPCR 00020400
 PTCR    DAR   DSISR 

 kernel:trap=0xffea | pc=0x100 | msr=0x1000

This patch updates kvmppc_set_arch_compat() to use the host PVR value if
'compat_pvr' == 0 indicating that qemu doesn't want to enforce any
specific PVR compat mode.

Fixes: 19d31c5f1157 ("KVM: PPC: Add support for nestedv2 guests")
Signed-off-by: Amit Machhiwal 
---

Changes v1 -> v2:
- Added descriptive error log in the patch description when
  `arch_compat == 0` passed in GSB
- Added a helper function for PCR to capabilities mapping
- Added relevant comments around the changes being made

v1: 
https://lore.kernel.org/lkml/20240118095653.2588129-1-amach...@linux.ibm.com/

 arch/powerpc/kvm/book3s_hv.c  | 25 +++--
 arch/powerpc/kvm/book3s_hv_nestedv2.c | 23 +--
 2 files changed, 44 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 52427fc2a33f..270ab9cf9a54 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -391,6 +391,23 @@ static void kvmppc_set_pvr_hv(struct kvm_vcpu *vcpu, u32 
pvr)
 /* Dummy value used in computing PCR value below */
 #define PCR_ARCH_31(PCR_ARCH_300 << 1)
 
+static inline unsigned long map_pcr_to_cap(unsigned long pcr)
+{
+   unsigned long cap = 0;
+
+   switch (pcr) {
+   case PCR_ARCH_300:
+   cap = H_GUEST_CAP_POWER9;
+   break;
+   case PCR_ARCH_31:
+   cap = H_GUEST_CAP_POWER10;
+   default:
+   break;
+   }
+
+   return cap;
+}
+
 static int kvmppc_set_arch_compat(struct kvm_vcpu *vcpu, u32 arch_compat)
 {
unsigned long host_pcr_bit = 0, guest_pcr_bit = 0, cap = 0;
@@ -424,11 +441,9 @@ static int kvmppc_set_arch_compat(struct kvm_vcpu *vcpu, 
u32 arch_compat)
break;
case PVR_ARCH_300:
guest_pcr_bit = PCR_ARCH_300;
-   cap = H_GUEST_CAP_POWER9;
break;
case PVR_ARCH_31:
guest_pcr_bit = PCR_ARCH_31;
-   cap = H_GUEST_CAP_POWER10;
break;
default:
return -EINVAL;
@@ -440,6 +455,12 @@ static int kvmppc_set_arch_compat(struct kvm_vcpu *vcpu, 
u32 arch_compat)
return -EINVAL;
 
if (kvmhv_on_pseries() && kvmhv_is_nestedv2()) {
+   /*
+* 'arch_compat == 0' would mean the guest should default to
+* L1's compatibility. In this case, the guest would pick
+* host's PCR and evaluate the corresponding capabilities.
+*/
+   cap = map_pcr_to_cap(guest_pcr_bit);
if (!(cap & nested_capabilities))
return -EINVAL;
}
diff --git a/arch/powerpc/kvm/book3s_hv_nestedv2

Re: Re: [PATCH] KVM: PPC: Book3S HV: Fix L2 guest reboot failure due to empty 'arch_compat'

2024-01-29 Thread Amit Machhiwal
Hi Aneesh,

Thanks for looking into the patch. My comments are inline below.

On 2024/01/24 01:06 PM, Aneesh Kumar K.V wrote:
> Amit Machhiwal  writes:
> 
> > Currently, rebooting a pseries nested qemu-kvm guest (L2) results in
> > below error as L1 qemu sends PVR value 'arch_compat' == 0 via
> > ppc_set_compat ioctl. This triggers a condition failure in
> > kvmppc_set_arch_compat() resulting in an EINVAL.
> >
> > qemu-system-ppc64: Unable to set CPU compatibility mode in KVM: Invalid
> >
> > This patch updates kvmppc_set_arch_compat() to use the host PVR value if
> > 'compat_pvr' == 0 indicating that qemu doesn't want to enforce any
> > specific PVR compat mode.
> >
> > Signed-off-by: Amit Machhiwal 
> > ---
> >  arch/powerpc/kvm/book3s_hv.c  |  2 +-
> >  arch/powerpc/kvm/book3s_hv_nestedv2.c | 12 ++--
> >  2 files changed, 11 insertions(+), 3 deletions(-)
> >
> > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > index 1ed6ec140701..9573d7f4764a 100644
> > --- a/arch/powerpc/kvm/book3s_hv.c
> > +++ b/arch/powerpc/kvm/book3s_hv.c
> > @@ -439,7 +439,7 @@ static int kvmppc_set_arch_compat(struct kvm_vcpu 
> > *vcpu, u32 arch_compat)
> > if (guest_pcr_bit > host_pcr_bit)
> > return -EINVAL;
> >  
> > -   if (kvmhv_on_pseries() && kvmhv_is_nestedv2()) {
> > +   if (kvmhv_on_pseries() && kvmhv_is_nestedv2() && arch_compat) {
> > if (!(cap & nested_capabilities))
> > return -EINVAL;
> > }
> >
> 
> Instead of that arch_compat check, would it better to do
> 
>   if (kvmhv_on_pseries() && kvmhv_is_nestedv2()) {
>   if (cap && !(cap & nested_capabilities))
>   return -EINVAL;
>   }
> 
> ie, if a capability is requested, then check against nested_capbilites
> to see if the capability exist.

The above condition check will cause problems when we would try to boot
a machine below Power 9.

For example, if we passed the arch_compat == PVR_ARCH_207, cap will
remain 0 resulting the above check into a false condition. Consequently,
we would never return an -EINVAL in that case resulting the arch
compatilbility request succeed when it doesn't support nested papr
guest.

> 
> 
> > diff --git a/arch/powerpc/kvm/book3s_hv_nestedv2.c 
> > b/arch/powerpc/kvm/book3s_hv_nestedv2.c
> > index fd3c4f2d9480..069a1fcfd782 100644
> > --- a/arch/powerpc/kvm/book3s_hv_nestedv2.c
> > +++ b/arch/powerpc/kvm/book3s_hv_nestedv2.c
> > @@ -138,6 +138,7 @@ static int gs_msg_ops_vcpu_fill_info(struct 
> > kvmppc_gs_buff *gsb,
> > vector128 v;
> > int rc, i;
> > u16 iden;
> > +   u32 arch_compat = 0;
> >  
> > vcpu = gsm->data;
> >  
> > @@ -347,8 +348,15 @@ static int gs_msg_ops_vcpu_fill_info(struct 
> > kvmppc_gs_buff *gsb,
> > break;
> > }
> > case KVMPPC_GSID_LOGICAL_PVR:
> > -   rc = kvmppc_gse_put_u32(gsb, iden,
> > -   vcpu->arch.vcore->arch_compat);
> > +   if (!vcpu->arch.vcore->arch_compat) {
> > +   if (cpu_has_feature(CPU_FTR_ARCH_31))
> > +   arch_compat = PVR_ARCH_31;
> > +   else if (cpu_has_feature(CPU_FTR_ARCH_300))
> > +   arch_compat = PVR_ARCH_300;
> > +   } else {
> > +   arch_compat = vcpu->arch.vcore->arch_compat;
> > +   }
> > +   rc = kvmppc_gse_put_u32(gsb, iden, arch_compat);
> >
> 
> Won't a arch_compat = 0 work here?. ie, where you observing the -EINVAL from
> the first hunk or does this hunk have an impact? 
>

No, an arch_compat == 0 won't work in nested API v2. That's because the
guest wide PVR cannot be 0, and if arch_compat == 0, then suppported
host PVR value should be mentioned.

If we were to skip this hunk (keeping the arch_compat == 0), a system
reboot of L2 guest would fail and result into a kernel trap as below:

[   22.106360] reboot: Restarting system
KVM: unknown exit, hardware reason ffea
NIP 0100   LR fe44 CTR  XER 
20040092 CPU#0
MSR 1000 HID0   HF 6c00 iidx 3 didx 3
TB   DECR 0
GPR00   c2a8c300 7fe0
GPR04   1002 8000

[PATCH] KVM: PPC: Book3S HV: Fix L2 guest reboot failure due to empty 'arch_compat'

2024-01-18 Thread Amit Machhiwal
Currently, rebooting a pseries nested qemu-kvm guest (L2) results in
below error as L1 qemu sends PVR value 'arch_compat' == 0 via
ppc_set_compat ioctl. This triggers a condition failure in
kvmppc_set_arch_compat() resulting in an EINVAL.

qemu-system-ppc64: Unable to set CPU compatibility mode in KVM: Invalid

This patch updates kvmppc_set_arch_compat() to use the host PVR value if
'compat_pvr' == 0 indicating that qemu doesn't want to enforce any
specific PVR compat mode.

Signed-off-by: Amit Machhiwal 
---
 arch/powerpc/kvm/book3s_hv.c  |  2 +-
 arch/powerpc/kvm/book3s_hv_nestedv2.c | 12 ++--
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 1ed6ec140701..9573d7f4764a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -439,7 +439,7 @@ static int kvmppc_set_arch_compat(struct kvm_vcpu *vcpu, 
u32 arch_compat)
if (guest_pcr_bit > host_pcr_bit)
return -EINVAL;
 
-   if (kvmhv_on_pseries() && kvmhv_is_nestedv2()) {
+   if (kvmhv_on_pseries() && kvmhv_is_nestedv2() && arch_compat) {
if (!(cap & nested_capabilities))
return -EINVAL;
}
diff --git a/arch/powerpc/kvm/book3s_hv_nestedv2.c 
b/arch/powerpc/kvm/book3s_hv_nestedv2.c
index fd3c4f2d9480..069a1fcfd782 100644
--- a/arch/powerpc/kvm/book3s_hv_nestedv2.c
+++ b/arch/powerpc/kvm/book3s_hv_nestedv2.c
@@ -138,6 +138,7 @@ static int gs_msg_ops_vcpu_fill_info(struct kvmppc_gs_buff 
*gsb,
vector128 v;
int rc, i;
u16 iden;
+   u32 arch_compat = 0;
 
vcpu = gsm->data;
 
@@ -347,8 +348,15 @@ static int gs_msg_ops_vcpu_fill_info(struct kvmppc_gs_buff 
*gsb,
break;
}
case KVMPPC_GSID_LOGICAL_PVR:
-   rc = kvmppc_gse_put_u32(gsb, iden,
-   vcpu->arch.vcore->arch_compat);
+   if (!vcpu->arch.vcore->arch_compat) {
+   if (cpu_has_feature(CPU_FTR_ARCH_31))
+   arch_compat = PVR_ARCH_31;
+   else if (cpu_has_feature(CPU_FTR_ARCH_300))
+   arch_compat = PVR_ARCH_300;
+   } else {
+   arch_compat = vcpu->arch.vcore->arch_compat;
+   }
+   rc = kvmppc_gse_put_u32(gsb, iden, arch_compat);
break;
}
 
-- 
2.43.0