Re: [PATCH v3] x86/paravirt: Disable virt spinlock on bare metal

2024-06-25 Thread Nikolay Borisov




On 25.06.24 at 15:54, Chen Yu wrote:

The kernel can change spinlock behavior when running as a guest. But
this guest-friendly behavior causes performance problems on bare metal.
So there's a 'virt_spin_lock_key' static key to switch between the two
modes.

The static key is enabled by default (assume running as a guest) and
should be disabled on bare metal (and in some guests that want native
behavior).

Performance drop is reported when running encode/decode workload and
BenchSEE cache sub-workload.
Bisect points to commit ce0a1b608bfc ("x86/paravirt: Silence unused
native_pv_lock_init() function warning"). When CONFIG_PARAVIRT_SPINLOCKS
is disabled, virt_spin_lock_key is incorrectly set to true on bare
metal, so the qspinlock degenerates to a test-and-set spinlock, which
decreases performance on bare metal.

Set the default value of virt_spin_lock_key to false. If booting in a VM,
enable this key. Later, during VM initialization, if another, more
efficient spinlock implementation is preferred (e.g. the paravirt
spinlock), virt_spin_lock_key is disabled accordingly. The relation is
described below:

X86_FEATURE_HYPERVISOR     Y    Y    Y    N
CONFIG_PARAVIRT_SPINLOCKS  Y    Y    N    Y/N
PV spinlock                Y    N    N    Y/N

virt_spin_lock_key         N    N    Y    N

Fixes: ce0a1b608bfc ("x86/paravirt: Silence unused native_pv_lock_init() function warning")
Suggested-by: Dave Hansen 
Suggested-by: Qiuxu Zhuo 
Suggested-by: Nikolay Borisov 
Reported-by: Prem Nath Dey 
Reported-by: Xiaoping Zhou 
Signed-off-by: Chen Yu 


Reviewed-by: Nikolay Borisov 



Re: [PATCH v2 0/3] ARM: dts: qcom: msm8926-motorola-peregrine: Update device tree of Motorola Moto G 4G (2013)

2024-06-25 Thread Bjorn Andersson


On Mon, 17 Jun 2024 23:22:26 +0200, André Apitzsch wrote:
> Add accelerometer, magnetometer, regulator and temperature sensor alert
> interrupt and update framebuffer supplies.
> 
> 

Applied, thanks!

[1/3] ARM: dts: qcom: msm8926-motorola-peregrine: Add accelerometer, magnetometer, regulator
  commit: 65ec35baeb937e91970c5d88118c5638d8582bb3
[2/3] ARM: dts: qcom: msm8926-motorola-peregrine: Update temperature sensor
  commit: c9c86387ea1c4034fec34690c7cf2a96f9c21196
[3/3] ARM: dts: qcom: msm8926-motorola-peregrine: Add framebuffer supplies
  commit: fed1c79fc7fe10900d99a79a36e40443f3267ef3

Best regards,
-- 
Bjorn Andersson 



RE: [PATCH v3] x86/paravirt: Disable virt spinlock on bare metal

2024-06-25 Thread Zhuo, Qiuxu
> From: Nikolay Borisov 
> [...]
> >> Actually now shouldn't the CONFIG_PARAVIRT_SPINLOCKS check be
> retained?
> >> Otherwise we'll have the virtspinlock enabled even if we are a guest
> >> but CONFIG_PARAVIRT_SPINLOCKS is disabled, no ?
> >>
> >
> > It seems to be the expected behavior? If CONFIG_PARAVIRT_SPINLOCKS is
> > disabled, should the virt_spin_lock_key be enabled in the guest?
> 
> No, but if it's disabled and we are under a hypervisor shouldn't the virt
> spinlock be kept disabled? 

No, the virt_spin_lock_key shouldn't be kept disabled.

According to the comments [1], if CONFIG_PARAVIRT_SPINLOCKS is disabled
in a guest, the virt_spin_lock_key should be enabled so that the
spinlock falls back to the TAS spinlock.

[1] 
https://github.com/torvalds/linux/blob/master/arch/x86/include/asm/qspinlock.h#L94

According to the comments [2], my understanding is that in a guest,
keeping virt_spin_lock_key enabled allows the spinlock to fall back to
TAS if the PV spinlock is not supported (either
CONFIG_PARAVIRT_SPINLOCKS=n or the host doesn't support the PV feature).

[2] https://github.com/torvalds/linux/blob/master/arch/x86/kernel/kvm.c#L1073

> As it stands now, every time we are under a
> hypervisor the virt spinlock is enabled irrespective of the
> PARAVIRT_SPINLOCKS config state.

According to [1] and [2], yes, I think so.

-Qiuxu 



Re: [PATCH V2 3/3] virtio-net: synchronize operstate with admin state on up/down

2024-06-25 Thread Jason Wang
On Tue, Jun 25, 2024 at 4:32 PM Michael S. Tsirkin  wrote:
>
> On Tue, Jun 25, 2024 at 04:11:05PM +0800, Jason Wang wrote:
> > On Tue, Jun 25, 2024 at 3:57 PM Michael S. Tsirkin  wrote:
> > >
> > > On Tue, Jun 25, 2024 at 03:46:44PM +0800, Jason Wang wrote:
> > > > Workqueue is used to serialize those so we won't lose any change.
> > >
> > > So we don't need to re-read then?
> > >
> >
> > We might have to re-read but I don't get why it is a problem for us.
> >
> > Thanks
>
> I don't think each ethtool command should force a full config read,
> is what I mean. Only do it if really needed.

We don't, as we will check config_pending there.

Thanks

>
> --
> MST
>




Re: [PATCH] virtio-pci: PCI extended capabilities for virtio

2024-06-25 Thread Jason Wang
On Wed, Jun 26, 2024 at 6:50 AM Michael S. Tsirkin  wrote:
>
> On Tue, Jun 25, 2024 at 02:39:37PM -0400, sahanlb wrote:
> > PCI legacy configuration space does not have sufficient space for a device
> > that supports all kinds of virtio structures via PCI capabilities. This is
> > especially true if one were to use virtio drivers with physical devices.
> > Link: https://par.nsf.gov/servlets/purl/10463939
> > A physical device may already have many capabilities in the legacy space.
> >
> > This patch adds support to place virtio capabilities in the PCI extended
> > configuration space and makes the driver search both legacy and extended
> > PCI configuration spaces.
> >
> > Add new argument to vp_modern_map_capability to indicate whether mapping
> > a legacy or extended capability.
> > Add new function virtio_pci_find_ext_capability to walk extended
> > capabilities and find virtio capabilities.
> >
> > Modify vp_modern_probe to search both legacy and extended configuration
> > spaces.
> > If virtio_pci_find_capability fails to find common, isr, notify, or device
> > virtio structures, call virtio_pci_find_ext_capability.
> >
> > Notify virtio structure can get mapped either in vp_modern_probe or in
> > vp_modern_map_vq_notify. Add new attribute 'notify_ecap' to
> > struct virtio_pci_modern_device to indicate whether the notify capability
> > is in the extended configuration structure.
> >
> > Add virtio extended capability structures to
> > "include/uapi/linux/virtio_pci.h".
> > Format for the extended structures derived from
> > Link: 
> > https://lore.kernel.org/all/20220112055755.41011-2-jasow...@redhat.com/
> >
> > This patch has been validated using an FPGA development board to implement
> > a virtio interface.
> >
> > Signed-off-by: sahanlb 
>
>
> Thanks for the patch! As any UAPI change, this one has to also
> be accompanied by a spec patch documenting the capabilities.

Maybe start from one or my previous work like:

https://lists.oasis-open.org/archives/virtio-dev/202201/msg00013.html

Thanks




Re: [PATCH v2 4/7] bpf: support error injection static keys for multi_link attached progs

2024-06-25 Thread Alexei Starovoitov
On Wed, Jun 19, 2024 at 3:49 PM Vlastimil Babka  wrote:
>
> Functions marked for error injection can have an associated static key
> that guards the callsite(s) to avoid overhead of calling an empty
> function when no error injection is in progress.
>
> Outside of the error injection framework itself, bpf programs can be
> atteched to kprobes and override results of error-injectable functions.
> To make sure these functions are actually called, attaching such bpf
> programs should control the static key accordingly.
>
> Therefore, add an array of static keys to struct bpf_kprobe_multi_link
> and fill it in addrs_check_error_injection_list() for programs with
> kprobe_override enabled, using get_injection_key() instead of
> within_error_injection_list(). Introduce bpf_kprobe_ei_keys_control() to
> control the static keys and call the control function when doing
> multi_link_attach and release.
>
> Signed-off-by: Vlastimil Babka 
> ---
>  kernel/trace/bpf_trace.c | 59 
> +++-
>  1 file changed, 53 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index 944de1c41209..ef0fadb76bfa 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -2613,6 +2613,7 @@ struct bpf_kprobe_multi_link {
> struct bpf_link link;
> struct fprobe fp;
> unsigned long *addrs;
> +   struct static_key **ei_keys;
> u64 *cookies;
> u32 cnt;
> u32 mods_cnt;
> @@ -2687,11 +2688,30 @@ static void free_user_syms(struct user_syms *us)
> kvfree(us->buf);
>  }
>
> +static void bpf_kprobe_ei_keys_control(struct bpf_kprobe_multi_link *link, 
> bool enable)
> +{
> +   u32 i;
> +
> +   for (i = 0; i < link->cnt; i++) {
> +   if (!link->ei_keys[i])
> +   break;
> +
> +   if (enable)
> +   static_key_slow_inc(link->ei_keys[i]);
> +   else
> +   static_key_slow_dec(link->ei_keys[i]);
> +   }
> +}
> +
>  static void bpf_kprobe_multi_link_release(struct bpf_link *link)
>  {
> struct bpf_kprobe_multi_link *kmulti_link;
>
> kmulti_link = container_of(link, struct bpf_kprobe_multi_link, link);
> +
> +   if (kmulti_link->ei_keys)
> +   bpf_kprobe_ei_keys_control(kmulti_link, false);
> +
> unregister_fprobe(&kmulti_link->fp);
> kprobe_multi_put_modules(kmulti_link->mods, kmulti_link->mods_cnt);
>  }
> @@ -2703,6 +2723,7 @@ static void bpf_kprobe_multi_link_dealloc(struct 
> bpf_link *link)
> kmulti_link = container_of(link, struct bpf_kprobe_multi_link, link);
> kvfree(kmulti_link->addrs);
> kvfree(kmulti_link->cookies);
> +   kvfree(kmulti_link->ei_keys);
> kfree(kmulti_link->mods);
> kfree(kmulti_link);
>  }
> @@ -2985,13 +3006,19 @@ static int get_modules_for_addrs(struct module 
> ***mods, unsigned long *addrs, u3
> return arr.mods_cnt;
>  }
>
> -static int addrs_check_error_injection_list(unsigned long *addrs, u32 cnt)
> +static int addrs_check_error_injection_list(unsigned long *addrs, struct 
> static_key **ei_keys,
> +   u32 cnt)
>  {
> -   u32 i;
> +   struct static_key *ei_key;
> +   u32 i, j = 0;
>
> for (i = 0; i < cnt; i++) {
> -   if (!within_error_injection_list(addrs[i]))
> +   ei_key = get_injection_key(addrs[i]);
> +   if (IS_ERR(ei_key))
> return -EINVAL;
> +
> +   if (ei_key)
> +   ei_keys[j++] = ei_key;
> }
> return 0;
>  }
> @@ -3000,6 +3027,7 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr 
> *attr, struct bpf_prog *pr
>  {
> struct bpf_kprobe_multi_link *link = NULL;
> struct bpf_link_primer link_primer;
> +   struct static_key **ei_keys = NULL;
> void __user *ucookies;
> unsigned long *addrs;
> u32 flags, cnt, size;
> @@ -3075,9 +3103,24 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr 
> *attr, struct bpf_prog *pr
> goto error;
> }
>
> -   if (prog->kprobe_override && addrs_check_error_injection_list(addrs, 
> cnt)) {
> -   err = -EINVAL;
> -   goto error;
> +   if (prog->kprobe_override) {
> +   ei_keys = kvcalloc(cnt, sizeof(*ei_keys), GFP_KERNEL);

cnt can be huge, like tens of thousands,
while the number of keys is tiny (two so far?),
so most of the array will be wasted.

Jiri, Andrii,

please take a look as well.

> +   if (!ei_keys) {
> +   err = -ENOMEM;
> +   goto error;
> +   }
> +
> +   if (addrs_check_error_injection_list(addrs, ei_keys, cnt)) {
> +   err = -EINVAL;
> +   goto error;
> +   }
> +
> +   if (ei_keys[0

Re: [PATCH] virtio-pci: PCI extended capabilities for virtio

2024-06-25 Thread Michael S. Tsirkin
On Tue, Jun 25, 2024 at 02:39:37PM -0400, sahanlb wrote:
> PCI legacy configuration space does not have sufficient space for a device
> that supports all kinds of virtio structures via PCI capabilities. This is
> especially true if one were to use virtio drivers with physical devices.
> Link: https://par.nsf.gov/servlets/purl/10463939
> A physical device may already have many capabilities in the legacy space.
> 
> This patch adds support to place virtio capabilities in the PCI extended
> configuration space and makes the driver search both legacy and extended
> PCI configuration spaces.
> 
> Add new argument to vp_modern_map_capability to indicate whether mapping
> a legacy or extended capability.
> Add new function virtio_pci_find_ext_capability to walk extended
> capabilities and find virtio capabilities.
> 
> Modify vp_modern_probe to search both legacy and extended configuration
> spaces.
> If virtio_pci_find_capability fails to find common, isr, notify, or device
> virtio structures, call virtio_pci_find_ext_capability.
> 
> Notify virtio structure can get mapped either in vp_modern_probe or in
> vp_modern_map_vq_notify. Add new attribute 'notify_ecap' to
> struct virtio_pci_modern_device to indicate whether the notify capability
> > is in the extended configuration structure.
> 
> Add virtio extended capability structures to
> "include/uapi/linux/virtio_pci.h".
> Format for the extended structures derived from
> Link: https://lore.kernel.org/all/20220112055755.41011-2-jasow...@redhat.com/
> 
> This patch has been validated using an FPGA development board to implement 
> a virtio interface.
> 
> Signed-off-by: sahanlb 


Thanks for the patch! As any UAPI change, this one has to also
be accompanied by a spec patch documenting the capabilities.


...

> +struct virtio_pci_cfg_ecap {
> + struct virtio_pci_ecap cap;
> + __u8 pci_cfg_data[4]; /* Data for BAR access. */
> +};

Hmm, a weird thing to do. The reason we have it is that a legacy BIOS
has trouble accessing a BAR directly (e.g. if the BAR is 64 bit). Is
there still an issue even for a BIOS with support for PCIe extended
config space?


> +
>  /* Macro versions of offsets for the Old Timers! */
>  #define VIRTIO_PCI_CAP_VNDR  0
>  #define VIRTIO_PCI_CAP_NEXT  1


This comes at an interesting time: there is interest, in order to get
virtio working over the EP protocol, in exposing all this information
as part of a BAR as opposed to a capability. I wonder whether that's
also palatable.


> -- 
> 2.42.0




Re: [PATCH] qdisc: fix NULL pointer dereference in perf_trace_qdisc_reset()

2024-06-25 Thread Steven Rostedt
On Fri, 21 Jun 2024 20:45:49 +0900
ysk...@gmail.com wrote:

> diff --git a/include/trace/events/qdisc.h b/include/trace/events/qdisc.h
> index f1b5e816e7e5..170b51fbe47a 100644
> --- a/include/trace/events/qdisc.h
> +++ b/include/trace/events/qdisc.h
> @@ -81,7 +81,7 @@ TRACE_EVENT(qdisc_reset,
>   TP_ARGS(q),
>  
>   TP_STRUCT__entry(
> - __string(   dev,qdisc_dev(q)->name  )
> + __string(dev, qdisc_dev(q) ? qdisc_dev(q)->name : "noop_queue")

From a tracing point of view:

Reviewed-by: Steven Rostedt (Google) 

-- Steve

>   __string(   kind,   q->ops->id  )
>   __field(u32,parent  )
>   __field(u32,handle  )



Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support

2024-06-25 Thread John Stultz
On Tue, Jun 25, 2024 at 2:48 PM David Woodhouse  wrote:
> On Tue, 2024-06-25 at 23:34 +0200, Thomas Gleixner wrote:
> > On Tue, Jun 25 2024 at 20:01, David Woodhouse wrote:
> > > From: David Woodhouse 
> > >
> > > The vmclock "device" provides a shared memory region with precision clock
> > > information. By using shared memory, it is safe across Live Migration.
> > >
> > > Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> > > KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> > > actually helpful.
> > >
> > > The memory region of the device is also exposed to userspace so it can be
> > > read or memory mapped by applications which need reliable notification of
> > > clock disruptions.
> >
> > There is effort underway to expose PTP clocks to user space via VDSO.
>
> Ooh, interesting. Got a reference to that please?
>
> >  Can we please not expose an ad hoc interface for that?
>
> Absolutely. I'm explicitly trying to intercept the virtio-rtc
> specification here, to *avoid* having to do anything ad hoc.
>
> Note that this is a "vDSO-style" interface from hypervisor to guest via
> a shared memory region, not necessarily an actual vDSO.
>
> But yes, it *is* intended to be exposed to userspace, so that userspace
> can know the *accurate* time without a system call, and know that it
> hasn't been perturbed by live migration.

Yea, I was going to raise a concern that just defining an mmapped
structure means it has to trust the guest logic is as expected. It's
good that it's versioned! :)

I'd fret a bit about exposing this to userland. It feels very similar
to the old powerpc systemcfg implementation that similarly mapped just
kernel data out to userland and was difficult to maintain as changes
were made. Would including a code page like a proper vDSO make sense,
to make this a more flexible UABI to maintain?

thanks
-john



Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support

2024-06-25 Thread David Woodhouse
On Tue, 2024-06-25 at 23:34 +0200, Thomas Gleixner wrote:
> On Tue, Jun 25 2024 at 20:01, David Woodhouse wrote:
> > From: David Woodhouse 
> > 
> > The vmclock "device" provides a shared memory region with precision clock
> > information. By using shared memory, it is safe across Live Migration.
> > 
> > Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> > KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> > actually helpful.
> > 
> > The memory region of the device is also exposed to userspace so it can be
> > read or memory mapped by applications which need reliable notification of
> > clock disruptions.
> 
> There is effort underway to expose PTP clocks to user space via VDSO.

Ooh, interesting. Got a reference to that please?

>  Can we please not expose an ad hoc interface for that?

Absolutely. I'm explicitly trying to intercept the virtio-rtc
specification here, to *avoid* having to do anything ad hoc.

Note that this is a "vDSO-style" interface from hypervisor to guest via
a shared memory region, not necessarily an actual vDSO.

But yes, it *is* intended to be exposed to userspace, so that userspace
can know the *accurate* time without a system call, and know that it
hasn't been perturbed by live migration.

> As you might have heard the sad news, I'm not feeling up to the task to
> dig deeper into this right now. Give me a couple of days to look at this
> with working brain.

I have not heard any news, although now I'm making inferences.

Wishing you the best!




Re: [RFC PATCH v2] ptp: Add vDSO-style vmclock support

2024-06-25 Thread Thomas Gleixner
On Tue, Jun 25 2024 at 20:01, David Woodhouse wrote:
> From: David Woodhouse 
>
> The vmclock "device" provides a shared memory region with precision clock
> information. By using shared memory, it is safe across Live Migration.
>
> Like the KVM PTP clock, this can convert TSC-based cross timestamps into
> KVM clock values. Unlike the KVM PTP clock, it does so only when such is
> actually helpful.
>
> The memory region of the device is also exposed to userspace so it can be
> read or memory mapped by applications which need reliable notification of
> clock disruptions.

There is effort underway to expose PTP clocks to user space via
VDSO. Can we please not expose an ad hoc interface for that?

As you might have heard the sad news, I'm not feeling up to the task to
dig deeper into this right now. Give me a couple of days to look at this
with working brain.

Thanks,

tglx



Re: [PATCH v3 3/4] remoteproc: qcom_q6v5_pas: Add support to attach a DSP

2024-06-25 Thread Bjorn Andersson
On Thu, Jun 20, 2024 at 05:31:42PM GMT, Komal Bajaj wrote:
> From: Melody Olvera 
> 
> Some chipsets have DSPs which will have begun running prior to
> Linux booting, so add support to late-attach these DSPs by
> adding support for:
> - run-time checking of an offline or running DSP via rmb register
> - a late attach framework to attach to the running DSP
> - a handshake mechanism to ensure full and proper booting via rmb

I don't fully understand what state we're going to find the remote side
in during this attach.

I think that between commit message and comments in the code it should
be clear what the expected flow is, why there is a 5 second timeout, why
does Linux tell the DSP to "continue boot", etc.

Regards,
Bjorn

> 
> Signed-off-by: Melody Olvera 
> Signed-off-by: Komal Bajaj 
> ---
>  drivers/remoteproc/qcom_q6v5_pas.c | 102 +
>  1 file changed, 102 insertions(+)
> 
> diff --git a/drivers/remoteproc/qcom_q6v5_pas.c 
> b/drivers/remoteproc/qcom_q6v5_pas.c
> index b9759f6b2283..32d45c18e15e 100644
> --- a/drivers/remoteproc/qcom_q6v5_pas.c
> +++ b/drivers/remoteproc/qcom_q6v5_pas.c
> @@ -11,6 +11,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -258,6 +259,94 @@ static int adsp_load(struct rproc *rproc, const struct 
> firmware *fw)
>   return ret;
>  }
> 
> +static int adsp_signal_q6v5(struct qcom_adsp *adsp)
> +{
> + unsigned int val;
> + int ret;
> +
> + if (adsp->q6v5.rmb_base) {
> + ret = readl_poll_timeout(adsp->q6v5.rmb_base + 
> RMB_BOOT_WAIT_REG,
> +  val, val, 2,
> +  RMB_POLL_MAX_TIMES * 2);
> + if (ret < 0)
> + return ret;
> +
> + writel_relaxed(1, adsp->q6v5.rmb_base + RMB_BOOT_CONT_REG);
> + }
> +
> + return 0;
> +}
> +
> +static int adsp_attach(struct rproc *rproc)
> +{
> + struct qcom_adsp *adsp = (struct qcom_adsp *)rproc->priv;
> + int ret;
> +
> + ret = qcom_q6v5_prepare(&adsp->q6v5);
> + if (ret)
> + return ret;
> +
> + ret = adsp_pds_enable(adsp, adsp->proxy_pds, adsp->proxy_pd_count);
> + if (ret < 0)
> + goto disable_irqs;
> +
> + ret = clk_prepare_enable(adsp->xo);
> + if (ret)
> + goto disable_proxy_pds;
> +
> + ret = clk_prepare_enable(adsp->aggre2_clk);
> + if (ret)
> + goto disable_xo_clk;
> +
> + if (adsp->cx_supply) {
> + ret = regulator_enable(adsp->cx_supply);
> + if (ret)
> + goto disable_aggre2_clk;
> + }
> +
> + if (adsp->px_supply) {
> + ret = regulator_enable(adsp->px_supply);
> + if (ret)
> + goto disable_cx_supply;
> + }
> +
> + /* if needed, signal Q6 to continue booting */
> + ret = adsp_signal_q6v5(adsp);
> + if (ret < 0) {
> + dev_err(adsp->dev, "Didn't get rmb signal from  %s\n", 
> rproc->name);
> + goto disable_px_supply;
> + };
> +
> + ret = qcom_q6v5_wait_for_start(&adsp->q6v5, msecs_to_jiffies(5000));
> + if (ret == -ETIMEDOUT) {
> + dev_err(adsp->dev, "start timed out\n");
> + qcom_scm_pas_shutdown(adsp->pas_id);
> + goto disable_px_supply;
> + }
> +
> + return 0;
> +
> +disable_px_supply:
> + if (adsp->px_supply)
> + regulator_disable(adsp->px_supply);
> +disable_cx_supply:
> + if (adsp->cx_supply)
> + regulator_disable(adsp->cx_supply);
> +disable_aggre2_clk:
> + clk_disable_unprepare(adsp->aggre2_clk);
> +disable_xo_clk:
> + clk_disable_unprepare(adsp->xo);
> +disable_proxy_pds:
> + adsp_pds_disable(adsp, adsp->proxy_pds, adsp->proxy_pd_count);
> +disable_irqs:
> + qcom_q6v5_unprepare(&adsp->q6v5);
> +
> + /* Remove pointer to the loaded firmware, only valid in adsp_load() & 
> adsp_start() */
> + adsp->firmware = NULL;
> +
> + return ret;
> +}
> +
>  static int adsp_start(struct rproc *rproc)
>  {
>   struct qcom_adsp *adsp = rproc->priv;
> @@ -320,6 +409,13 @@ static int adsp_start(struct rproc *rproc)
>   goto release_pas_metadata;
>   }
> 
> + /* if needed, signal Q6 to continue booting */
> + ret = adsp_signal_q6v5(adsp);
> + if (ret < 0) {
> + dev_err(adsp->dev, "Didn't get rmb signal from  %s\n", 
> rproc->name);
> + goto release_pas_metadata;
> + }
> +
>   ret = qcom_q6v5_wait_for_start(&adsp->q6v5, msecs_to_jiffies(5000));
>   if (ret == -ETIMEDOUT) {
>   dev_err(adsp->dev, "start timed out\n");
> @@ -432,6 +528,7 @@ static unsigned long adsp_panic(struct rproc *rproc)
>  static const struct rproc_ops adsp_ops = {
>   .unprepare = adsp_unprepare,
>   .start = adsp_start,
> + .attach = adsp_attach,
>   .stop = adsp_stop,
>   .da_to_va = adsp_da_to_va,
>   

[PATCH v2 4/4] EDAC/mce_amd: Add support for FRU Text in MCA

2024-06-25 Thread Avadhut Naik
From: Yazen Ghannam 

A new "FRU Text in MCA" feature is defined where the Field Replaceable
Unit (FRU) Text for a device is represented by a string in the new
MCA_SYND1 and MCA_SYND2 registers. This feature is supported per MCA
bank, and it is advertised by the McaFruTextInMca bit (MCA_CONFIG[9]).

The FRU Text is populated dynamically for each individual error state
(MCA_STATUS, MCA_ADDR, et al.). This handles the case where an MCA bank
covers multiple devices, for example, a Unified Memory Controller (UMC)
bank that manages two DIMMs.

Print the FRU Text string, if available, when decoding an MCA error.

Also, add a field for the MCA_CONFIG MSR in struct mce_hw_err as
vendor-specific error information and save the value of the MSR. This
value can then be exported through the tracepoint for userspace tools
like rasdaemon to print the FRU Text, if available.

 Note: Checkpatch checks/warnings are ignored to maintain coding style.

[Yazen: Add Avadhut as co-developer for wrapper changes. ]

Co-developed-by: Avadhut Naik 
Signed-off-by: Avadhut Naik 
Signed-off-by: Yazen Ghannam 
---
 arch/x86/include/asm/mce.h |  2 ++
 arch/x86/kernel/cpu/mce/amd.c  |  1 +
 arch/x86/kernel/cpu/mce/apei.c |  2 ++
 arch/x86/kernel/cpu/mce/core.c |  3 +++
 drivers/edac/mce_amd.c | 21 ++---
 5 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 2b43ba37bbda..c6dea9c12498 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -61,6 +61,7 @@
  *  - TCC bit is present in MCx_STATUS.
  */
 #define MCI_CONFIG_MCAX0x1
+#define MCI_CONFIG_FRUTEXT BIT_ULL(9)
 #define MCI_IPID_MCATYPE   0x
 #define MCI_IPID_HWID  0xFFF
 
@@ -199,6 +200,7 @@ struct mce_hw_err {
struct {
u64 synd1;
u64 synd2;
+   u64 config;
} amd;
} vi;
 };
diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index fc69d244ca7f..f690905aa04f 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -798,6 +798,7 @@ static void __log_error(unsigned int bank, u64 status, u64 
addr, u64 misc)
 
if (mce_flags.smca) {
rdmsrl(MSR_AMD64_SMCA_MCx_IPID(bank), m->ipid);
+   rdmsrl(MSR_AMD64_SMCA_MCx_CONFIG(bank), err.vi.amd.config);
 
if (m->status & MCI_STATUS_SYNDV) {
rdmsrl(MSR_AMD64_SMCA_MCx_SYND(bank), m->synd);
diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
index 7a15f0ca1bd1..ba8947983dc7 100644
--- a/arch/x86/kernel/cpu/mce/apei.c
+++ b/arch/x86/kernel/cpu/mce/apei.c
@@ -157,6 +157,8 @@ int apei_smca_report_x86_error(struct cper_ia_proc_ctx 
*ctx_info, u64 lapic_id)
fallthrough;
/* MCA_CONFIG */
case 4:
+   err.vi.amd.config = *(i_mce + 3);
+   fallthrough;
/* MCA_MISC0 */
case 3:
m->misc = *(i_mce + 2);
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 3bb0f8b39f97..cbd10e499a28 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -195,6 +195,8 @@ static void __print_mce(struct mce_hw_err *err)
pr_cont("SYND2 %llx ", err->vi.amd.synd2);
if (m->ipid)
pr_cont("IPID %llx ", m->ipid);
+   if (err->vi.amd.config)
+   pr_cont("CONFIG %llx ", err->vi.amd.config);
}
 
pr_cont("\n");
@@ -667,6 +669,7 @@ static noinstr void mce_read_aux(struct mce_hw_err *err, 
int i)
 
if (mce_flags.smca) {
m->ipid = mce_rdmsrl(MSR_AMD64_SMCA_MCx_IPID(i));
+   err->vi.amd.config = mce_rdmsrl(MSR_AMD64_SMCA_MCx_CONFIG(i));
 
if (m->status & MCI_STATUS_SYNDV) {
m->synd = mce_rdmsrl(MSR_AMD64_SMCA_MCx_SYND(i));
diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index 69e12cb2f0de..6ae6b89b1a1e 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -795,6 +795,7 @@ amd_decode_mce(struct notifier_block *nb, unsigned long 
val, void *data)
struct mce_hw_err *err = (struct mce_hw_err *)data;
struct mce *m = &err->m;
unsigned int fam = x86_family(m->cpuid);
+   u64 mca_config = err->vi.amd.config;
int ecc;
 
if (m->kflags & MCE_HANDLED_CEC)
@@ -814,11 +815,7 @@ amd_decode_mce(struct notifier_block *nb, unsigned long 
val, void *data)
((m->status & MCI_STATUS_PCC)   ? "PCC"   : "-"));
 
if (boot_cpu_has(X86_FEATURE_SMCA)) {
-   u32 low, high;
-   u32 addr = MSR_AMD64_SMCA_MCx_CONFIG(m->bank);
-
-   if (!rdmsr_safe(addr, &low, &high) &&
-   (low & MCI_CONFIG_MCAX))
+   if (mca_config & MCI_CONFIG_MCAX)
   

[PATCH v2 3/4] x86/mce/apei: Handle variable register array size

2024-06-25 Thread Avadhut Naik
From: Yazen Ghannam 

ACPI Boot Error Record Table (BERT) is being used by the kernel to
report errors that occurred in a previous boot. On some modern AMD
systems, these very errors within the BERT are reported through the
x86 Common Platform Error Record (CPER) format which consists of one
or more Processor Context Information Structures. These context
structures provide a starting address and represent an x86 MSR range
in which the data constitutes a contiguous set of MSRs starting from,
and including, the starting address.

It's common, for AMD systems that implement this behavior, that the
MSR range represents the MCAX register space used for the Scalable MCA
feature. The apei_smca_report_x86_error() function decodes and passes
this information through the MCE notifier chain. However, this function
assumes a fixed register size based on the original HW/FW implementation.

This assumption breaks with the addition of two new MCAX registers viz.
MCA_SYND1 and MCA_SYND2. These registers are added at the end of the
MCAX register space, so they won't be included when decoding the CPER
data.

Rework apei_smca_report_x86_error() to support a variable register array
size. This covers any case where the MSR context information starts at
the MCAX address for MCA_STATUS and ends at any other register within
the MCAX register space.

Add code comments indicating the MCAX register at each offset.

[Yazen: Add Avadhut as co-developer for wrapper changes.]

Co-developed-by: Avadhut Naik 
Signed-off-by: Avadhut Naik 
Signed-off-by: Yazen Ghannam 
---
 arch/x86/kernel/cpu/mce/apei.c | 73 +++---
 1 file changed, 59 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
index b8f4e75fb8a7..7a15f0ca1bd1 100644
--- a/arch/x86/kernel/cpu/mce/apei.c
+++ b/arch/x86/kernel/cpu/mce/apei.c
@@ -69,9 +69,9 @@ EXPORT_SYMBOL_GPL(apei_mce_report_mem_error);
 int apei_smca_report_x86_error(struct cper_ia_proc_ctx *ctx_info, u64 lapic_id)
 {
const u64 *i_mce = ((const u64 *) (ctx_info + 1));
+   unsigned int cpu, num_registers;
struct mce_hw_err err;
struct mce *m = &err.m;
-   unsigned int cpu;
 
memset(&err, 0, sizeof(struct mce_hw_err));
 
@@ -91,16 +91,12 @@ int apei_smca_report_x86_error(struct cper_ia_proc_ctx 
*ctx_info, u64 lapic_id)
return -EINVAL;
 
/*
-* The register array size must be large enough to include all the
-* SMCA registers which need to be extracted.
-*
 * The number of registers in the register array is determined by
 * Register Array Size/8 as defined in UEFI spec v2.8, sec N.2.4.2.2.
-* The register layout is fixed and currently the raw data in the
-* register array includes 6 SMCA registers which the kernel can
-* extract.
+* Ensure that the array size includes at least 1 register.
 */
-   if (ctx_info->reg_arr_size < 48)
+   num_registers = ctx_info->reg_arr_size >> 3;
+   if (!num_registers)
return -EINVAL;
 
mce_setup(m);
@@ -118,12 +114,61 @@ int apei_smca_report_x86_error(struct cper_ia_proc_ctx 
*ctx_info, u64 lapic_id)
 
m->apicid = lapic_id;
m->bank = (ctx_info->msr_addr >> 4) & 0xFF;
-   m->status = *i_mce;
-   m->addr = *(i_mce + 1);
-   m->misc = *(i_mce + 2);
-   /* Skipping MCA_CONFIG */
-   m->ipid = *(i_mce + 4);
-   m->synd = *(i_mce + 5);
+
+   /*
+* The SMCA register layout is fixed and includes 16 registers.
+* The end of the array may be variable, but the beginning is known.
+* Switch on the number of registers. Cap the number of registers to
+* expected max (15).
+*/
+   if (num_registers > 15)
+   num_registers = 15;
+
+   switch (num_registers) {
+   /* MCA_SYND2 */
+   case 15:
+   err.vi.amd.synd2 = *(i_mce + 14);
+   fallthrough;
+   /* MCA_SYND1 */
+   case 14:
+   err.vi.amd.synd1 = *(i_mce + 13);
+   fallthrough;
+   /* MCA_MISC4 */
+   case 13:
+   /* MCA_MISC3 */
+   case 12:
+   /* MCA_MISC2 */
+   case 11:
+   /* MCA_MISC1 */
+   case 10:
+   /* MCA_DEADDR */
+   case 9:
+   /* MCA_DESTAT */
+   case 8:
+   /* reserved */
+   case 7:
+   /* MCA_SYND */
+   case 6:
+   m->synd = *(i_mce + 5);
+   fallthrough;
+   /* MCA_IPID */
+   case 5:
+   m->ipid = *(i_mce + 4);
+   fallthrough;
+   /* MCA_CONFIG */
+   case 4:
+   /* MCA_MISC0 */
+   case 3:
+   m->misc = *(i_mce + 2);
+   fallthrough;
+   /* MCA_ADDR */
+   case 2:
+   m->addr = *(i_mce + 1);
+   fallthrough;
+   /* MCA_STATUS */
+   case 1:
+   m->status = *i_mce;
+   }
 
mce_log

[PATCH v2 2/4] x86/mce, EDAC/mce_amd: Add support for new MCA_SYND{1,2} registers

2024-06-25 Thread Avadhut Naik
AMD's Scalable MCA systems viz. Genoa will include two new registers:
MCA_SYND1 and MCA_SYND2.

These registers will include supplemental error information in addition
to the existing MCA_SYND register. The data within the registers is
considered valid if MCA_STATUS[SyndV] is set.

Add fields for these registers as vendor-specific error information
in struct mce_hw_err. Save and print these registers wherever
MCA_STATUS[SyndV]/MCA_SYND is currently used.

Also, modify the mce_record tracepoint to export these new registers
through __dynamic_array. While the sizeof() operator is currently used
to determine the size of this __dynamic_array, it can, if needed in the
future, be replaced by caching the size of the vendor-specific error
information as part of struct mce_hw_err.

Note: Checkpatch warnings/errors are ignored to maintain coding style.

[Yazen: Drop Yazen's Co-developed-by tag and moved SoB tag.]

Signed-off-by: Avadhut Naik 
Signed-off-by: Yazen Ghannam 
---
 arch/x86/include/asm/mce.h | 12 
 arch/x86/kernel/cpu/mce/amd.c  |  5 -
 arch/x86/kernel/cpu/mce/core.c | 24 +---
 drivers/edac/mce_amd.c | 10 +++---
 include/trace/events/mce.h |  9 +++--
 5 files changed, 47 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index e955edb22897..2b43ba37bbda 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -122,6 +122,9 @@
 #define MSR_AMD64_SMCA_MC0_DESTAT  0xc0002008
 #define MSR_AMD64_SMCA_MC0_DEADDR  0xc0002009
 #define MSR_AMD64_SMCA_MC0_MISC1   0xc000200a
+/* Registers MISC2 to MISC4 are at offsets B to D. */
+#define MSR_AMD64_SMCA_MC0_SYND1   0xc000200e
+#define MSR_AMD64_SMCA_MC0_SYND2   0xc000200f
 #define MSR_AMD64_SMCA_MCx_CTL(x)  (MSR_AMD64_SMCA_MC0_CTL + 0x10*(x))
 #define MSR_AMD64_SMCA_MCx_STATUS(x)   (MSR_AMD64_SMCA_MC0_STATUS + 0x10*(x))
 #define MSR_AMD64_SMCA_MCx_ADDR(x) (MSR_AMD64_SMCA_MC0_ADDR + 0x10*(x))
@@ -132,6 +135,8 @@
 #define MSR_AMD64_SMCA_MCx_DESTAT(x)   (MSR_AMD64_SMCA_MC0_DESTAT + 0x10*(x))
 #define MSR_AMD64_SMCA_MCx_DEADDR(x)   (MSR_AMD64_SMCA_MC0_DEADDR + 0x10*(x))
 #define MSR_AMD64_SMCA_MCx_MISCy(x, y) ((MSR_AMD64_SMCA_MC0_MISC1 + y) + (0x10*(x)))
+#define MSR_AMD64_SMCA_MCx_SYND1(x)(MSR_AMD64_SMCA_MC0_SYND1 + 0x10*(x))
+#define MSR_AMD64_SMCA_MCx_SYND2(x)(MSR_AMD64_SMCA_MC0_SYND2 + 0x10*(x))
 
 #define XEC(x, mask)   (((x) >> 16) & mask)
 
@@ -189,6 +194,13 @@ enum mce_notifier_prios {
 
 struct mce_hw_err {
struct mce m;
+
+   union vendor_info {
+   struct {
+   u64 synd1;
+   u64 synd2;
+   } amd;
+   } vi;
 };
 
 struct notifier_block;
diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index cb7dc0b1aa50..fc69d244ca7f 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -799,8 +799,11 @@ static void __log_error(unsigned int bank, u64 status, u64 addr, u64 misc)
if (mce_flags.smca) {
rdmsrl(MSR_AMD64_SMCA_MCx_IPID(bank), m->ipid);
 
-   if (m->status & MCI_STATUS_SYNDV)
+   if (m->status & MCI_STATUS_SYNDV) {
rdmsrl(MSR_AMD64_SMCA_MCx_SYND(bank), m->synd);
+   rdmsrl(MSR_AMD64_SMCA_MCx_SYND1(bank), err.vi.amd.synd1);
+   rdmsrl(MSR_AMD64_SMCA_MCx_SYND2(bank), err.vi.amd.synd2);
+   }
}
 
mce_log(&err);
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 6225143b9b14..3bb0f8b39f97 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -189,6 +189,10 @@ static void __print_mce(struct mce_hw_err *err)
if (mce_flags.smca) {
if (m->synd)
pr_cont("SYND %llx ", m->synd);
+   if (err->vi.amd.synd1)
+   pr_cont("SYND1 %llx ", err->vi.amd.synd1);
+   if (err->vi.amd.synd2)
+   pr_cont("SYND2 %llx ", err->vi.amd.synd2);
if (m->ipid)
pr_cont("IPID %llx ", m->ipid);
}
@@ -639,8 +643,10 @@ static struct notifier_block mce_default_nb = {
 /*
  * Read ADDR and MISC registers.
  */
-static noinstr void mce_read_aux(struct mce *m, int i)
+static noinstr void mce_read_aux(struct mce_hw_err *err, int i)
 {
+   struct mce *m = &err->m;
+
if (m->status & MCI_STATUS_MISCV)
m->misc = mce_rdmsrl(mca_msr_reg(i, MCA_MISC));
 
@@ -662,8 +668,11 @@ static noinstr void mce_read_aux(struct mce *m, int i)
if (mce_flags.smca) {
m->ipid = mce_rdmsrl(MSR_AMD64_SMCA_MCx_IPID(i));
 
-   if (m->status & MCI_STATUS_SYNDV)
+   if (m->status & MCI_STATUS_SYNDV) {
m->synd = mce_rdmsrl(MSR_AMD64_SMCA_MCx_SYND(i));
+   

[PATCH v2 1/4] x86/mce: Add wrapper for struct mce to export vendor specific info

2024-06-25 Thread Avadhut Naik
Currently, exporting new additional machine check error information
involves adding new fields for the same at the end of the struct mce.
This additional information can then be consumed through mcelog or
tracepoint.

However, as new MSRs are being added (and will be added in the future)
by CPU vendors on their newer CPUs with additional machine check error
information to be exported, the size of struct mce will balloon on some
CPUs, unnecessarily, since those fields are vendor-specific. Moreover,
different CPU vendors may export the additional information in varying
sizes.

The problem is particularly acute since struct mce is exposed to
userspace as part of the UAPI. Bloating it with vendor-specific data
should be avoided to limit the information being sent out to userspace.

Add a new structure mce_hw_err to wrap the existing struct mce. This
prevents struct mce from ballooning, since vendor-specific data, if any,
can now be exported through a union within the wrapper structure and
through a __dynamic_array in the mce_record tracepoint.

Furthermore, new internal kernel fields can be added to the wrapper
struct without impacting the user space API.

Note: Some Checkpatch checks have been ignored to maintain coding style.

[Yazen: Add last commit message paragraph.]

Suggested-by: Borislav Petkov (AMD) 
Signed-off-by: Avadhut Naik 
Signed-off-by: Yazen Ghannam 
---
 arch/x86/include/asm/mce.h  |   6 +-
 arch/x86/kernel/cpu/mce/amd.c   |  29 ++--
 arch/x86/kernel/cpu/mce/apei.c  |  54 +++
 arch/x86/kernel/cpu/mce/core.c  | 178 +---
 arch/x86/kernel/cpu/mce/dev-mcelog.c|   2 +-
 arch/x86/kernel/cpu/mce/genpool.c   |  20 +--
 arch/x86/kernel/cpu/mce/inject.c|   4 +-
 arch/x86/kernel/cpu/mce/internal.h  |   4 +-
 drivers/acpi/acpi_extlog.c  |   2 +-
 drivers/acpi/nfit/mce.c |   2 +-
 drivers/edac/i7core_edac.c  |   2 +-
 drivers/edac/igen6_edac.c   |   2 +-
 drivers/edac/mce_amd.c  |   2 +-
 drivers/edac/pnd2_edac.c|   2 +-
 drivers/edac/sb_edac.c  |   2 +-
 drivers/edac/skx_common.c   |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c |   2 +-
 drivers/ras/amd/fmpm.c  |   2 +-
 drivers/ras/cec.c   |   2 +-
 include/trace/events/mce.h  |  42 +++---
 20 files changed, 199 insertions(+), 162 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 3ad29b128943..e955edb22897 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -187,6 +187,10 @@ enum mce_notifier_prios {
MCE_PRIO_HIGHEST = MCE_PRIO_CEC
 };
 
+struct mce_hw_err {
+   struct mce m;
+};
+
 struct notifier_block;
 extern void mce_register_decode_chain(struct notifier_block *nb);
 extern void mce_unregister_decode_chain(struct notifier_block *nb);
@@ -222,7 +226,7 @@ static inline int apei_smca_report_x86_error(struct 
cper_ia_proc_ctx *ctx_info,
 #endif
 
 void mce_setup(struct mce *m);
-void mce_log(struct mce *m);
+void mce_log(struct mce_hw_err *err);
 DECLARE_PER_CPU(struct device *, mce_device);
 
 /* Maximum number of MCA banks per CPU. */
diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 9a0133ef7e20..cb7dc0b1aa50 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -778,29 +778,32 @@ bool amd_mce_usable_address(struct mce *m)
 
 static void __log_error(unsigned int bank, u64 status, u64 addr, u64 misc)
 {
-   struct mce m;
+   struct mce_hw_err err;
+   struct mce *m = &err.m;
 
-   mce_setup(&m);
+   memset(&err, 0, sizeof(struct mce_hw_err));
 
-   m.status = status;
-   m.misc   = misc;
-   m.bank   = bank;
-   m.tsc= rdtsc();
+   mce_setup(m);
 
-   if (m.status & MCI_STATUS_ADDRV) {
-   m.addr = addr;
+   m->status = status;
+   m->misc   = misc;
+   m->bank   = bank;
+   m->tsc   = rdtsc();
 
-   smca_extract_err_addr(&m);
+   if (m->status & MCI_STATUS_ADDRV) {
+   m->addr = addr;
+
+   smca_extract_err_addr(m);
}
 
if (mce_flags.smca) {
-   rdmsrl(MSR_AMD64_SMCA_MCx_IPID(bank), m.ipid);
+   rdmsrl(MSR_AMD64_SMCA_MCx_IPID(bank), m->ipid);
 
-   if (m.status & MCI_STATUS_SYNDV)
-   rdmsrl(MSR_AMD64_SMCA_MCx_SYND(bank), m.synd);
+   if (m->status & MCI_STATUS_SYNDV)
+   rdmsrl(MSR_AMD64_SMCA_MCx_SYND(bank), m->synd);
}
 
-   mce_log(&m);
+   mce_log(&err);
 }
 
 DEFINE_IDTENTRY_SYSVEC(sysvec_deferred_error)
diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
index 7f7309ff67d0..b8f4e75fb8a7 100644
--- a/arch/x86/kernel/cpu/mce/apei.c
+++ b/arch/x86/kernel/cpu/mce/apei.c
@@ -28,9 +28,12 @@
 
 void apei_mce_report_mem_error(int sev

[PATCH v2 0/4] MCE wrapper and support for new SMCA syndrome MSRs

2024-06-25 Thread Avadhut Naik
This patchset adds a new wrapper for struct mce to prevent its bloating
and export vendor specific error information. Additionally, support is
also introduced for two new "syndrome" MSRs used in newer AMD Scalable
MCA (SMCA) systems. Also, a new "FRU Text in MCA" feature that uses these
new "syndrome" MSRs has been added.

Patch 1 adds the new wrapper structure mce_hw_err for the struct mce
while also modifying the mce_record tracepoint to use the new wrapper.

Patch 2 adds support for the new "syndrome" registers. They are read/printed
wherever the existing MCA_SYND register is used.

Patch 3 updates the function that pulls MCA information from UEFI x86
Common Platform Error Records (CPERs) to handle systems that support the
new registers.

Patch 4 adds support to the AMD MCE decoder module to detect and use the
"FRU Text in MCA" feature which leverages the new registers.

NOTE:

This set was initially submitted as part of the larger MCA Updates set.

v1: 
https://lore.kernel.org/linux-edac/20231118193248.1296798-1-yazen.ghan...@amd.com/
v2: 
https://lore.kernel.org/linux-edac/20240404151359.47970-1-yazen.ghan...@amd.com/

However, since the MCA Updates set has been split up into smaller sets,
this set, going forward, will be submitted independently.

Having said that, this set depends on and applies cleanly on top of
the below two sets.

[1] 
https://lore.kernel.org/linux-edac/20240521125434.1555845-1-yazen.ghan...@amd.com/
[2] 
https://lore.kernel.org/linux-edac/20240523155641.2805411-1-yazen.ghan...@amd.com/

Changes in v2:
 - Drop dependencies on sets [1] and [2] above and rebase on top of
   tip/master. (Boris)

Avadhut Naik (2):
  x86/mce: Add wrapper for struct mce to export vendor specific info
  x86/mce, EDAC/mce_amd: Add support for new MCA_SYND{1,2} registers

Yazen Ghannam (2):
  x86/mce/apei: Handle variable register array size
  EDAC/mce_amd: Add support for FRU Text in MCA

 arch/x86/include/asm/mce.h  |  20 ++-
 arch/x86/kernel/cpu/mce/amd.c   |  33 ++--
 arch/x86/kernel/cpu/mce/apei.c  | 119 ++
 arch/x86/kernel/cpu/mce/core.c  | 201 ++--
 arch/x86/kernel/cpu/mce/dev-mcelog.c|   2 +-
 arch/x86/kernel/cpu/mce/genpool.c   |  20 +--
 arch/x86/kernel/cpu/mce/inject.c|   4 +-
 arch/x86/kernel/cpu/mce/internal.h  |   4 +-
 drivers/acpi/acpi_extlog.c  |   2 +-
 drivers/acpi/nfit/mce.c |   2 +-
 drivers/edac/i7core_edac.c  |   2 +-
 drivers/edac/igen6_edac.c   |   2 +-
 drivers/edac/mce_amd.c  |  27 +++-
 drivers/edac/pnd2_edac.c|   2 +-
 drivers/edac/sb_edac.c  |   2 +-
 drivers/edac/skx_common.c   |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c |   2 +-
 drivers/ras/amd/fmpm.c  |   2 +-
 drivers/ras/cec.c   |   2 +-
 include/trace/events/mce.h  |  51 +++---
 20 files changed, 316 insertions(+), 185 deletions(-)


base-commit: 4fe5c16f5e5e0bd1a71a5ac79b5870f91b6b8e81
-- 
2.34.1




[RFC PATCH v2] ptp: Add vDSO-style vmclock support

2024-06-25 Thread David Woodhouse
From: David Woodhouse 

The vmclock "device" provides a shared memory region with precision clock
information. By using shared memory, it is safe across Live Migration.

Like the KVM PTP clock, this can convert TSC-based cross timestamps into
KVM clock values. Unlike the KVM PTP clock, it does so only when doing so
is actually helpful.

The memory region of the device is also exposed to userspace so it can be
read or memory mapped by applications which need reliable notification of
clock disruptions.

Signed-off-by: David Woodhouse 
---

v2: 
 • Add gettimex64() support
 • Convert TSC values to KVM clock when appropriate
 • Require int128 support
 • Add counter_period_shift 
 • Add timeout when seq_count is invalid
 • Add flags field
 • Better comments in vmclock ABI structure
 • Explicitly forbid smearing (as clock rates would need to change)

 drivers/ptp/Kconfig  |  13 +
 drivers/ptp/Makefile |   1 +
 drivers/ptp/ptp_vmclock.c| 516 +++
 include/uapi/linux/vmclock.h | 138 ++
 4 files changed, 668 insertions(+)
 create mode 100644 drivers/ptp/ptp_vmclock.c
 create mode 100644 include/uapi/linux/vmclock.h

diff --git a/drivers/ptp/Kconfig b/drivers/ptp/Kconfig
index 604541dcb320..e98c9767e0ef 100644
--- a/drivers/ptp/Kconfig
+++ b/drivers/ptp/Kconfig
@@ -131,6 +131,19 @@ config PTP_1588_CLOCK_KVM
  To compile this driver as a module, choose M here: the module
  will be called ptp_kvm.
 
+config PTP_1588_CLOCK_VMCLOCK
+   tristate "Virtual machine PTP clock"
+   depends on X86_TSC || ARM_ARCH_TIMER
+   depends on PTP_1588_CLOCK && ACPI && ARCH_SUPPORTS_INT128
+   default y
+   help
+ This driver adds support for using a virtual precision clock
+ advertised by the hypervisor. This clock is only useful in virtual
+ machines where such a device is present.
+
+ To compile this driver as a module, choose M here: the module
+ will be called ptp_vmclock.
+
 config PTP_1588_CLOCK_IDT82P33
tristate "IDT 82P33xxx PTP clock"
depends on PTP_1588_CLOCK && I2C
diff --git a/drivers/ptp/Makefile b/drivers/ptp/Makefile
index 68bf02078053..01b5cd91eb61 100644
--- a/drivers/ptp/Makefile
+++ b/drivers/ptp/Makefile
@@ -11,6 +11,7 @@ obj-$(CONFIG_PTP_1588_CLOCK_DTE)  += ptp_dte.o
 obj-$(CONFIG_PTP_1588_CLOCK_INES)  += ptp_ines.o
 obj-$(CONFIG_PTP_1588_CLOCK_PCH)   += ptp_pch.o
 obj-$(CONFIG_PTP_1588_CLOCK_KVM)   += ptp_kvm.o
+obj-$(CONFIG_PTP_1588_CLOCK_VMCLOCK)   += ptp_vmclock.o
 obj-$(CONFIG_PTP_1588_CLOCK_QORIQ) += ptp-qoriq.o
 ptp-qoriq-y+= ptp_qoriq.o
 ptp-qoriq-$(CONFIG_DEBUG_FS)   += ptp_qoriq_debugfs.o
diff --git a/drivers/ptp/ptp_vmclock.c b/drivers/ptp/ptp_vmclock.c
new file mode 100644
index ..e19c2eed8009
--- /dev/null
+++ b/drivers/ptp/ptp_vmclock.c
@@ -0,0 +1,516 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Virtual PTP 1588 clock for use with LM-safe VMclock device.
+ *
+ * Copyright © 2024 Amazon.com, Inc. or its affiliates.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#ifdef CONFIG_X86
+#include 
+#include 
+#endif
+
+static DEFINE_IDA(vmclock_ida);
+
+ACPI_MODULE_NAME("vmclock");
+
+struct vmclock_state {
+   phys_addr_t phys_addr;
+   struct vmclock_abi *clk;
+   struct miscdevice miscdev;
+   struct ptp_clock_info ptp_clock_info;
+   struct ptp_clock *ptp_clock;
+   enum clocksource_ids cs_id, sys_cs_id;
+   int index;
+   char *name;
+};
+
+#define VMCLOCK_MAX_WAIT ms_to_ktime(100)
+
+/*
+ * Multiply a 64-bit count by a 64-bit tick 'period' in units of seconds >> 64
+ * and add the fractional second part of the reference time.
+ *
+ * The result is a 128-bit value, the top 64 bits of which are seconds, and
+ * the low 64 bits are (seconds >> 64).
+ *
+ * If __int128 isn't available, perform the calculation 32 bits at a time to
+ * avoid overflow.
+ */
+static inline uint64_t mul_u64_u64_shr_add_u64(uint64_t *res_hi, uint64_t delta,
+  uint64_t period, uint8_t shift,
+  uint64_t frac_sec)
+{
+   unsigned __int128 res = (unsigned __int128)delta * period;
+
+   res >>= shift;
+   res += frac_sec;
+   *res_hi = res >> 64;
+   return (uint64_t)res;
+}
+
+static int vmclock_get_crosststamp(struct vmclock_state *st,
+  struct ptp_system_timestamp *sts,
+  struct system_counterval_t *system_counter,
+  struct timespec64 *tspec)
+{
+   ktime_t deadline = ktime_add(ktime_get(), VMCLOCK_MAX_WAIT);
+   struct system_time_snapshot systime_snapshot;
+   uint64_t cycle, delta, seq, frac_sec;
+
+#ifdef CONFIG_X86
+   /*
+* We'd expect the hypervisor to know t

[PATCH] virtio-pci: PCI extended capabilities for virtio

2024-06-25 Thread sahanlb
PCI legacy configuration space does not have sufficient space for a device
that supports all kinds of virtio structures via PCI capabilities. This is
especially true if one were to use virtio drivers with physical devices.
Link: https://par.nsf.gov/servlets/purl/10463939
A physical device may already have many capabilities in the legacy space.

This patch adds support to place virtio capabilities in the PCI extended
configuration space and makes the driver search both legacy and extended
PCI configuration spaces.

Add new argument to vp_modern_map_capability to indicate whether mapping
a legacy or extended capability.
Add new function virtio_pci_find_ext_capability to walk extended
capabilities and find virtio capabilities.

Modify vp_modern_probe to search both legacy and extended configuration
spaces.
If virtio_pci_find_capability fails to find common, isr, notify, or device
virtio structures, call virtio_pci_find_ext_capability.

The notify virtio structure can get mapped either in vp_modern_probe or in
vp_modern_map_vq_notify. Add a new attribute 'notify_ecap' to
struct virtio_pci_modern_device to indicate whether the notify capability
is in the extended configuration structure.

Add virtio extended capability structures to
"include/uapi/linux/virtio_pci.h".
Format for the extended structures derived from
Link: https://lore.kernel.org/all/20220112055755.41011-2-jasow...@redhat.com/

This patch has been validated using an FPGA development board to implement 
a virtio interface.

Signed-off-by: sahanlb 
---
 drivers/virtio/virtio_pci_modern_dev.c | 174 -
 include/linux/virtio_pci_modern.h  |   1 +
 include/uapi/linux/virtio_pci.h|  31 +
 3 files changed, 175 insertions(+), 31 deletions(-)

diff --git a/drivers/virtio/virtio_pci_modern_dev.c b/drivers/virtio/virtio_pci_modern_dev.c
index 0d3dbfaf4b23..75d0555edc7a 100644
--- a/drivers/virtio/virtio_pci_modern_dev.c
+++ b/drivers/virtio/virtio_pci_modern_dev.c
@@ -15,26 +15,41 @@
  * @size: map size
  * @len: the length that is actually mapped
  * @pa: physical address of the capability
+ * @ecap: capability is in the extended config space
  *
  * Returns the io address of for the part of the capability
  */
 static void __iomem *
 vp_modern_map_capability(struct virtio_pci_modern_device *mdev, int off,
 size_t minlen, u32 align, u32 start, u32 size,
-size_t *len, resource_size_t *pa)
+size_t *len, resource_size_t *pa, u8 ecap)
 {
struct pci_dev *dev = mdev->pci_dev;
u8 bar;
u32 offset, length;
void __iomem *p;
 
-   pci_read_config_byte(dev, off + offsetof(struct virtio_pci_cap,
-bar),
-&bar);
-   pci_read_config_dword(dev, off + offsetof(struct virtio_pci_cap, offset),
-&offset);
-   pci_read_config_dword(dev, off + offsetof(struct virtio_pci_cap, length),
- &length);
+   if (ecap) {
+   pci_read_config_byte(dev, off + offsetof(struct virtio_pci_ecap,
+bar),
+&bar);
+   pci_read_config_dword(dev, off + offsetof(struct virtio_pci_ecap,
+ offset),
+&offset);
+   pci_read_config_dword(dev, off + offsetof(struct virtio_pci_ecap,
+ length),
+ &length);
+   } else {
+   pci_read_config_byte(dev, off + offsetof(struct virtio_pci_cap,
+bar),
+&bar);
+   pci_read_config_dword(dev, off + offsetof(struct virtio_pci_cap,
+ offset),
+&offset);
+   pci_read_config_dword(dev, off + offsetof(struct virtio_pci_cap,
+ length),
+ &length);
+   }
 
/* Check if the BAR may have changed since we requested the region. */
if (bar >= PCI_STD_NUM_BARS || !(mdev->modern_bars & (1 << bar))) {
@@ -142,6 +157,47 @@ static inline int virtio_pci_find_capability(struct pci_dev *dev, u8 cfg_type,
return 0;
 }
 
+/**
+ * virtio_pci_find_ext_capability - walk extended capabilities to find device info.
+ * @dev: the pci device
+ * @cfg_type: the VIRTIO_PCI_CAP_* value we seek
+ * @ioresource_types: IORESOURCE_MEM and/or IORESOURCE_IO.
+ * @bars: the bitmask of BARs
+ *
+ * Returns offset of the capability, or 0.
+ */
+static inline int virtio_pci_find_ext_capability(struct pci_dev *dev, u16 cfg_type,
+u32 ioreso

Re: [PATCH v3 2/2] rust: add tracepoint support

2024-06-25 Thread Boqun Feng
On Fri, Jun 21, 2024 at 02:52:10PM +0200, Alice Ryhl wrote:
[...]
> 
> Hmm, I tried using the support where I have both events and hooks:
> 
> #define CREATE_TRACE_POINTS
> #define CREATE_RUST_TRACE_POINTS
> #include 
> #include 
> 
> But it's not really working. Initially I thought that it's because I
> need to undef DEFINE_RUST_DO_TRACE at the end of this file, but even
> when I added that, I still get this error:
> 
> error: redefinition of 'str__rust_binder__trace_system_name'
> 
> Is the Rust support missing something, or is the answer just that you
> can't have two files of the same name like this? Or am I doing
> something else wrong?
> 

Because your hooks/rust_binder.h and events/rust_binder.h use the same
TRACE_SYSTEM name? Could you try something like:

#define TRACE_SYSTEM rust_binder_hook

in your hooks/rust_binder.h?

Regards,
Boqun

> Alice



Re: [PATCH v2 6/7] mm, slab: add static key for should_failslab()

2024-06-25 Thread Vlastimil Babka
On 6/25/24 7:12 PM, Alexei Starovoitov wrote:
> On Tue, Jun 25, 2024 at 7:24 AM Vlastimil Babka  wrote:
>>
>> On 6/20/24 12:49 AM, Vlastimil Babka wrote:
>> > --- a/mm/slub.c
>> > +++ b/mm/slub.c
>> > @@ -3874,13 +3874,37 @@ static __always_inline void 
>> > maybe_wipe_obj_freeptr(struct kmem_cache *s,
>> >   0, sizeof(void *));
>> >  }
>> >
>> > -noinline int should_failslab(struct kmem_cache *s, gfp_t gfpflags)
>> > +#if defined(CONFIG_FUNCTION_ERROR_INJECTION) || defined(CONFIG_FAILSLAB)
>> > +DEFINE_STATIC_KEY_FALSE(should_failslab_active);
>> > +
>> > +#ifdef CONFIG_FUNCTION_ERROR_INJECTION
>> > +noinline
>> > +#else
>> > +static inline
>> > +#endif
>> > +int should_failslab(struct kmem_cache *s, gfp_t gfpflags)
>>
>> Note that it has been found that (regardless of this series) gcc may clone
>> this to a should_failslab.constprop.0 in case the function is empty because
>> __should_failslab is compiled out (CONFIG_FAILSLAB=n). The "noinline"
>> doesn't help - the original function stays but only the clone is actually
>> being called, thus overriding the original function achieves nothing, see:
>> https://github.com/bpftrace/bpftrace/issues/3258
>>
>> So we could use __noclone to prevent that, and I was thinking by adding
>> something like this to error-injection.h:
>>
>> #ifdef CONFIG_FUNCTION_ERROR_INJECTION
>> #define __error_injectable(alternative) noinline __noclone
> 
> To prevent such compiler transformations we typically use
> __used noinline
> 
> We didn't have a need for __noclone yet. If __used is enough I'd stick to 
> that.

__used made no difference here (gcc 13.3), __noclone did



Re: [PATCH v2 6/7] mm, slab: add static key for should_failslab()

2024-06-25 Thread Alexei Starovoitov
On Tue, Jun 25, 2024 at 7:24 AM Vlastimil Babka  wrote:
>
> On 6/20/24 12:49 AM, Vlastimil Babka wrote:
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -3874,13 +3874,37 @@ static __always_inline void 
> > maybe_wipe_obj_freeptr(struct kmem_cache *s,
> >   0, sizeof(void *));
> >  }
> >
> > -noinline int should_failslab(struct kmem_cache *s, gfp_t gfpflags)
> > +#if defined(CONFIG_FUNCTION_ERROR_INJECTION) || defined(CONFIG_FAILSLAB)
> > +DEFINE_STATIC_KEY_FALSE(should_failslab_active);
> > +
> > +#ifdef CONFIG_FUNCTION_ERROR_INJECTION
> > +noinline
> > +#else
> > +static inline
> > +#endif
> > +int should_failslab(struct kmem_cache *s, gfp_t gfpflags)
>
> Note that it has been found that (regardless of this series) gcc may clone
> this to a should_failslab.constprop.0 in case the function is empty because
> __should_failslab is compiled out (CONFIG_FAILSLAB=n). The "noinline"
> doesn't help - the original function stays but only the clone is actually
> being called, thus overriding the original function achieves nothing, see:
> https://github.com/bpftrace/bpftrace/issues/3258
>
> So we could use __noclone to prevent that, and I was thinking by adding
> something like this to error-injection.h:
>
> #ifdef CONFIG_FUNCTION_ERROR_INJECTION
> #define __error_injectable(alternative) noinline __noclone

To prevent such compiler transformations we typically use
__used noinline

We didn't have a need for __noclone yet. If __used is enough I'd stick to that.



Re: (subset) [PATCH v9 0/5] soc: qcom: add in-kernel pd-mapper implementation

2024-06-25 Thread Bjorn Andersson


On Sat, 22 Jun 2024 01:03:39 +0300, Dmitry Baryshkov wrote:
> Protection domain mapper is a QMI service providing mapping between
> 'protection domains' and services supported / allowed in these domains.
> For example such mapping is required for loading of the WiFi firmware or
> for properly starting up the UCSI / altmode / battery manager support.
> 
> The existing userspace implementation has several issues. It doesn't play
> well with CONFIG_EXTRA_FIRMWARE, it doesn't reread the JSON files if the
> firmware location is changed (or if the firmware was not available at
> the time pd-mapper was started but the corresponding directory is
> mounted later), etc.
> 
> [...]

Applied, thanks!

[5/5] remoteproc: qcom: enable in-kernel PD mapper
  commit: 5b9f51b200dcb2c3924ecbff324fa52f1faa84d3

Best regards,
-- 
Bjorn Andersson 



Re: [PATCH v3 1/2] rust: add static_key_false

2024-06-25 Thread Boqun Feng
Hi Alice,

On Fri, Jun 21, 2024 at 10:35:26AM +, Alice Ryhl wrote:
> Add just enough support for static key so that we can use it from
> tracepoints. Tracepoints rely on `static_key_false` even though it is
> deprecated, so we add the same functionality to Rust.
> 
> It is not possible to use the existing C implementation of
> arch_static_branch because it passes the argument `key` to inline
> assembly as an 'i' parameter, so any attempt to add a C helper for this
> function will fail to compile because the value of `key` must be known
> at compile-time.
> 
> Signed-off-by: Alice Ryhl 

[Add linux-arch, and related arch maintainers Cced]

Since inline asms are touched here, please consider copying linux-arch
and arch maintainers next time ;-)

> ---
>  rust/kernel/lib.rs|   1 +
>  rust/kernel/static_key.rs | 143 ++
>  scripts/Makefile.build|   2 +-
>  3 files changed, 145 insertions(+), 1 deletion(-)
> 
> diff --git a/rust/kernel/lib.rs b/rust/kernel/lib.rs
> index fbd91a48ff8b..a0be9544996d 100644
> --- a/rust/kernel/lib.rs
> +++ b/rust/kernel/lib.rs
> @@ -38,6 +38,7 @@
>  pub mod prelude;
>  pub mod print;
>  mod static_assert;
> +pub mod static_key;
>  #[doc(hidden)]
>  pub mod std_vendor;
>  pub mod str;
> diff --git a/rust/kernel/static_key.rs b/rust/kernel/static_key.rs
> new file mode 100644
> index ..9c844fe3e3a3
> --- /dev/null
> +++ b/rust/kernel/static_key.rs
> @@ -0,0 +1,143 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +// Copyright (C) 2024 Google LLC.
> +
> +//! Logic for static keys.
> +
> +use crate::bindings::*;
> +
> +#[doc(hidden)]
> +#[macro_export]
> +#[cfg(target_arch = "x86_64")]
> +macro_rules! _static_key_false {
> +($key:path, $keytyp:ty, $field:ident) => {'my_label: {
> +core::arch::asm!(
> +r#"
> +1: .byte 0x0f,0x1f,0x44,0x00,0x00
> +
> +.pushsection __jump_table,  "aw"
> +.balign 8
> +.long 1b - .
> +.long {0} - .
> +.quad {1} + {2} - .
> +.popsection
> +"#,
> +label {
> +break 'my_label true;
> +},
> +sym $key,
> +const ::core::mem::offset_of!($keytyp, $field),
> +);
> +
> +break 'my_label false;
> +}};
> +}
> +
> +#[doc(hidden)]
> +#[macro_export]
> +#[cfg(target_arch = "aarch64")]
> +macro_rules! _static_key_false {
> +($key:path, $keytyp:ty, $field:ident) => {'my_label: {
> +core::arch::asm!(
> +r#"
> +1: nop
> +
> +.pushsection __jump_table,  "aw"
> +.align 3
> +.long 1b - ., {0} - .
> +.quad {1} + {2} - .
> +.popsection
> +"#,
> +label {
> +break 'my_label true;
> +},
> +sym $key,
> +const ::core::mem::offset_of!($keytyp, $field),
> +);
> +
> +break 'my_label false;
> +}};
> +}
> +

For x86_64 and arm64 bits:

Acked-by: Boqun Feng 

One thing though, we should split the arch-specific impls into different
files, for example: rust/kernel/arch/arm64.rs or rust/arch/arm64.rs.
That'll make it easier for arch maintainers to watch the Rust changes
related to a particular architecture.

Another thought is that, could you implement an arch_static_branch!()
(instead of _static_key_false!()) and use it for static_key_false!()
similar to what we have in C? The benefit is that at least for myself
it'll be easier to compare the implementation between C and Rust.

Regards,
Boqun

> +#[doc(hidden)]
> +#[macro_export]
> +#[cfg(target_arch = "loongarch64")]
> +macro_rules! _static_key_false {
> +($key:path, $keytyp:ty, $field:ident) => {'my_label: {
> +core::arch::asm!(
> +r#"
> +1: nop
> +
> +.pushsection __jump_table,  "aw"
> +.align 3
> +.long 1b - ., {0} - .
> +.quad {1} + {2} - .
> +.popsection
> +"#,
> +label {
> +break 'my_label true;
> +},
> +sym $key,
> +const ::core::mem::offset_of!($keytyp, $field),
> +);
> +
> +break 'my_label false;
> +}};
> +}
> +
> +#[doc(hidden)]
> +#[macro_export]
> +#[cfg(target_arch = "riscv64")]
> +macro_rules! _static_key_false {
> +($key:path, $keytyp:ty, $field:ident) => {'my_label: {
> +core::arch::asm!(
> +r#"
> +.align  2
> +.option push
> +.option norelax
> +.option norvc
> +1: nop
> +.option pop
> +.pushsection __jump_table,  "aw"
> +.align 3
> +.long 1b - ., {0} - .
> +.dword {1} + {2} - .
> +.popsection
> +"#,
> +label {
> +break 'my_label true;
> +},
> +sym $key,
> +

Re: [PATCH 1/8] riscv: stacktrace: convert arch_stack_walk() to noinstr

2024-06-25 Thread Palmer Dabbelt

On Tue, 18 Jun 2024 02:55:32 PDT (-0700), a...@ghiti.fr wrote:

Hi Andy,

On 13/06/2024 09:11, Andy Chiu wrote:

arch_stack_walk() is called intensively in function_graph when the
kernel is compiled with CONFIG_TRACE_IRQFLAGS. As a result, the kernel
logs a lot of arch_stack_walk and its sub-functions into the ftrace
buffer. However, these functions should not appear in the trace log
because they are part of ftrace itself. This patch mirrors what arm64
does for the same function. It also prevents the kprobe re-entrance
issue, which is possible on riscv as well.

Related-to: commit 0fbcd8abf337 ("arm64: Prohibit instrumentation on arch_stack_walk()")
Fixes: 680341382da5 ("riscv: add CALLER_ADDRx support")
Signed-off-by: Andy Chiu 
---
  arch/riscv/kernel/stacktrace.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/riscv/kernel/stacktrace.c b/arch/riscv/kernel/stacktrace.c
index 528ec7cc9a62..0d3f00eb0bae 100644
--- a/arch/riscv/kernel/stacktrace.c
+++ b/arch/riscv/kernel/stacktrace.c
@@ -156,7 +156,7 @@ unsigned long __get_wchan(struct task_struct *task)
return pc;
  }

-noinline void arch_stack_walk(stack_trace_consume_fn consume_entry, void *cookie,
+noinline noinstr void arch_stack_walk(stack_trace_consume_fn consume_entry, void *cookie,
 struct task_struct *task, struct pt_regs *regs)
  {
walk_stackframe(task, regs, consume_entry, cookie);



Reviewed-by: Alexandre Ghiti 

I'll try to make this go into -fixes, this is in my fixes branch at least.


Looks like there's some comments on the rest of the patch set.

Andy: are you going to send a v2, or do you want me to just pick this 
one up onto fixes now and then handle the rest later?




Thanks,

Alex




Re: [PATCH v3] x86/paravirt: Disable virt spinlock on bare metal

2024-06-25 Thread Nikolay Borisov




On 25.06.24 г. 17:50 ч., Chen Yu wrote:

On 2024-06-25 at 16:42:11 +0300, Nikolay Borisov wrote:



On 25.06.24 г. 15:54 ч., Chen Yu wrote:

The kernel can change spinlock behavior when running as a guest. But
this guest-friendly behavior causes performance problems on bare metal.
So there's a 'virt_spin_lock_key' static key to switch between the two
modes.

The static key is always enabled by default (run in guest mode) and
should be disabled for bare metal (and in some guests that want native
behavior).

Performance drop is reported when running encode/decode workload and
BenchSEE cache sub-workload.
Bisect points to commit ce0a1b608bfc ("x86/paravirt: Silence unused
native_pv_lock_init() function warning"). When CONFIG_PARAVIRT_SPINLOCKS
is disabled the virt_spin_lock_key is incorrectly set to true on bare
metal. The qspinlock degenerates to a test-and-set spinlock, which
decreases performance on bare metal.

Set the default value of virt_spin_lock_key to false. If booting in a VM,
enable this key. Later during VM initialization, if another more
efficient spinlock is preferred (e.g. paravirt spinlock),
virt_spin_lock_key is disabled accordingly. The relation is described
below:

X86_FEATURE_HYPERVISOR     Y    Y    Y    N
CONFIG_PARAVIRT_SPINLOCKS  Y    Y    N    Y/N
PV spinlock                Y    N    N    Y/N

virt_spin_lock_key         N    N    Y    N

-DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key);
+DEFINE_STATIC_KEY_FALSE(virt_spin_lock_key);
   void __init native_pv_lock_init(void)
   {
-   if (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) &&


Actually now shouldn't the CONFIG_PARAVIRT_SPINLOCKS check be retained?
Otherwise we'll have the virt spinlock enabled even if we are a guest but
CONFIG_PARAVIRT_SPINLOCKS is disabled, no?



It seems to be the expected behavior? If CONFIG_PARAVIRT_SPINLOCKS is disabled,
should the virt_spin_lock_key be enabled in the guest?


No, but if it's disabled and we are under a hypervisor shouldn't the 
virt spinlock be kept disabled? As it stands now, every time we are under 
a hypervisor the virt spinlock is enabled irrespective of the 
PARAVIRT_SPINLOCK config state.



The previous behavior before commit ce0a1b608bfc ("x86/paravirt: Silence unused
native_pv_lock_init() function warning"): kvm_spinlock_init() is NULL if
CONFIG_PARAVIRT_SPINLOCKS is disabled, and 
static_branch_disable(&virt_spin_lock_key)
cannot be invoked, so virt_spin_lock_key stays enabled.

thanks,
Chenyu
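The relation table above can be sketched as plain C with a boolean standing in for the static key (illustrative only — the function names mirror the kernel's, but the real code uses static_branch_enable()/static_branch_disable() on a patched jump label):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for DEFINE_STATIC_KEY_FALSE(virt_spin_lock_key): off by
 * default, so bare metal needs no action at all. */
static bool virt_spin_lock_key = false;

/* native_pv_lock_init(): enable the key only when running as a guest,
 * i.e. when X86_FEATURE_HYPERVISOR is set. */
static void native_pv_lock_init(bool feature_hypervisor)
{
	if (feature_hypervisor)
		virt_spin_lock_key = true;	/* static_branch_enable() */
}

/* A PV spinlock implementation (e.g. kvm_spinlock_init() with
 * CONFIG_PARAVIRT_SPINLOCKS=y) later switches the key off again. */
static void pv_spinlock_init(void)
{
	virt_spin_lock_key = false;	/* static_branch_disable() */
}
```

This matches the table: the key ends up enabled only in the guest-without-PV-spinlock column.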





Re: [PATCH] x86/vmware: fix panic in vmware_hypercall_slow()

2024-06-25 Thread Alexey Makhalov




On 6/25/24 7:51 AM, Borislav Petkov wrote:

On Tue, Jun 25, 2024 at 07:45:50AM -0700, Alexey Makhalov wrote:

My test environment was screwed up during the last version of the patchset.
I was using a kernel which was built previously and didn't pay attention to
commit hash suffix in `uname -r`.
Human mistake, apologies for that.


Ok, no worries, happens.


Alex found it while doing TDX testing on x86/vmware on tip.

Do you want me to resubmit the patchset so as not to break git bisect?


Nah, I'll fold it into the broken patch and rebase.


Thanks for your time.



Btw, pls do me a favor and do not top-post but put your reply underneath the
quoted text, like we all do.

Noted.



Thx.





Re: [PATCH] x86/vmware: fix panic in vmware_hypercall_slow()

2024-06-25 Thread Borislav Petkov
On Tue, Jun 25, 2024 at 07:45:50AM -0700, Alexey Makhalov wrote:
> My test environment was screwed up during the last version of the patchset.
> I was using a kernel which was built previously and didn't pay attention to
> commit hash suffix in `uname -r`.
> Human mistake, apologies for that.

Ok, no worries, happens.

> Alex found it while doing TDX testing on x86/vmware on tip.
> 
> Do you want me to resubmit the patchset so as not to break git bisect?

Nah, I'll fold it into the broken patch and rebase.

Btw, pls do me a favor and do not top-post but put your reply underneath the
quoted text, like we all do.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



Re: [PATCH v3] x86/paravirt: Disable virt spinlock on bare metal

2024-06-25 Thread Chen Yu
On 2024-06-25 at 16:42:11 +0300, Nikolay Borisov wrote:
> 
> 
> On 25.06.24 г. 15:54 ч., Chen Yu wrote:
> > The kernel can change spinlock behavior when running as a guest. But
> > this guest-friendly behavior causes performance problems on bare metal.
> > So there's a 'virt_spin_lock_key' static key to switch between the two
> > modes.
> > 
> > The static key is always enabled by default (run in guest mode) and
> > should be disabled for bare metal (and in some guests that want native
> > behavior).
> > 
> > Performance drop is reported when running encode/decode workload and
> > BenchSEE cache sub-workload.
> > Bisect points to commit ce0a1b608bfc ("x86/paravirt: Silence unused
> > native_pv_lock_init() function warning"). When CONFIG_PARAVIRT_SPINLOCKS
> > is disabled the virt_spin_lock_key is incorrectly set to true on bare
> > metal. The qspinlock degenerates to a test-and-set spinlock, which
> > decreases performance on bare metal.
> > 
> > Set the default value of virt_spin_lock_key to false. If booting in a VM,
> > enable this key. Later during VM initialization, if another more
> > efficient spinlock is preferred (e.g. paravirt spinlock),
> > virt_spin_lock_key is disabled accordingly. The relation is described
> > below:
> > 
> > X86_FEATURE_HYPERVISOR     Y    Y    Y    N
> > CONFIG_PARAVIRT_SPINLOCKS  Y    Y    N    Y/N
> > PV spinlock                Y    N    N    Y/N
> > 
> > virt_spin_lock_key         N    N    Y    N
> > 
> > -DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key);
> > +DEFINE_STATIC_KEY_FALSE(virt_spin_lock_key);
> >   void __init native_pv_lock_init(void)
> >   {
> > -   if (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) &&
> 
> Actually now shouldn't the CONFIG_PARAVIRT_SPINLOCKS check be retained?
> Otherwise we'll have the virt spinlock enabled even if we are a guest but
> CONFIG_PARAVIRT_SPINLOCKS is disabled, no?
>

It seems to be the expected behavior? If CONFIG_PARAVIRT_SPINLOCKS is disabled,
should the virt_spin_lock_key be enabled in the guest?
The previous behavior before commit ce0a1b608bfc ("x86/paravirt: Silence unused
native_pv_lock_init() function warning"): kvm_spinlock_init() is NULL if
CONFIG_PARAVIRT_SPINLOCKS is disabled, and 
static_branch_disable(&virt_spin_lock_key)
cannot be invoked, so virt_spin_lock_key stays enabled.

thanks,
Chenyu




Re: [PATCH] x86/vmware: fix panic in vmware_hypercall_slow()

2024-06-25 Thread Alexey Makhalov
My test environment was screwed up during the last version of the 
patchset. I was using a kernel which was built previously and didn't pay 
attention to commit hash suffix in `uname -r`.

Human mistake, apologies for that.

Alex found it while doing TDX testing on x86/vmware on tip.

Do you want me to resubmit the patchset so as not to break git bisect?

On 6/25/24 1:47 AM, Borislav Petkov wrote:

On Tue, Jun 25, 2024 at 01:33:48AM -0700, Alexey Makhalov wrote:

Caller of vmware_hypercall_slow() can pass NULL into *out1,
*out2,... *out5. It will lead to a NULL pointer dereference.

Check a pointer for NULL before assigning a value.


I queue your patches and *now* you find this?!

How did you test them in the first place and why was this scenario missed?

Geez.





Re: [PATCH v2 2/7] error-injection: support static keys around injectable functions

2024-06-25 Thread Steven Rostedt
On Thu, 20 Jun 2024 00:48:56 +0200
Vlastimil Babka  wrote:

> @@ -86,6 +104,7 @@ static void populate_error_injection_list(struct 
> error_injection_entry *start,
>   ent->start_addr = entry;
>   ent->end_addr = entry + size;
>   ent->etype = iter->etype;
> + ent->key = (struct static_key *) iter->static_key_addr;

Nit, should there be a space between the typecast and the "iter"?

>   ent->priv = priv;
>   INIT_LIST_HEAD(&ent->list);
>   list_add_tail(&ent->list, &error_injection_list);

Reviewed-by: Steven Rostedt (Google) 

-- Steve



Re: [PATCH v2 6/7] mm, slab: add static key for should_failslab()

2024-06-25 Thread Vlastimil Babka
On 6/20/24 12:49 AM, Vlastimil Babka wrote:
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3874,13 +3874,37 @@ static __always_inline void 
> maybe_wipe_obj_freeptr(struct kmem_cache *s,
>   0, sizeof(void *));
>  }
>  
> -noinline int should_failslab(struct kmem_cache *s, gfp_t gfpflags)
> +#if defined(CONFIG_FUNCTION_ERROR_INJECTION) || defined(CONFIG_FAILSLAB)
> +DEFINE_STATIC_KEY_FALSE(should_failslab_active);
> +
> +#ifdef CONFIG_FUNCTION_ERROR_INJECTION
> +noinline
> +#else
> +static inline
> +#endif
> +int should_failslab(struct kmem_cache *s, gfp_t gfpflags)

Note that it has been found that (regardless of this series) gcc may clone
this to a should_failslab.constprop.0 in case the function is empty because
__should_failslab is compiled out (CONFIG_FAILSLAB=n). The "noinline"
doesn't help - the original function stays but only the clone is actually
being called, thus overriding the original function achieves nothing, see:
https://github.com/bpftrace/bpftrace/issues/3258

So we could use __noclone to prevent that, and I was thinking of adding
something like this to error-injection.h:

#ifdef CONFIG_FUNCTION_ERROR_INJECTION
#define __error_injectable(alternative) noinline __noclone
#else
#define __error_injectable(alternative) alternative
#endif

and the usage here would be:

__error_injectable(static inline) int should_failslab(...)

Does that look acceptable, or is it too confusing that "static inline" is
specified there as the storage class to use when error injection is actually
disabled?

>  {
>   if (__should_failslab(s, gfpflags))
>   return -ENOMEM;
>   return 0;
>  }
> -ALLOW_ERROR_INJECTION(should_failslab, ERRNO);
> +ALLOW_ERROR_INJECTION_KEY(should_failslab, ERRNO, &should_failslab_active);
> +
> +static __always_inline int should_failslab_wrapped(struct kmem_cache *s,
> +gfp_t gfp)
> +{
> + if (static_branch_unlikely(&should_failslab_active))
> + return should_failslab(s, gfp);
> + else
> + return 0;
> +}
> +#else
> +static __always_inline int should_failslab_wrapped(struct kmem_cache *s,
> +gfp_t gfp)
> +{
> + return false;
> +}
> +#endif
>  
>  static __fastpath_inline
>  struct kmem_cache *slab_pre_alloc_hook(struct kmem_cache *s, gfp_t flags)
> @@ -3889,7 +3913,7 @@ struct kmem_cache *slab_pre_alloc_hook(struct 
> kmem_cache *s, gfp_t flags)
>  
>   might_alloc(flags);
>  
> - if (unlikely(should_failslab(s, flags)))
> + if (should_failslab_wrapped(s, flags))
>   return NULL;
>  
>   return s;
> 
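The fast-path wrapper in the hunk above can be modelled outside the kernel with an ordinary boolean standing in for the static key (a sketch for illustration; real static keys patch the branch at runtime, and the names here mirror but are not the kernel's):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for DEFINE_STATIC_KEY_FALSE(should_failslab_active). */
static bool should_failslab_active = false;
static int hook_calls = 0;

/* Out-of-line hook: reached only when some consumer enabled the key
 * (e.g. failslab configured, or an error-injection user attached). */
static int should_failslab(void)
{
	hook_calls++;
	return 0;	/* 0: allocation may proceed */
}

/* Inlined fast path: with the key off this is a predicted-false branch
 * (a NOP with real static keys) and the hook is never called. */
static inline int should_failslab_wrapped(void)
{
	if (should_failslab_active)
		return should_failslab();
	return 0;
}
```

The point of the pattern is that the common allocation path pays nothing when no one is injecting failures.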




Re: [PATCH v2 1/7] fault-inject: add support for static keys around fault injection sites

2024-06-25 Thread Steven Rostedt
On Thu, 20 Jun 2024 00:48:55 +0200
Vlastimil Babka  wrote:

> +static int debugfs_prob_set(void *data, u64 val)
> +{
> + struct fault_attr *attr = data;
> +
> + mutex_lock(&probability_mutex);
> +
> + if (attr->active) {
> + if (attr->probability != 0 && val == 0) {
> + static_key_slow_dec(attr->active);
> + } else if (attr->probability == 0 && val != 0) {
> + static_key_slow_inc(attr->active);
> + }
> + }

So basically the above is testing whether the probability is transitioning
between zero and non-zero. For such cases, I find it less confusing to have:

if (attr->active) {
if (!!attr->probability != !!val) {
if (val)
static_key_slow_inc(attr->active);
else
static_key_slow_dec(attr->active);
}
}

This does add a layer of nested ifs, but IMO it's a bit more clear in
what is happening, and it gets rid of the "else if".

Not saying you need to change it. This is more of an FYI.

-- Steve


> +
> + attr->probability = val;
> +
> + mutex_unlock(&probability_mutex);
> +
> + return 0;
> +}
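Steve's `!!attr->probability != !!val` form fires exactly when the value crosses between zero and non-zero, i.e. it is equivalent to the original two-branch test. A quick standalone check (illustrative only, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Original form: explicit non-zero -> zero and zero -> non-zero checks. */
static bool transition_orig(unsigned int prob, unsigned int val)
{
	return (prob != 0 && val == 0) || (prob == 0 && val != 0);
}

/* Suggested form: normalize both sides to 0/1 and compare. */
static bool transition_suggested(unsigned int prob, unsigned int val)
{
	return !!prob != !!val;
}
```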



Re: [RFC PATCH v1 1/2] virtio/vsock: rework deferred credit update logic

2024-06-25 Thread Arseniy Krasnov



On 25.06.2024 16:46, Stefano Garzarella wrote:
> On Fri, Jun 21, 2024 at 10:25:40PM GMT, Arseniy Krasnov wrote:
>> Previous calculation of 'free_space' was wrong (but worked as expected
>> in most cases, see below), because it didn't account for the number of bytes
>> in the rx queue. Let's rework the 'free_space' calculation in the following way:
>> as this value is considered free space at rx side from tx point of view,
>> it must be equal to return value of 'virtio_transport_get_credit()' at
>> tx side. This function uses 'tx_cnt' counter and 'peer_fwd_cnt': first
>> is number of transmitted bytes (without wrap), second is last 'fwd_cnt'
>> value received from rx. So let's use same approach at rx side during
>> 'free_space' calculation: add 'rx_cnt' counter which is number of
>> received bytes (also without wrap) and subtract 'last_fwd_cnt' from it.
>> Now we have:
>> 1) 'rx_cnt' == 'tx_cnt' at both sides.
>> 2) 'last_fwd_cnt' == 'peer_fwd_cnt' - because first is last 'fwd_cnt'
>>   sent to tx, while second is last 'fwd_cnt' received from rx.
>>
>> Now 'free_space' is handled correctly and also we don't need
> 
> mmm, I don't know if it was wrong before, maybe we could say it was less 
> accurate.

Maybe "now 'free_space' is handled in a more precise way and also we ..."?

> 
> That said, could we have the same problem now if we have a lot of producers 
> and the virtqueue becomes full?
> 

I guess if the virtqueue is full, we just wait by returning the skb back to the
tx queue... i.e. data exchange between the two sockets just freezes?

>> 'low_rx_bytes' flag - this was more like a hack.
>>
>> Previous calculation of 'free_space' worked (in 99% cases), because if
>> we take a look on behaviour of both expressions (new and previous):
>>
>> '(rx_cnt - last_fwd_cnt)' and '(fwd_cnt - last_fwd_cnt)'
>>
>> Both of them always grow, with almost the same "speed": the only difference
>> is that 'rx_cnt' is incremented earlier, when a packet is received,
>> while 'fwd_cnt' is incremented when a packet is read by the user. So if 'rx_cnt'
>> grows "faster", the resulting 'free_space' becomes smaller too, so we
>> send credit updates a little bit more, but:
>>
>>  * 'free_space' calculation based on 'rx_cnt' gives the same value,
>>    which tx sees as free space at rx side, so original idea of
> 
> Ditto, what happen if the virtqueue is full?
> 
>>    'free_space' is now implemented as planned.
>>  * Hack with 'low_rx_bytes' now is not needed.
> 
> Yeah, so this patch should also mitigate issue reported by Alex (added in 
> CC), right?
> 
> If yes, please mention that problem and add a Reported-by giving credit to 
> Alex.

Yes, of course!

> 
>>
>> Also here is some performance comparison between both versions of
>> 'free_space' calculation:
>>
>> *--*--*--*
>> |  | 'rx_cnt' | previous |
>> *--*--*--*
>> |H -> G|   8.42   |   7.82   |
>> *--*--*--*
>> |G -> H|   11.6   |   12.1   |
>> *--*--*--*
> 
> How many seconds did you run it? How many repetitions? There's a little 
> discrepancy anyway, but I can't tell if it's just noise.

I ran it 4 times, each run for ~10 seconds... I think I can also add the number of 
credit update messages to this report.

> 
>>
>> As benchmark 'vsock-iperf' with default arguments was used. There is no
>> significant performance difference before and after this patch.
>>
>> Signed-off-by: Arseniy Krasnov 
>> ---
>> include/linux/virtio_vsock.h    | 1 +
>> net/vmw_vsock/virtio_transport_common.c | 8 +++-
>> 2 files changed, 4 insertions(+), 5 deletions(-)
> 
> Thanks for working on this, I'll do more tests but the approach LGTM.

Got it, Thanks

> 
> Thanks,
> Stefano
> 
>>
>> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
>> index c82089dee0c8..3579491c411e 100644
>> --- a/include/linux/virtio_vsock.h
>> +++ b/include/linux/virtio_vsock.h
>> @@ -135,6 +135,7 @@ struct virtio_vsock_sock {
>> u32 peer_buf_alloc;
>>
>> /* Protected by rx_lock */
>> +    u32 rx_cnt;
>> u32 fwd_cnt;
>> u32 last_fwd_cnt;
>> u32 rx_bytes;
>> diff --git a/net/vmw_vsock/virtio_transport_common.c 
>> b/net/vmw_vsock/virtio_transport_common.c
>> index 16ff976a86e3..1d4e2328e06e 100644
>> --- a/net/vmw_vsock/virtio_transport_common.c
>> +++ b/net/vmw_vsock/virtio_transport_common.c
>> @@ -441,6 +441,7 @@ static bool virtio_transport_inc_rx_pkt(struct 
>> virtio_vsock_sock *vvs,
>>     return false;
>>
>> vvs->rx_bytes += len;
>> +    vvs->rx_cnt += len;
>> return true;
>> }
>>
>> @@ -558,7 +559,6 @@ virtio_transport_stream_do_dequeue(struct vsock_sock 
>> *vsk,
>> size_t bytes, total = 0;
>> struct sk_buff *skb;
>> u32 fwd_cnt_delta;
>> -    bool low_rx_bytes;
>> int err = -EFAULT;
>> u32 free_space;
>>
>> @@ -603,9 +603,7 @@ virtio_transport_stream_do_dequeue(struct vsock_sock 
>> *vsk,
>> }
>>
>> fwd_cnt_delta = vvs->fwd_cnt - vvs->last_fwd_cnt;
>> -    free_space = vvs->bu

[PATCH net-next v9 05/13] mm: page_frag: avoid caller accessing 'page_frag_cache' directly

2024-06-25 Thread Yunsheng Lin
Use the appropriate page_frag API instead of the caller accessing
'page_frag_cache' directly.

CC: Alexander Duyck 
Signed-off-by: Yunsheng Lin 
---
 drivers/vhost/net.c |  2 +-
 include/linux/page_frag_cache.h | 10 ++
 mm/page_frag_test.c |  2 +-
 net/core/skbuff.c   |  6 +++---
 net/rxrpc/conn_object.c |  4 +---
 net/rxrpc/local_object.c|  4 +---
 net/sunrpc/svcsock.c|  6 ++
 7 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 6691fac01e0d..b2737dc0dc50 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1325,7 +1325,7 @@ static int vhost_net_open(struct inode *inode, struct 
file *f)
vqs[VHOST_NET_VQ_RX]);
 
f->private_data = n;
-   n->pf_cache.va = NULL;
+   page_frag_cache_init(&n->pf_cache);
 
return 0;
 }
diff --git a/include/linux/page_frag_cache.h b/include/linux/page_frag_cache.h
index c6fde197a6eb..6ac3a25089d1 100644
--- a/include/linux/page_frag_cache.h
+++ b/include/linux/page_frag_cache.h
@@ -23,6 +23,16 @@ struct page_frag_cache {
bool pfmemalloc;
 };
 
+static inline void page_frag_cache_init(struct page_frag_cache *nc)
+{
+   nc->va = NULL;
+}
+
+static inline bool page_frag_cache_is_pfmemalloc(struct page_frag_cache *nc)
+{
+   return !!nc->pfmemalloc;
+}
+
 void page_frag_cache_drain(struct page_frag_cache *nc);
 void __page_frag_cache_drain(struct page *page, unsigned int count);
 void *__page_frag_alloc_va_align(struct page_frag_cache *nc,
diff --git a/mm/page_frag_test.c b/mm/page_frag_test.c
index a0bd0ca5f343..cdffebc20a10 100644
--- a/mm/page_frag_test.c
+++ b/mm/page_frag_test.c
@@ -341,7 +341,7 @@ static int __init page_frag_test_init(void)
u64 duration;
int ret;
 
-   test_frag.va = NULL;
+   page_frag_cache_init(&test_frag);
atomic_set(&nthreads, 2);
init_completion(&wait);
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6a84cc929505..59d42d642067 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -749,14 +749,14 @@ struct sk_buff *__netdev_alloc_skb(struct net_device 
*dev, unsigned int len,
if (in_hardirq() || irqs_disabled()) {
nc = this_cpu_ptr(&netdev_alloc_cache);
data = page_frag_alloc_va(nc, len, gfp_mask);
-   pfmemalloc = nc->pfmemalloc;
+   pfmemalloc = page_frag_cache_is_pfmemalloc(nc);
} else {
local_bh_disable();
local_lock_nested_bh(&napi_alloc_cache.bh_lock);
 
nc = this_cpu_ptr(&napi_alloc_cache.page);
data = page_frag_alloc_va(nc, len, gfp_mask);
-   pfmemalloc = nc->pfmemalloc;
+   pfmemalloc = page_frag_cache_is_pfmemalloc(nc);
 
local_unlock_nested_bh(&napi_alloc_cache.bh_lock);
local_bh_enable();
@@ -846,7 +846,7 @@ struct sk_buff *napi_alloc_skb(struct napi_struct *napi, 
unsigned int len)
len = SKB_HEAD_ALIGN(len);
 
data = page_frag_alloc_va(&nc->page, len, gfp_mask);
-   pfmemalloc = nc->page.pfmemalloc;
+   pfmemalloc = page_frag_cache_is_pfmemalloc(&nc->page);
}
local_unlock_nested_bh(&napi_alloc_cache.bh_lock);
 
diff --git a/net/rxrpc/conn_object.c b/net/rxrpc/conn_object.c
index 1539d315afe7..694c4df7a1a3 100644
--- a/net/rxrpc/conn_object.c
+++ b/net/rxrpc/conn_object.c
@@ -337,9 +337,7 @@ static void rxrpc_clean_up_connection(struct work_struct 
*work)
 */
rxrpc_purge_queue(&conn->rx_queue);
 
-   if (conn->tx_data_alloc.va)
-   __page_frag_cache_drain(virt_to_page(conn->tx_data_alloc.va),
-   conn->tx_data_alloc.pagecnt_bias);
+   page_frag_cache_drain(&conn->tx_data_alloc);
call_rcu(&conn->rcu, rxrpc_rcu_free_connection);
 }
 
diff --git a/net/rxrpc/local_object.c b/net/rxrpc/local_object.c
index 504453c688d7..a8cffe47cf01 100644
--- a/net/rxrpc/local_object.c
+++ b/net/rxrpc/local_object.c
@@ -452,9 +452,7 @@ void rxrpc_destroy_local(struct rxrpc_local *local)
 #endif
rxrpc_purge_queue(&local->rx_queue);
rxrpc_purge_client_connections(local);
-   if (local->tx_alloc.va)
-   __page_frag_cache_drain(virt_to_page(local->tx_alloc.va),
-   local->tx_alloc.pagecnt_bias);
+   page_frag_cache_drain(&local->tx_alloc);
 }
 
 /*
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 42d20412c1c3..4b1e87187614 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1609,7 +1609,6 @@ static void svc_tcp_sock_detach(struct svc_xprt *xprt)
 static void svc_sock_free(struct svc_xprt *xprt)
 {
struct svc_sock *svsk = container_of(xprt, struct svc_sock, sk_xprt);
-   struct page_frag_cache *pfc = &svsk->sk_frag_cache;
struct socket *sock = svsk->sk_sock;
 
 

[PATCH net-next v9 04/13] mm: page_frag: add '_va' suffix to page_frag API

2024-06-25 Thread Yunsheng Lin
Currently the page_frag API is returning 'virtual address'
or 'va' when allocating and expecting 'virtual address' or
'va' as input when freeing.

As we are about to support new use cases where the caller
needs to deal with 'struct page' or with both
'va' and 'struct page'. In order to differentiate the API
handling between 'va' and 'struct page', add '_va' suffix
to the corresponding API mirroring the page_pool_alloc_va()
API of the page_pool. So that callers expecting to deal with
va, page or both va and page may call page_frag_alloc_va*,
page_frag_alloc_pg*, or page_frag_alloc* API accordingly.

CC: Alexander Duyck 
Signed-off-by: Yunsheng Lin 
---
 drivers/net/ethernet/google/gve/gve_rx.c  |  4 ++--
 drivers/net/ethernet/intel/ice/ice_txrx.c |  2 +-
 drivers/net/ethernet/intel/ice/ice_txrx.h |  2 +-
 drivers/net/ethernet/intel/ice/ice_txrx_lib.c |  2 +-
 .../net/ethernet/intel/ixgbevf/ixgbevf_main.c |  4 ++--
 .../marvell/octeontx2/nic/otx2_common.c   |  2 +-
 drivers/net/ethernet/mediatek/mtk_wed_wo.c|  4 ++--
 drivers/nvme/host/tcp.c   |  8 +++
 drivers/nvme/target/tcp.c | 22 +--
 drivers/vhost/net.c   |  6 ++---
 include/linux/page_frag_cache.h   | 21 +-
 include/linux/skbuff.h|  2 +-
 kernel/bpf/cpumap.c   |  2 +-
 mm/page_frag_cache.c  | 12 +-
 mm/page_frag_test.c   | 13 ++-
 net/core/skbuff.c | 14 ++--
 net/core/xdp.c|  2 +-
 net/rxrpc/txbuf.c | 15 +++--
 net/sunrpc/svcsock.c  |  6 ++---
 19 files changed, 74 insertions(+), 69 deletions(-)

diff --git a/drivers/net/ethernet/google/gve/gve_rx.c 
b/drivers/net/ethernet/google/gve/gve_rx.c
index acb73d4d0de6..b6c10100e462 100644
--- a/drivers/net/ethernet/google/gve/gve_rx.c
+++ b/drivers/net/ethernet/google/gve/gve_rx.c
@@ -729,7 +729,7 @@ static int gve_xdp_redirect(struct net_device *dev, struct 
gve_rx_ring *rx,
 
total_len = headroom + SKB_DATA_ALIGN(len) +
SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
-   frame = page_frag_alloc(&rx->page_cache, total_len, GFP_ATOMIC);
+   frame = page_frag_alloc_va(&rx->page_cache, total_len, GFP_ATOMIC);
if (!frame) {
u64_stats_update_begin(&rx->statss);
rx->xdp_alloc_fails++;
@@ -742,7 +742,7 @@ static int gve_xdp_redirect(struct net_device *dev, struct 
gve_rx_ring *rx,
 
err = xdp_do_redirect(dev, &new, xdp_prog);
if (err)
-   page_frag_free(frame);
+   page_frag_free_va(frame);
 
return err;
 }
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c 
b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 8bb743f78fcb..399b317c509d 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -126,7 +126,7 @@ ice_unmap_and_free_tx_buf(struct ice_tx_ring *ring, struct 
ice_tx_buf *tx_buf)
dev_kfree_skb_any(tx_buf->skb);
break;
case ICE_TX_BUF_XDP_TX:
-   page_frag_free(tx_buf->raw_buf);
+   page_frag_free_va(tx_buf->raw_buf);
break;
case ICE_TX_BUF_XDP_XMIT:
xdp_return_frame(tx_buf->xdpf);
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h 
b/drivers/net/ethernet/intel/ice/ice_txrx.h
index feba314a3fe4..6379f57d8228 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
@@ -148,7 +148,7 @@ static inline int ice_skb_pad(void)
  * @ICE_TX_BUF_DUMMY: dummy Flow Director packet, unmap and kfree()
  * @ICE_TX_BUF_FRAG: mapped skb OR &xdp_buff frag, only unmap DMA
  * @ICE_TX_BUF_SKB: &sk_buff, unmap and consume_skb(), update stats
- * @ICE_TX_BUF_XDP_TX: &xdp_buff, unmap and page_frag_free(), stats
+ * @ICE_TX_BUF_XDP_TX: &xdp_buff, unmap and page_frag_free_va(), stats
  * @ICE_TX_BUF_XDP_XMIT: &xdp_frame, unmap and xdp_return_frame(), stats
  * @ICE_TX_BUF_XSK_TX: &xdp_buff on XSk queue, xsk_buff_free(), stats
  */
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c 
b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
index 2719f0e20933..a1a41a14df0d 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
@@ -250,7 +250,7 @@ ice_clean_xdp_tx_buf(struct device *dev, struct ice_tx_buf 
*tx_buf,
 
switch (tx_buf->type) {
case ICE_TX_BUF_XDP_TX:
-   page_frag_free(tx_buf->raw_buf);
+   page_frag_free_va(tx_buf->raw_buf);
break;
case ICE_TX_BUF_XDP_XMIT:
xdp_return_frame_bulk(tx_buf->xdpf, bq);
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ix

Re: [RFC PATCH v1 1/2] virtio/vsock: rework deferred credit update logic

2024-06-25 Thread Stefano Garzarella

On Fri, Jun 21, 2024 at 10:25:40PM GMT, Arseniy Krasnov wrote:

Previous calculation of 'free_space' was wrong (but worked as expected
in most cases, see below), because it didn't account for the number of bytes
in the rx queue. Let's rework the 'free_space' calculation in the following way:
as this value is considered free space at rx side from tx point of view,
it must be equal to return value of 'virtio_transport_get_credit()' at
tx side. This function uses 'tx_cnt' counter and 'peer_fwd_cnt': first
is number of transmitted bytes (without wrap), second is last 'fwd_cnt'
value received from rx. So let's use same approach at rx side during
'free_space' calculation: add 'rx_cnt' counter which is number of
received bytes (also without wrap) and subtract 'last_fwd_cnt' from it.
Now we have:
1) 'rx_cnt' == 'tx_cnt' at both sides.
2) 'last_fwd_cnt' == 'peer_fwd_cnt' - because first is last 'fwd_cnt'
  sent to tx, while second is last 'fwd_cnt' received from rx.

Now 'free_space' is handled correctly and also we don't need


mmm, I don't know if it was wrong before, maybe we could say it was less 
accurate.


That said, could we have the same problem now if we have a lot of 
producers and the virtqueue becomes full?



'low_rx_bytes' flag - this was more like a hack.

Previous calculation of 'free_space' worked (in 99% cases), because if
we take a look on behaviour of both expressions (new and previous):

'(rx_cnt - last_fwd_cnt)' and '(fwd_cnt - last_fwd_cnt)'

Both of them always grow, with almost the same "speed": the only difference
is that 'rx_cnt' is incremented earlier, when a packet is received,
while 'fwd_cnt' is incremented when a packet is read by the user. So if 'rx_cnt'
grows "faster", the resulting 'free_space' becomes smaller too, so we
send credit updates a little bit more, but:

 * 'free_space' calculation based on 'rx_cnt' gives the same value,
   which tx sees as free space at rx side, so original idea of


Ditto, what happen if the virtqueue is full?


   'free_space' is now implemented as planned.
 * Hack with 'low_rx_bytes' now is not needed.


Yeah, so this patch should also mitigate issue reported by Alex (added 
in CC), right?


If yes, please mention that problem and add a Reported-by giving credit 
to Alex.




Also here is some performance comparison between both versions of
'free_space' calculation:

*--*--*--*
|  | 'rx_cnt' | previous |
*--*--*--*
|H -> G|   8.42   |   7.82   |
*--*--*--*
|G -> H|   11.6   |   12.1   |
*--*--*--*


How many seconds did you run it? How many repetitions? There's a little 
discrepancy anyway, but I can't tell if it's just noise.




As benchmark 'vsock-iperf' with default arguments was used. There is no
significant performance difference before and after this patch.

Signed-off-by: Arseniy Krasnov 
---
include/linux/virtio_vsock.h| 1 +
net/vmw_vsock/virtio_transport_common.c | 8 +++-
2 files changed, 4 insertions(+), 5 deletions(-)


Thanks for working on this, I'll do more tests but the approach LGTM.

Thanks,
Stefano



diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index c82089dee0c8..3579491c411e 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -135,6 +135,7 @@ struct virtio_vsock_sock {
u32 peer_buf_alloc;

/* Protected by rx_lock */
+   u32 rx_cnt;
u32 fwd_cnt;
u32 last_fwd_cnt;
u32 rx_bytes;
diff --git a/net/vmw_vsock/virtio_transport_common.c 
b/net/vmw_vsock/virtio_transport_common.c
index 16ff976a86e3..1d4e2328e06e 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -441,6 +441,7 @@ static bool virtio_transport_inc_rx_pkt(struct 
virtio_vsock_sock *vvs,
return false;

vvs->rx_bytes += len;
+   vvs->rx_cnt += len;
return true;
}

@@ -558,7 +559,6 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
size_t bytes, total = 0;
struct sk_buff *skb;
u32 fwd_cnt_delta;
-   bool low_rx_bytes;
int err = -EFAULT;
u32 free_space;

@@ -603,9 +603,7 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
}

fwd_cnt_delta = vvs->fwd_cnt - vvs->last_fwd_cnt;
-   free_space = vvs->buf_alloc - fwd_cnt_delta;
-   low_rx_bytes = (vvs->rx_bytes <
-   sock_rcvlowat(sk_vsock(vsk), 0, INT_MAX));
+   free_space = vvs->buf_alloc - (vvs->rx_cnt - vvs->last_fwd_cnt);

spin_unlock_bh(&vvs->rx_lock);

@@ -619,7 +617,7 @@ virtio_transport_stream_do_dequeue(struct vsock_sock *vsk,
 * number of bytes in rx queue is not enough to wake up reader.
 */
if (fwd_cnt_delta &&
-   (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE || low_rx_bytes))
+   (free_space < VIRTIO_VSOCK_MAX_PKT_BUF_SIZE))
virtio_transport_send_credit_update(vsk);

return total;
--
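The counter relationship the patch relies on can be modelled in a few lines of plain C (a simplified, wrap-free model for illustration — not the kernel code): the rx side's 'rx_cnt'-based free-space estimate can only be smaller than or equal to the old 'fwd_cnt'-based one, since bytes are received before they are read.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model of the credit counters (no u32 wrap handling). */
struct vsock_credit {
	uint32_t buf_alloc;	/* rx buffer size advertised to tx */
	uint32_t rx_cnt;	/* bytes received into the rx queue */
	uint32_t fwd_cnt;	/* bytes consumed by the reader */
	uint32_t last_fwd_cnt;	/* last fwd_cnt forwarded to tx */
};

/* New calculation: mirrors what virtio_transport_get_credit()
 * computes at the tx side. */
static uint32_t free_space_new(const struct vsock_credit *v)
{
	return v->buf_alloc - (v->rx_cnt - v->last_fwd_cnt);
}

/* Previous calculation: ignores bytes still sitting in the rx queue. */
static uint32_t free_space_old(const struct vsock_credit *v)
{
	return v->buf_alloc - (v->fwd_cnt - v->last_fwd_cnt);
}
```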

Re: [PATCH v3] x86/paravirt: Disable virt spinlock on bare metal

2024-06-25 Thread Nikolay Borisov




On 25.06.24 г. 15:54 ч., Chen Yu wrote:

The kernel can change spinlock behavior when running as a guest. But
this guest-friendly behavior causes performance problems on bare metal.
So there's a 'virt_spin_lock_key' static key to switch between the two
modes.

The static key is always enabled by default (run in guest mode) and
should be disabled for bare metal (and in some guests that want native
behavior).

Performance drop is reported when running encode/decode workload and
BenchSEE cache sub-workload.
Bisect points to commit ce0a1b608bfc ("x86/paravirt: Silence unused
native_pv_lock_init() function warning"). When CONFIG_PARAVIRT_SPINLOCKS
is disabled the virt_spin_lock_key is incorrectly set to true on bare
metal. The qspinlock degenerates to a test-and-set spinlock, which
decreases performance on bare metal.

Set the default value of virt_spin_lock_key to false. If booting in a VM,
enable this key. Later during VM initialization, if another more
efficient spinlock is preferred (e.g. paravirt spinlock),
virt_spin_lock_key is disabled accordingly. The relation is described
below:

X86_FEATURE_HYPERVISOR     Y    Y    Y    N
CONFIG_PARAVIRT_SPINLOCKS  Y    Y    N    Y/N
PV spinlock                Y    N    N    Y/N

virt_spin_lock_key         N    N    Y    N

Fixes: ce0a1b608bfc ("x86/paravirt: Silence unused native_pv_lock_init() function 
warning")
Suggested-by: Dave Hansen 
Suggested-by: Qiuxu Zhuo 
Suggested-by: Nikolay Borisov 
Reported-by: Prem Nath Dey 
Reported-by: Xiaoping Zhou 
Signed-off-by: Chen Yu 
---
v2->v3:
   Change the default value of virt_spin_lock_key from true to false.
   Enable this key when it is in the VM, and disable it when needed.
   This makes the code more readable. (Nikolay Borisov)
   Dropped Reviewed-by because the code has been changed.
v1->v2:
   Refine the commit log per Dave's suggestion.
   Simplify the fix by directly disabling the virt_spin_lock_key on bare metal.
   Collect Reviewed-by from Juergen.
---
  arch/x86/include/asm/qspinlock.h | 4 ++--
  arch/x86/kernel/paravirt.c   | 7 +++
  2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index a053c1293975..a32bd2aabdf9 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -66,13 +66,13 @@ static inline bool vcpu_is_preempted(long cpu)
  
  #ifdef CONFIG_PARAVIRT

  /*
- * virt_spin_lock_key - enables (by default) the virt_spin_lock() hijack.
+ * virt_spin_lock_key - disables (by default) the virt_spin_lock() hijack.
   *
   * Native (and PV wanting native due to vCPU pinning) should disable this key.
   * It is done in this backwards fashion to only have a single direction 
change,
   * which removes ordering between native_pv_spin_init() and HV setup.
   */
-DECLARE_STATIC_KEY_TRUE(virt_spin_lock_key);
+DECLARE_STATIC_KEY_FALSE(virt_spin_lock_key);
  
  /*
   * Shortcut for the queued_spin_lock_slowpath() function that allows
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 5358d43886ad..fec381533555 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -51,13 +51,12 @@ DEFINE_ASM_FUNC(pv_native_irq_enable, "sti", .noinstr.text);
  DEFINE_ASM_FUNC(pv_native_read_cr2, "mov %cr2, %rax", .noinstr.text);
  #endif
  
-DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key);
+DEFINE_STATIC_KEY_FALSE(virt_spin_lock_key);
 
 void __init native_pv_lock_init(void)
 {
-   if (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) &&
Actually now shouldn't the CONFIG_PARAVIRT_SPINLOCKS check be retained? 
Otherwise we'll have the virtspinlock enabled even if we are a guest but 
CONFIG_PARAVIRT_SPINLOCKS is disabled, no ?



-   !boot_cpu_has(X86_FEATURE_HYPERVISOR))
-   static_branch_disable(&virt_spin_lock_key);
+   if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
+   static_branch_enable(&virt_spin_lock_key);
  }
  
  static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)




[PATCH v3] x86/paravirt: Disable virt spinlock on bare metal

2024-06-25 Thread Chen Yu
The kernel can change spinlock behavior when running as a guest. But
this guest-friendly behavior causes performance problems on bare metal.
So there's a 'virt_spin_lock_key' static key to switch between the two
modes.

The static key is always enabled by default (run in guest mode) and
should be disabled for bare metal (and in some guests that want native
behavior).

Performance drop is reported when running encode/decode workload and
BenchSEE cache sub-workload.
Bisect points to commit ce0a1b608bfc ("x86/paravirt: Silence unused
native_pv_lock_init() function warning"). When CONFIG_PARAVIRT_SPINLOCKS
is disabled, the virt_spin_lock_key is incorrectly set to true on bare
metal. The qspinlock degenerates to a test-and-set spinlock, which
decreases performance on bare metal.

Set the default value of virt_spin_lock_key to false. If booting in a VM,
enable this key. Later during VM initialization, if a more efficient
spinlock is preferred (e.g. the paravirt spinlock), the
virt_spin_lock_key is disabled accordingly. The relation is described
below:

X86_FEATURE_HYPERVISOR     Y    Y    Y    N
CONFIG_PARAVIRT_SPINLOCKS  Y    Y    N    Y/N
PV spinlock                Y    N    N    Y/N

virt_spin_lock_key         N    N    Y    N

Fixes: ce0a1b608bfc ("x86/paravirt: Silence unused native_pv_lock_init() 
function warning")
Suggested-by: Dave Hansen 
Suggested-by: Qiuxu Zhuo 
Suggested-by: Nikolay Borisov 
Reported-by: Prem Nath Dey 
Reported-by: Xiaoping Zhou 
Signed-off-by: Chen Yu 
---
v2->v3:
  Change the default value of virt_spin_lock_key from true to false.
  Enable this key when it is in the VM, and disable it when needed.
  This makes the code more readable. (Nikolay Borisov)
  Dropped Reviewed-by because the code has been changed.
v1->v2:
  Refine the commit log per Dave's suggestion.
  Simplify the fix by directly disabling the virt_spin_lock_key on bare metal.
  Collect Reviewed-by from Juergen.
---
 arch/x86/include/asm/qspinlock.h | 4 ++--
 arch/x86/kernel/paravirt.c   | 7 +++
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index a053c1293975..a32bd2aabdf9 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -66,13 +66,13 @@ static inline bool vcpu_is_preempted(long cpu)
 
 #ifdef CONFIG_PARAVIRT
 /*
- * virt_spin_lock_key - enables (by default) the virt_spin_lock() hijack.
+ * virt_spin_lock_key - disables (by default) the virt_spin_lock() hijack.
  *
  * Native (and PV wanting native due to vCPU pinning) should disable this key.
  * It is done in this backwards fashion to only have a single direction change,
  * which removes ordering between native_pv_spin_init() and HV setup.
  */
-DECLARE_STATIC_KEY_TRUE(virt_spin_lock_key);
+DECLARE_STATIC_KEY_FALSE(virt_spin_lock_key);
 
 /*
  * Shortcut for the queued_spin_lock_slowpath() function that allows
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 5358d43886ad..fec381533555 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -51,13 +51,12 @@ DEFINE_ASM_FUNC(pv_native_irq_enable, "sti", .noinstr.text);
 DEFINE_ASM_FUNC(pv_native_read_cr2, "mov %cr2, %rax", .noinstr.text);
 #endif
 
-DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key);
+DEFINE_STATIC_KEY_FALSE(virt_spin_lock_key);
 
 void __init native_pv_lock_init(void)
 {
-   if (IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) &&
-   !boot_cpu_has(X86_FEATURE_HYPERVISOR))
-   static_branch_disable(&virt_spin_lock_key);
+   if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
+   static_branch_enable(&virt_spin_lock_key);
 }
 
 static void native_tlb_remove_table(struct mmu_gather *tlb, void *table)
-- 
2.25.1




Re: [PATCH] x86/vmware: fix panic in vmware_hypercall_slow()

2024-06-25 Thread Borislav Petkov
On Tue, Jun 25, 2024 at 01:33:48AM -0700, Alexey Makhalov wrote:
> Callers of vmware_hypercall_slow() can pass NULL for out1,
> out2, ... out5, which leads to a NULL pointer dereference.
> 
> Check each pointer for NULL before assigning a value.

I queue your patches and *now* you find this?!

How did you test them in the first place and why was this scenario missed?

Geez.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette



[PATCH] x86/vmware: fix panic in vmware_hypercall_slow()

2024-06-25 Thread Alexey Makhalov
Callers of vmware_hypercall_slow() can pass NULL for out1,
out2, ... out5, which leads to a NULL pointer dereference.

Check each pointer for NULL before assigning a value.

Fixes: 666cbb562d05d ("x86/vmware: Introduce VMware hypercall API")
Co-developed-by: Alex James 
Signed-off-by: Alex James 
Signed-off-by: Alexey Makhalov 
---
 arch/x86/kernel/cpu/vmware.c | 26 +++---
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/cpu/vmware.c b/arch/x86/kernel/cpu/vmware.c
index 55903563afd3..16da970499f2 100644
--- a/arch/x86/kernel/cpu/vmware.c
+++ b/arch/x86/kernel/cpu/vmware.c
@@ -72,13 +72,13 @@ unsigned long vmware_hypercall_slow(unsigned long cmd,
u32 *out1, u32 *out2, u32 *out3,
u32 *out4, u32 *out5)
 {
-   unsigned long out0;
+   unsigned long out0, rbx, rcx, rdx, rsi, rdi;
 
switch (vmware_hypercall_mode) {
case CPUID_VMWARE_FEATURES_ECX_VMCALL:
asm_inline volatile ("vmcall"
-   : "=a" (out0), "=b" (*out1), "=c" (*out2),
-   "=d" (*out3), "=S" (*out4), "=D" (*out5)
+   : "=a" (out0), "=b" (rbx), "=c" (rcx),
+   "=d" (rdx), "=S" (rsi), "=D" (rdi)
: "a" (VMWARE_HYPERVISOR_MAGIC),
"b" (in1),
"c" (cmd),
@@ -89,8 +89,8 @@ unsigned long vmware_hypercall_slow(unsigned long cmd,
break;
case CPUID_VMWARE_FEATURES_ECX_VMMCALL:
asm_inline volatile ("vmmcall"
-   : "=a" (out0), "=b" (*out1), "=c" (*out2),
-   "=d" (*out3), "=S" (*out4), "=D" (*out5)
+   : "=a" (out0), "=b" (rbx), "=c" (rcx),
+   "=d" (rdx), "=S" (rsi), "=D" (rdi)
: "a" (VMWARE_HYPERVISOR_MAGIC),
"b" (in1),
"c" (cmd),
@@ -101,8 +101,8 @@ unsigned long vmware_hypercall_slow(unsigned long cmd,
break;
default:
asm_inline volatile ("movw %[port], %%dx; inl (%%dx), %%eax"
-   : "=a" (out0), "=b" (*out1), "=c" (*out2),
-   "=d" (*out3), "=S" (*out4), "=D" (*out5)
+   : "=a" (out0), "=b" (rbx), "=c" (rcx),
+   "=d" (rdx), "=S" (rsi), "=D" (rdi)
: [port] "i" (VMWARE_HYPERVISOR_PORT),
"a" (VMWARE_HYPERVISOR_MAGIC),
"b" (in1),
@@ -113,6 +113,18 @@ unsigned long vmware_hypercall_slow(unsigned long cmd,
: "cc", "memory");
break;
}
+
+   if (out1)
+   *out1 = rbx;
+   if (out2)
+   *out2 = rcx;
+   if (out3)
+   *out3 = rdx;
+   if (out4)
+   *out4 = rsi;
+   if (out5)
+   *out5 = rdi;
+
return out0;
 }
 
-- 
2.39.4




Re: [PATCH V2 3/3] virtio-net: synchronize operstate with admin state on up/down

2024-06-25 Thread Michael S. Tsirkin
On Tue, Jun 25, 2024 at 04:11:05PM +0800, Jason Wang wrote:
> On Tue, Jun 25, 2024 at 3:57 PM Michael S. Tsirkin  wrote:
> >
> > On Tue, Jun 25, 2024 at 03:46:44PM +0800, Jason Wang wrote:
> > > Workqueue is used to serialize those so we won't lose any change.
> >
> > So we don't need to re-read then?
> >
> 
> We might have to re-read but I don't get why it is a problem for us.
> 
> Thanks

I don't think each ethtool command should force a full config read,
is what I mean. Only do it if really needed.

-- 
MST




Re: [PATCH V2 1/3] virtio: allow nested disabling of the configure interrupt

2024-06-25 Thread Michael S. Tsirkin
On Tue, Jun 25, 2024 at 04:18:00PM +0800, Jason Wang wrote:
> > > >
> > > >
> > > >
> > > > But in conclusion ;) if you don't like my suggestion do something else
> > > > but make the APIs make sense,
> > >
> > > I don't say I don't like it:)
> > >
> > > Limiting it to virtio-net seems to be the most easy way. And if we
> > > want to do it in the core, I just want to make nesting to be supported
> > > which might not be necessary now.
> >
> > I feel limiting it to a single driver strikes the right balance ATM.
> 
> Just to make sure I understand here, should we go back to v1 or go
> with the config_driver_disabled?
> 
> Thanks


I still like config_driver_disabled.


> >
> > >
> > > > at least do better than +5
> > > > on Rusty's interface design scale.
> > > >
> > > > >
> > >
> > > Thanks
> > >
> > >
> > > > >
> > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > > @@ -455,7 +461,7 @@ int register_virtio_device(struct 
> > > > > > > virtio_device *dev)
> > > > > > >   goto out_ida_remove;
> > > > > > >
> > > > > > >   spin_lock_init(&dev->config_lock);
> > > > > > > - dev->config_enabled = false;
> > > > > > > + dev->config_enabled = 0;
> > > > > > >   dev->config_change_pending = false;
> > > > > > >
> > > > > > >   INIT_LIST_HEAD(&dev->vqs);
> > > > > > > diff --git a/include/linux/virtio.h b/include/linux/virtio.h
> > > > > > > index 96fea920873b..4496f9ba5d82 100644
> > > > > > > --- a/include/linux/virtio.h
> > > > > > > +++ b/include/linux/virtio.h
> > > > > > > @@ -132,7 +132,7 @@ struct virtio_admin_cmd {
> > > > > > >  struct virtio_device {
> > > > > > >   int index;
> > > > > > >   bool failed;
> > > > > > > - bool config_enabled;
> > > > > > > + int config_enabled;
> > > > > > >   bool config_change_pending;
> > > > > > >   spinlock_t config_lock;
> > > > > > >   spinlock_t vqs_list_lock;
> > > > > > > --
> > > > > > > 2.31.1
> > > > > >
> > > >
> >




Re: [PATCH V2 1/3] virtio: allow nested disabling of the configure interrupt

2024-06-25 Thread Jason Wang
On Tue, Jun 25, 2024 at 4:04 PM Michael S. Tsirkin  wrote:
>
> On Tue, Jun 25, 2024 at 03:50:30PM +0800, Jason Wang wrote:
> > On Tue, Jun 25, 2024 at 3:11 PM Michael S. Tsirkin  wrote:
> > >
> > > On Tue, Jun 25, 2024 at 09:27:04AM +0800, Jason Wang wrote:
> > > > On Mon, Jun 24, 2024 at 5:59 PM Michael S. Tsirkin  
> > > > wrote:
> > > > >
> > > > > On Mon, Jun 24, 2024 at 10:45:21AM +0800, Jason Wang wrote:
> > > > > > Sometimes a driver may want to enable or disable the config
> > > > > > callback. This requires synchronization with the core. So this
> > > > > > patch changes config_enabled to an integer counter. This allows
> > > > > > the toggling of config_enabled to be synchronized between the
> > > > > > virtio core and the virtio driver.
> > > > > >
> > > > > > The counter is not allowed to increase beyond one; this
> > > > > > simplifies the logic, so the interrupt can be disabled
> > > > > > immediately without extra synchronization between driver and core.
> > > > > >
> > > > > > Signed-off-by: Jason Wang 
> > > > > > ---
> > > > > >  drivers/virtio/virtio.c | 20 +---
> > > > > >  include/linux/virtio.h  |  2 +-
> > > > > >  2 files changed, 14 insertions(+), 8 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
> > > > > > index b968b2aa5f4d..d3aa74b8ae5d 100644
> > > > > > --- a/drivers/virtio/virtio.c
> > > > > > +++ b/drivers/virtio/virtio.c
> > > > > > @@ -127,7 +127,7 @@ static void __virtio_config_changed(struct 
> > > > > > virtio_device *dev)
> > > > > >  {
> > > > > >   struct virtio_driver *drv = drv_to_virtio(dev->dev.driver);
> > > > > >
> > > > > > - if (!dev->config_enabled)
> > > > > > + if (dev->config_enabled < 1)
> > > > > >   dev->config_change_pending = true;
> > > > > >   else if (drv && drv->config_changed)
> > > > > >   drv->config_changed(dev);
> > > > > > @@ -146,17 +146,23 @@ EXPORT_SYMBOL_GPL(virtio_config_changed);
> > > > > >  static void virtio_config_disable(struct virtio_device *dev)
> > > > > >  {
> > > > > >   spin_lock_irq(&dev->config_lock);
> > > > > > - dev->config_enabled = false;
> > > > > > + --dev->config_enabled;
> > > > > >   spin_unlock_irq(&dev->config_lock);
> > > > > >  }
> > > > > >
> > > > > >  static void virtio_config_enable(struct virtio_device *dev)
> > > > > >  {
> > > > > >   spin_lock_irq(&dev->config_lock);
> > > > > > - dev->config_enabled = true;
> > > > > > - if (dev->config_change_pending)
> > > > > > - __virtio_config_changed(dev);
> > > > > > - dev->config_change_pending = false;
> > > > > > +
> > > > > > + if (dev->config_enabled < 1) {
> > > > > > + ++dev->config_enabled;
> > > > > > + if (dev->config_enabled == 1 &&
> > > > > > + dev->config_change_pending) {
> > > > > > + __virtio_config_changed(dev);
> > > > > > + dev->config_change_pending = false;
> > > > > > + }
> > > > > > + }
> > > > > > +
> > > > > >   spin_unlock_irq(&dev->config_lock);
> > > > > >  }
> > > > > >
> > > > >
> > > > > So every disable decrements the counter. Enable only increments it up 
> > > > > to 1.
> > > > > You seem to be making some very specific assumptions
> > > > > about how this API will be used. Any misuse will lead to 
> > > > > under/overflow
> > > > > eventually ...
> > > > >
> > > >
> > > > Well, a counter gives us more information than a boolean. With
> > > > boolean, misuse is even harder to be noticed.
> > >
> > > With boolean we can prevent misuse easily because previous state
> > > is known exactly. E.g.:
> > >
> > > static void virtio_config_driver_disable(struct virtio_device *dev)
> > > {
> > > BUG_ON(dev->config_driver_disabled);
> > > dev->config_driver_disabled = true;
> > > }
> > >
> > >
> > >
> > > static void virtio_config_driver_enable(struct virtio_device *dev)
> > > {
> > > BUG_ON(!dev->config_driver_disabled);
> > > dev->config_driver_disabled = false;
> > > }
> > >
> > >
> > > Does not work with integer you simply have no idea what the value
> > > should be at point of call.
> >
> > Yes but I meant if we want the config could be disabled by different
> > parties (core, driver and others)
>
> For now, we don't have others ;)
>
> > >
> > >
> > > > >
> > > > >
> > > > > My suggestion would be to
> > > > > 1. rename config_enabled to config_core_enabled
> > > > > 2. rename virtio_config_enable/disable to 
> > > > > virtio_config_core_enable/disable
> > > > > 3. add bool config_driver_disabled and make 
> > > > > virtio_config_enable/disable
> > > > >switch that.
> > > > > 4. Change logic from dev->config_enabled to
> > > > >dev->config_core_enabled && !dev->config_driver_disabled
> > > >
> > > > If we make config_driver_disabled by default true,
> > >
> > > No, we make it false by default.
>

Re: [PATCH V2 3/3] virtio-net: synchronize operstate with admin state on up/down

2024-06-25 Thread Jason Wang
On Tue, Jun 25, 2024 at 3:57 PM Michael S. Tsirkin  wrote:
>
> On Tue, Jun 25, 2024 at 03:46:44PM +0800, Jason Wang wrote:
> > Workqueue is used to serialize those so we won't lose any change.
>
> So we don't need to re-read then?
>

We might have to re-read but I don't get why it is a problem for us.

Thanks




Re: [PATCH] vfs: Add AT_EMPTY_PATH_NOCHECK as unchecked AT_EMPTY_PATH

2024-06-25 Thread Christian Brauner
On Sat, Jun 22, 2024 at 03:41:19PM GMT, Linus Torvalds wrote:
> On Sat, 22 Jun 2024 at 14:25, Mateusz Guzik  wrote:
> >
> > +cc Linus
> 
> Thanks.
> 
> > To sum up the problem: stat and statx met with "" + AT_EMPTY_PATH have
> > more work to do than fstat and its hypotethical statx counterpart:
> > - buf alloc/free for the path
> > - userspace access (very painful on x86_64 + SMAP)
> > - lockref acquire/release
> 
> Yes. That LOOKUP_EMPTY_NOCHECK is *not* the fix.

I have explicitly NAKed an approach like this months ago in [1]. So I'm
not sure why we're getting that patch in the first place tbh.

[1]: 
https://lore.kernel.org/lkml/599df4a3-47a4-49be-9c81-8e21ea1f9...@xen0n.name



Re: [PATCH V2 1/3] virtio: allow nested disabling of the configure interrupt

2024-06-25 Thread Michael S. Tsirkin
On Tue, Jun 25, 2024 at 03:50:30PM +0800, Jason Wang wrote:
> On Tue, Jun 25, 2024 at 3:11 PM Michael S. Tsirkin  wrote:
> >
> > On Tue, Jun 25, 2024 at 09:27:04AM +0800, Jason Wang wrote:
> > > On Mon, Jun 24, 2024 at 5:59 PM Michael S. Tsirkin  
> > > wrote:
> > > >
> > > > On Mon, Jun 24, 2024 at 10:45:21AM +0800, Jason Wang wrote:
> > > > > Sometimes a driver may want to enable or disable the config callback.
> > > > > This requires synchronization with the core. So this patch changes
> > > > > config_enabled to an integer counter. This allows the toggling of
> > > > > config_enabled to be synchronized between the virtio core and the
> > > > > virtio driver.
> > > > >
> > > > > The counter is not allowed to increase beyond one; this simplifies
> > > > > the logic, so the interrupt can be disabled immediately
> > > > > without extra synchronization between driver and core.
> > > > >
> > > > > Signed-off-by: Jason Wang 
> > > > > ---
> > > > >  drivers/virtio/virtio.c | 20 +---
> > > > >  include/linux/virtio.h  |  2 +-
> > > > >  2 files changed, 14 insertions(+), 8 deletions(-)
> > > > >
> > > > > diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
> > > > > index b968b2aa5f4d..d3aa74b8ae5d 100644
> > > > > --- a/drivers/virtio/virtio.c
> > > > > +++ b/drivers/virtio/virtio.c
> > > > > @@ -127,7 +127,7 @@ static void __virtio_config_changed(struct 
> > > > > virtio_device *dev)
> > > > >  {
> > > > >   struct virtio_driver *drv = drv_to_virtio(dev->dev.driver);
> > > > >
> > > > > - if (!dev->config_enabled)
> > > > > + if (dev->config_enabled < 1)
> > > > >   dev->config_change_pending = true;
> > > > >   else if (drv && drv->config_changed)
> > > > >   drv->config_changed(dev);
> > > > > @@ -146,17 +146,23 @@ EXPORT_SYMBOL_GPL(virtio_config_changed);
> > > > >  static void virtio_config_disable(struct virtio_device *dev)
> > > > >  {
> > > > >   spin_lock_irq(&dev->config_lock);
> > > > > - dev->config_enabled = false;
> > > > > + --dev->config_enabled;
> > > > >   spin_unlock_irq(&dev->config_lock);
> > > > >  }
> > > > >
> > > > >  static void virtio_config_enable(struct virtio_device *dev)
> > > > >  {
> > > > >   spin_lock_irq(&dev->config_lock);
> > > > > - dev->config_enabled = true;
> > > > > - if (dev->config_change_pending)
> > > > > - __virtio_config_changed(dev);
> > > > > - dev->config_change_pending = false;
> > > > > +
> > > > > + if (dev->config_enabled < 1) {
> > > > > + ++dev->config_enabled;
> > > > > + if (dev->config_enabled == 1 &&
> > > > > + dev->config_change_pending) {
> > > > > + __virtio_config_changed(dev);
> > > > > + dev->config_change_pending = false;
> > > > > + }
> > > > > + }
> > > > > +
> > > > >   spin_unlock_irq(&dev->config_lock);
> > > > >  }
> > > > >
> > > >
> > > > So every disable decrements the counter. Enable only increments it up 
> > > > to 1.
> > > > You seem to be making some very specific assumptions
> > > > about how this API will be used. Any misuse will lead to under/overflow
> > > > eventually ...
> > > >
> > >
> > > Well, a counter gives us more information than a boolean. With
> > > boolean, misuse is even harder to be noticed.
> >
> > With boolean we can prevent misuse easily because previous state
> > is known exactly. E.g.:
> >
> > static void virtio_config_driver_disable(struct virtio_device *dev)
> > {
> > BUG_ON(dev->config_driver_disabled);
> > dev->config_driver_disabled = true;
> > }
> >
> >
> >
> > static void virtio_config_driver_enable(struct virtio_device *dev)
> > {
> > BUG_ON(!dev->config_driver_disabled);
> > dev->config_driver_disabled = false;
> > }
> >
> >
> > Does not work with integer you simply have no idea what the value
> > should be at point of call.
> 
> Yes but I meant if we want the config could be disabled by different
> parties (core, driver and others)

For now, we don't have others ;)

> >
> >
> > > >
> > > >
> > > > My suggestion would be to
> > > > 1. rename config_enabled to config_core_enabled
> > > > 2. rename virtio_config_enable/disable to 
> > > > virtio_config_core_enable/disable
> > > > 3. add bool config_driver_disabled and make virtio_config_enable/disable
> > > >switch that.
> > > > 4. Change logic from dev->config_enabled to
> > > >dev->config_core_enabled && !dev->config_driver_disabled
> > >
> > > If we make config_driver_disabled by default true,
> >
> > No, we make it false by default.
> >
> > > we need someone to
> > > enable it explicitly. If it's core, it breaks the semantic that it is
> > > under the control of the driver (or needs to synchronize with the
> > > driver). If it's a driver, each driver needs to enable it at some time
> > > which can be easily forgotten. And if we end up with workarounds like:
> > >
> > > 

Re: [PATCH V2 3/3] virtio-net: synchronize operstate with admin state on up/down

2024-06-25 Thread Michael S. Tsirkin
On Tue, Jun 25, 2024 at 03:46:44PM +0800, Jason Wang wrote:
> Workqueue is used to serialize those so we won't lose any change.

So we don't need to re-read then?




Re: [PATCH V2 1/3] virtio: allow nested disabling of the configure interrupt

2024-06-25 Thread Jason Wang
On Tue, Jun 25, 2024 at 3:11 PM Michael S. Tsirkin  wrote:
>
> On Tue, Jun 25, 2024 at 09:27:04AM +0800, Jason Wang wrote:
> > On Mon, Jun 24, 2024 at 5:59 PM Michael S. Tsirkin  wrote:
> > >
> > > On Mon, Jun 24, 2024 at 10:45:21AM +0800, Jason Wang wrote:
> > > > Sometimes a driver may want to enable or disable the config callback.
> > > > This requires synchronization with the core. So this patch changes
> > > > config_enabled to an integer counter. This allows the toggling of
> > > > config_enabled to be synchronized between the virtio core and the
> > > > virtio driver.
> > > >
> > > > The counter is not allowed to increase beyond one; this simplifies
> > > > the logic, so the interrupt can be disabled immediately
> > > > without extra synchronization between driver and core.
> > > >
> > > > Signed-off-by: Jason Wang 
> > > > ---
> > > >  drivers/virtio/virtio.c | 20 +---
> > > >  include/linux/virtio.h  |  2 +-
> > > >  2 files changed, 14 insertions(+), 8 deletions(-)
> > > >
> > > > diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
> > > > index b968b2aa5f4d..d3aa74b8ae5d 100644
> > > > --- a/drivers/virtio/virtio.c
> > > > +++ b/drivers/virtio/virtio.c
> > > > @@ -127,7 +127,7 @@ static void __virtio_config_changed(struct 
> > > > virtio_device *dev)
> > > >  {
> > > >   struct virtio_driver *drv = drv_to_virtio(dev->dev.driver);
> > > >
> > > > - if (!dev->config_enabled)
> > > > + if (dev->config_enabled < 1)
> > > >   dev->config_change_pending = true;
> > > >   else if (drv && drv->config_changed)
> > > >   drv->config_changed(dev);
> > > > @@ -146,17 +146,23 @@ EXPORT_SYMBOL_GPL(virtio_config_changed);
> > > >  static void virtio_config_disable(struct virtio_device *dev)
> > > >  {
> > > >   spin_lock_irq(&dev->config_lock);
> > > > - dev->config_enabled = false;
> > > > + --dev->config_enabled;
> > > >   spin_unlock_irq(&dev->config_lock);
> > > >  }
> > > >
> > > >  static void virtio_config_enable(struct virtio_device *dev)
> > > >  {
> > > >   spin_lock_irq(&dev->config_lock);
> > > > - dev->config_enabled = true;
> > > > - if (dev->config_change_pending)
> > > > - __virtio_config_changed(dev);
> > > > - dev->config_change_pending = false;
> > > > +
> > > > + if (dev->config_enabled < 1) {
> > > > + ++dev->config_enabled;
> > > > + if (dev->config_enabled == 1 &&
> > > > + dev->config_change_pending) {
> > > > + __virtio_config_changed(dev);
> > > > + dev->config_change_pending = false;
> > > > + }
> > > > + }
> > > > +
> > > >   spin_unlock_irq(&dev->config_lock);
> > > >  }
> > > >
> > >
> > > So every disable decrements the counter. Enable only increments it up to 
> > > 1.
> > > You seem to be making some very specific assumptions
> > > about how this API will be used. Any misuse will lead to under/overflow
> > > eventually ...
> > >
> >
> > Well, a counter gives us more information than a boolean. With
> > boolean, misuse is even harder to be noticed.
>
> With boolean we can prevent misuse easily because previous state
> is known exactly. E.g.:
>
> static void virtio_config_driver_disable(struct virtio_device *dev)
> {
> BUG_ON(dev->config_driver_disabled);
> dev->config_driver_disabled = true;
> }
>
>
>
> static void virtio_config_driver_enable(struct virtio_device *dev)
> {
> BUG_ON(!dev->config_driver_disabled);
> dev->config_driver_disabled = false;
> }
>
>
> Does not work with integer you simply have no idea what the value
> should be at point of call.

Yes but I meant if we want the config could be disabled by different
parties (core, driver and others)

>
>
> > >
> > >
> > > My suggestion would be to
> > > 1. rename config_enabled to config_core_enabled
> > > 2. rename virtio_config_enable/disable to 
> > > virtio_config_core_enable/disable
> > > 3. add bool config_driver_disabled and make virtio_config_enable/disable
> > >switch that.
> > > 4. Change logic from dev->config_enabled to
> > >dev->config_core_enabled && !dev->config_driver_disabled
> >
> > If we make config_driver_disabled by default true,
>
> No, we make it false by default.
>
> > we need someone to
> > enable it explicitly. If it's core, it breaks the semantic that it is
> > under the control of the driver (or needs to synchronize with the
> > driver). If it's a driver, each driver needs to enable it at some time
> > which can be easily forgotten. And if we end up with workarounds like:
> >
> > /* If probe didn't do it, mark device DRIVER_OK ourselves. */
> > if (!(dev->config->get_status(dev) & VIRTIO_CONFIG_S_DRIVER_OK))
> > virtio_device_ready(dev);
> >
> > It's another break of the semantics. And actually the above is also racy.
> >
> > It seems the only choice is to make config_driver_disabled by defaul

Re: [PATCH V2 3/3] virtio-net: synchronize operstate with admin state on up/down

2024-06-25 Thread Jason Wang
On Tue, Jun 25, 2024 at 3:16 PM Michael S. Tsirkin  wrote:
>
> On Tue, Jun 25, 2024 at 09:27:38AM +0800, Jason Wang wrote:
> > On Mon, Jun 24, 2024 at 6:07 PM Michael S. Tsirkin  wrote:
> > >
> > > On Mon, Jun 24, 2024 at 10:45:23AM +0800, Jason Wang wrote:
> > > > This patch synchronizes operstate with admin state per RFC2863.
> > > >
> > > > This is done by trying to toggle the carrier upon open/close and
> > > > synchronizing with the config change work. This allows propagating
> > > > status correctly to stacked devices like:
> > > >
> > > > ip link add link enp0s3 macvlan0 type macvlan
> > > > ip link set link enp0s3 down
> > > > ip link show
> > > >
> > > > Before this patch:
> > > >
> > > > 3: enp0s3:  mtu 1500 qdisc pfifo_fast state DOWN 
> > > > mode DEFAULT group default qlen 1000
> > > > link/ether 00:00:05:00:00:09 brd ff:ff:ff:ff:ff:ff
> > > > ..
> > > > 5: macvlan0@enp0s3:  mtu 1500 
> > > > qdisc noqueue state UP mode DEFAULT group default qlen 1000
> > > > link/ether b2:a9:c5:04:da:53 brd ff:ff:ff:ff:ff:ff
> > > >
> > > > After this patch:
> > > >
> > > > 3: enp0s3:  mtu 1500 qdisc pfifo_fast state DOWN 
> > > > mode DEFAULT group default qlen 1000
> > > > link/ether 00:00:05:00:00:09 brd ff:ff:ff:ff:ff:ff
> > > > ...
> > > > 5: macvlan0@enp0s3:  mtu 1500 
> > > > qdisc noqueue state LOWERLAYERDOWN mode DEFAULT group default qlen 1000
> > > > link/ether b2:a9:c5:04:da:53 brd ff:ff:ff:ff:ff:ff
> > > >
> > > > Cc: Venkat Venkatsubra 
> > > > Cc: Gia-Khanh Nguyen 
> > > > Signed-off-by: Jason Wang 
> > > > ---
> > > >  drivers/net/virtio_net.c | 72 +++-
> > > >  1 file changed, 42 insertions(+), 30 deletions(-)
> > > >
> > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > index b1f8b720733e..eff3ad3d6bcc 100644
> > > > --- a/drivers/net/virtio_net.c
> > > > +++ b/drivers/net/virtio_net.c
> > > > @@ -2468,6 +2468,25 @@ static int virtnet_enable_queue_pair(struct 
> > > > virtnet_info *vi, int qp_index)
> > > >   return err;
> > > >  }
> > > >
> > > > +static void virtnet_update_settings(struct virtnet_info *vi)
> > > > +{
> > > > + u32 speed;
> > > > + u8 duplex;
> > > > +
> > > > + if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_SPEED_DUPLEX))
> > > > + return;
> > > > +
> > > > + virtio_cread_le(vi->vdev, struct virtio_net_config, speed, 
> > > > &speed);
> > > > +
> > > > + if (ethtool_validate_speed(speed))
> > > > + vi->speed = speed;
> > > > +
> > > > + virtio_cread_le(vi->vdev, struct virtio_net_config, duplex, 
> > > > &duplex);
> > > > +
> > > > + if (ethtool_validate_duplex(duplex))
> > > > + vi->duplex = duplex;
> > > > +}
> > > > +
> > > >  static int virtnet_open(struct net_device *dev)
> > > >  {
> > > >   struct virtnet_info *vi = netdev_priv(dev);
> > > > @@ -2486,6 +2505,22 @@ static int virtnet_open(struct net_device *dev)
> > > >   goto err_enable_qp;
> > > >   }
> > > >
> > > > + /* Assume link up if device can't report link status,
> > > > +otherwise get link status from config. */
> > > > + netif_carrier_off(dev);
> > > > + if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_STATUS)) {
> > > > + virtio_config_enable(vi->vdev);
> > > > + /* We are not sure if config interrupt is disabled by
> > > > +  * core or not, so we can't schedule config_work by
> > > > +  * ourselves.
> > > > +  */
> > >
> > > This comment confuses more than it explains.
> > > You seem to be arguing about some alternative design
> > > you had in mind, but readers don't have it in mind.
> > >
> > >
> > > Please just explain what this does and why.
> > > For what: something like "Trigger re-read of config - same
> > > as we'd do if config changed".
> > >
> > > Now, please do what you don't do here: explain the why:
> > >
> > >
> > > why do we want all these VM
> > > exits on each open/close as opposed to once on probe and later on
> > > config changed interrupt.
> >
> > Fine, the main reason is that a config interrupt might be pending
> > during ifdown, and the core may disable the config interrupt for
> > several reasons.
> >
> > Thanks
>
> What if the config changes exactly as the command is executing?
> Then we'll get an interrupt later and update.

A workqueue is used to serialize those, so we won't lose any change.

Currently the interrupt is only used to:

1) update the link status; according to RFC 2863, there is no need to
do that when the interface is admin down
2) announce the link; it is likewise meaningless to announce the link
when the interface is down.

Or is there anything I missed here?

> You can't always win this race; even if you read it, it can
> change right after.

Thanks

>
>
> >
> > >
> > >
> > > > + virtio_config_changed(vi->vdev);
> > > > + } else {
> > > > + vi->status = VIRTIO_NET_S_LINK_UP;
> > > > + virtnet_update_settings(vi);
> > > >

Re: [PATCH V2 3/3] virtio-net: synchronize operstate with admin state on up/down

2024-06-25 Thread Michael S. Tsirkin
On Tue, Jun 25, 2024 at 09:27:38AM +0800, Jason Wang wrote:
> On Mon, Jun 24, 2024 at 6:07 PM Michael S. Tsirkin  wrote:
> >
> > On Mon, Jun 24, 2024 at 10:45:23AM +0800, Jason Wang wrote:
> > > This patch synchronizes operstate with admin state per RFC 2863.
> > >
> > > This is done by toggling the carrier upon open/close and
> > > synchronizing with the config change work. This allows status to be
> > > propagated correctly to stacked devices like:
> > >
> > > ip link add link enp0s3 macvlan0 type macvlan
> > > ip link set link enp0s3 down
> > > ip link show
> > >
> > > Before this patch:
> > >
> > > 3: enp0s3:  mtu 1500 qdisc pfifo_fast state DOWN 
> > > mode DEFAULT group default qlen 1000
> > > link/ether 00:00:05:00:00:09 brd ff:ff:ff:ff:ff:ff
> > > ..
> > > 5: macvlan0@enp0s3:  mtu 1500 
> > > qdisc noqueue state UP mode DEFAULT group default qlen 1000
> > > link/ether b2:a9:c5:04:da:53 brd ff:ff:ff:ff:ff:ff
> > >
> > > After this patch:
> > >
> > > 3: enp0s3:  mtu 1500 qdisc pfifo_fast state DOWN 
> > > mode DEFAULT group default qlen 1000
> > > link/ether 00:00:05:00:00:09 brd ff:ff:ff:ff:ff:ff
> > > ...
> > > 5: macvlan0@enp0s3:  mtu 1500 
> > > qdisc noqueue state LOWERLAYERDOWN mode DEFAULT group default qlen 1000
> > > link/ether b2:a9:c5:04:da:53 brd ff:ff:ff:ff:ff:ff
> > >
> > > Cc: Venkat Venkatsubra 
> > > Cc: Gia-Khanh Nguyen 
> > > Signed-off-by: Jason Wang 
> > > ---
> > >  drivers/net/virtio_net.c | 72 +++-
> > >  1 file changed, 42 insertions(+), 30 deletions(-)
> > >
> > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > index b1f8b720733e..eff3ad3d6bcc 100644
> > > --- a/drivers/net/virtio_net.c
> > > +++ b/drivers/net/virtio_net.c
> > > @@ -2468,6 +2468,25 @@ static int virtnet_enable_queue_pair(struct 
> > > virtnet_info *vi, int qp_index)
> > >   return err;
> > >  }
> > >
> > > +static void virtnet_update_settings(struct virtnet_info *vi)
> > > +{
> > > + u32 speed;
> > > + u8 duplex;
> > > +
> > > + if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_SPEED_DUPLEX))
> > > + return;
> > > +
> > > + virtio_cread_le(vi->vdev, struct virtio_net_config, speed, &speed);
> > > +
> > > + if (ethtool_validate_speed(speed))
> > > + vi->speed = speed;
> > > +
> > > + virtio_cread_le(vi->vdev, struct virtio_net_config, duplex, 
> > > &duplex);
> > > +
> > > + if (ethtool_validate_duplex(duplex))
> > > + vi->duplex = duplex;
> > > +}
> > > +
> > >  static int virtnet_open(struct net_device *dev)
> > >  {
> > >   struct virtnet_info *vi = netdev_priv(dev);
> > > @@ -2486,6 +2505,22 @@ static int virtnet_open(struct net_device *dev)
> > >   goto err_enable_qp;
> > >   }
> > >
> > > + /* Assume link up if device can't report link status,
> > > +otherwise get link status from config. */
> > > + netif_carrier_off(dev);
> > > + if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_STATUS)) {
> > > + virtio_config_enable(vi->vdev);
> > > + /* We are not sure if config interrupt is disabled by
> > > +  * core or not, so we can't schedule config_work by
> > > +  * ourselves.
> > > +  */
> >
> > This comment confuses more than it explains.
> > You seem to be arguing about some alternative design
> > you had in mind, but readers don't have it in mind.
> >
> >
> > Please just explain what this does and why.
> > For what: something like "Trigger re-read of config - same
> > as we'd do if config changed".
> >
> > Now, please do what you don't do here: explain the why:
> >
> >
> > why do we want all these VM
> > exits on each open/close as opposed to once on probe and later on
> > config changed interrupt.
> 
> Fine, the main reason is that a config interrupt might be pending
> during ifdown, and the core may disable the config interrupt for
> several reasons.
> 
> Thanks

What if the config changes exactly as the command is executing?
Then we'll get an interrupt later and update.
You can't always win this race; even if you read it, it can
change right after.


> 
> >
> >
> > > + virtio_config_changed(vi->vdev);
> > > + } else {
> > > + vi->status = VIRTIO_NET_S_LINK_UP;
> > > + virtnet_update_settings(vi);
> > > + netif_carrier_on(dev);
> > > + }
> > > +
> > >   return 0;
> > >
> > >  err_enable_qp:
> > > @@ -2928,12 +2963,19 @@ static int virtnet_close(struct net_device *dev)
> > >   disable_delayed_refill(vi);
> > >   /* Make sure refill_work doesn't re-enable napi! */
> > >   cancel_delayed_work_sync(&vi->refill);
> > > + /* Make sure config notification doesn't schedule config work */
> > > + virtio_config_disable(vi->vdev);
> > > + /* Make sure status updating is cancelled */
> > > + cancel_work_sync(&vi->config_work);
> > >
> > >   for (i = 0; i < vi->max_queue_pairs; i++) {
> > >   

Re: [PATCH v9 4/8] remoteproc: qcom: Add ssr subdevice identifier

2024-06-25 Thread Dmitry Baryshkov
On Tue, Jun 25, 2024 at 09:04:17AM GMT, Krzysztof Kozlowski wrote:
> On 25/06/2024 08:28, Gokul Sriram P wrote:
> > 
> > On 6/21/2024 8:40 PM, Krzysztof Kozlowski wrote:
> >> On 21/06/2024 13:46, Gokul Sriram Palanisamy wrote:
> >>> Add name for ssr subdevice on IPQ8074 SoC.
> >> Why?
> >    Oops! Missed the change. Will add and update.
> >>> Signed-off-by: Nikhil Prakash V
> >>> Signed-off-by: Sricharan R
> >>> Signed-off-by: Gokul Sriram Palanisamy
> >> Three people developed that single line?
> >>
> >> Something is really odd with your DCO chain.
> >   The change was originally authored by Nikhil and reviewed by 
> > Sricharan. I'm just submitting the change to upstream, so I retained 
> > their names.
> >>
> 
> Then your DCO chain is not correct. Please carefully read the
> patch-submission documentation, especially the sections about
> authorship, DCO, and review tags.

Also, there should be a "From: Nikhil" header before the patch.

-- 
With best wishes
Dmitry
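For what it's worth, preserving authorship is usually done by recording the original author on the commit, so that git format-patch emits the From: line automatically; a minimal illustration with placeholder identities (the names and addresses below are invented for the example):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q .
echo "q6v5_wcss" > ssr_name.txt
git add ssr_name.txt
# Commit as the sender but credit the original author (placeholder
# identities, not the real addresses):
git -c user.name="Gokul Sriram Palanisamy" \
    -c user.email="sender@example.com" \
    commit -q --author="Nikhil Prakash V <author@example.com>" \
    -m "remoteproc: qcom: add ssr subdevice name" \
    -m "Signed-off-by: Nikhil Prakash V <author@example.com>" \
    -m "Signed-off-by: Gokul Sriram Palanisamy <sender@example.com>"
# format-patch then starts the mail with the original author's From:
git format-patch -1 --stdout | head -n 2
```

The resulting mail carries the author in its From: header, while the submitter's own Signed-off-by closes the DCO chain.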



Re: [PATCH V2 1/3] virtio: allow nested disabling of the configure interrupt

2024-06-25 Thread Michael S. Tsirkin
On Tue, Jun 25, 2024 at 09:27:04AM +0800, Jason Wang wrote:
> On Mon, Jun 24, 2024 at 5:59 PM Michael S. Tsirkin  wrote:
> >
> > On Mon, Jun 24, 2024 at 10:45:21AM +0800, Jason Wang wrote:
> > > Sometimes a driver may want to enable or disable the config callback.
> > > This requires synchronization with the core. So this patch changes
> > > config_enabled to be an integer counter. This allows the toggling of
> > > config_enabled to be synchronized between the virtio core and the
> > > virtio driver.
> > >
> > > The counter is not allowed to increase beyond one; this simplifies
> > > the logic, as the interrupt can be disabled immediately
> > > without extra synchronization between driver and core.
> > >
> > > Signed-off-by: Jason Wang 
> > > ---
> > >  drivers/virtio/virtio.c | 20 +---
> > >  include/linux/virtio.h  |  2 +-
> > >  2 files changed, 14 insertions(+), 8 deletions(-)
> > >
> > > diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
> > > index b968b2aa5f4d..d3aa74b8ae5d 100644
> > > --- a/drivers/virtio/virtio.c
> > > +++ b/drivers/virtio/virtio.c
> > > @@ -127,7 +127,7 @@ static void __virtio_config_changed(struct 
> > > virtio_device *dev)
> > >  {
> > >   struct virtio_driver *drv = drv_to_virtio(dev->dev.driver);
> > >
> > > - if (!dev->config_enabled)
> > > + if (dev->config_enabled < 1)
> > >   dev->config_change_pending = true;
> > >   else if (drv && drv->config_changed)
> > >   drv->config_changed(dev);
> > > @@ -146,17 +146,23 @@ EXPORT_SYMBOL_GPL(virtio_config_changed);
> > >  static void virtio_config_disable(struct virtio_device *dev)
> > >  {
> > >   spin_lock_irq(&dev->config_lock);
> > > - dev->config_enabled = false;
> > > + --dev->config_enabled;
> > >   spin_unlock_irq(&dev->config_lock);
> > >  }
> > >
> > >  static void virtio_config_enable(struct virtio_device *dev)
> > >  {
> > >   spin_lock_irq(&dev->config_lock);
> > > - dev->config_enabled = true;
> > > - if (dev->config_change_pending)
> > > - __virtio_config_changed(dev);
> > > - dev->config_change_pending = false;
> > > +
> > > + if (dev->config_enabled < 1) {
> > > + ++dev->config_enabled;
> > > + if (dev->config_enabled == 1 &&
> > > + dev->config_change_pending) {
> > > + __virtio_config_changed(dev);
> > > + dev->config_change_pending = false;
> > > + }
> > > + }
> > > +
> > >   spin_unlock_irq(&dev->config_lock);
> > >  }
> > >
> >
> > So every disable decrements the counter. Enable only increments it up to 1.
> > You seem to be making some very specific assumptions
> > about how this API will be used. Any misuse will lead to under/overflow
> > eventually ...
> >
> 
> Well, a counter gives us more information than a boolean. With a
> boolean, misuse is even harder to notice.

With a boolean we can prevent misuse easily because the previous state
is known exactly. E.g.:

static void virtio_config_driver_disable(struct virtio_device *dev)
{
BUG_ON(dev->config_driver_disabled);
dev->config_driver_disabled = true;
}



static void virtio_config_driver_enable(struct virtio_device *dev)
{
BUG_ON(!dev->config_driver_disabled);
dev->config_driver_disabled = false;
}


That does not work with an integer: you simply have no idea what the
value should be at the point of call.


> >
> >
> > My suggestion would be to
> > 1. rename config_enabled to config_core_enabled
> > 2. rename virtio_config_enable/disable to virtio_config_core_enable/disable
> > 3. add bool config_driver_disabled and make virtio_config_enable/disable
> >switch that.
> > 4. Change logic from dev->config_enabled to
> >dev->config_core_enabled && !dev->config_driver_disabled
> 
> If we make config_driver_disabled true by default,

No, we make it false by default.

> we need someone to
> enable it explicitly. If it's the core, it breaks the semantic that it
> is under the control of the driver (or needs to synchronize with the
> driver). If it's the driver, each driver needs to enable it at some
> point, which can easily be forgotten. And we may end up with workarounds like:
> 
> /* If probe didn't do it, mark device DRIVER_OK ourselves. */
> if (!(dev->config->get_status(dev) & VIRTIO_CONFIG_S_DRIVER_OK))
> virtio_device_ready(dev);
> 
> That's another break of the semantics, and the above is actually also racy.
> 
> It seems the only choice is to make config_driver_disabled false by
> default. But the driver needs to be aware of this and take extra care
> when calling virtio_device_ready(), which is also tricky.


No, false by default simply means no change to semantics.


> 
> So in conclusion, two booleans seem sub-optimal compared to a counter.
> For example, we could use different bits of the counter, as preempt_count
> does. With counter(s), core and driver don't need any implicit/explicit

Re: [PATCH v9 4/8] remoteproc: qcom: Add ssr subdevice identifier

2024-06-25 Thread Krzysztof Kozlowski
On 25/06/2024 08:28, Gokul Sriram P wrote:
> 
> On 6/21/2024 8:40 PM, Krzysztof Kozlowski wrote:
>> On 21/06/2024 13:46, Gokul Sriram Palanisamy wrote:
>>> Add name for ssr subdevice on IPQ8074 SoC.
>> Why?
>    Oops! Missed the change. Will add and update.
>>> Signed-off-by: Nikhil Prakash V
>>> Signed-off-by: Sricharan R
>>> Signed-off-by: Gokul Sriram Palanisamy
>> Three people developed that single line?
>>
>> Something is really odd with your DCO chain.
>   The change was originally authored by Nikhil and reviewed by 
> Sricharan. I'm just submitting the change to upstream, so I retained 
> their names.
>>

Then your DCO chain is not correct. Please carefully read the
patch-submission documentation, especially the sections about
authorship, DCO, and review tags.

Best regards,
Krzysztof