Re: [PATCH] hwspinlock/core: fix kernel-doc warnings

2023-12-15 Thread Bagas Sanjaya
On Tue, Dec 05, 2023 at 09:54:39PM -0800, Randy Dunlap wrote:
> diff -- a/drivers/hwspinlock/hwspinlock_core.c 
> b/drivers/hwspinlock/hwspinlock_core.c
> --- a/drivers/hwspinlock/hwspinlock_core.c
> +++ b/drivers/hwspinlock/hwspinlock_core.c
> @@ -84,8 +84,9 @@ static DEFINE_MUTEX(hwspinlock_tree_lock
>   * should decide between spin_trylock, spin_trylock_irq and
>   * spin_trylock_irqsave.
>   *
> - * Returns 0 if we successfully locked the hwspinlock or -EBUSY if
> + * Returns: %0 if we successfully locked the hwspinlock or -EBUSY if
>   * the hwspinlock was already taken.
> + *
>   * This function will never sleep.
>   */
>  int __hwspin_trylock(struct hwspinlock *hwlock, int mode, unsigned long 
> *flags)
> @@ -171,7 +172,7 @@ EXPORT_SYMBOL_GPL(__hwspin_trylock);
>  /**
>   * __hwspin_lock_timeout() - lock an hwspinlock with timeout limit
>   * @hwlock: the hwspinlock to be locked
> - * @timeout: timeout value in msecs
> + * @to: timeout value in msecs
>   * @mode: mode which controls whether local interrupts are disabled or not
>   * @flags: a pointer to where the caller's interrupt state will be saved at 
> (if
>   * requested)
> @@ -199,9 +200,11 @@ EXPORT_SYMBOL_GPL(__hwspin_trylock);
>   * to choose the appropriate @mode of operation, exactly the same way users
>   * should decide between spin_lock, spin_lock_irq and spin_lock_irqsave.
>   *
> - * Returns 0 when the @hwlock was successfully taken, and an appropriate
> + * Returns: %0 when the @hwlock was successfully taken, and an appropriate
>   * error code otherwise (most notably -ETIMEDOUT if the @hwlock is still
> - * busy after @timeout msecs). The function will never sleep.
> + * busy after @timeout msecs).
> + *
> + * The function will never sleep.
>   */
>  int __hwspin_lock_timeout(struct hwspinlock *hwlock, unsigned int to,
>   int mode, unsigned long *flags)
> @@ -304,13 +307,12 @@ EXPORT_SYMBOL_GPL(__hwspin_unlock);
>  
>  /**
>   * of_hwspin_lock_simple_xlate - translate hwlock_spec to return a lock id
> - * @bank: the hwspinlock device bank
>   * @hwlock_spec: hwlock specifier as found in the device tree
>   *
>   * This is a simple translation function, suitable for hwspinlock platform
>   * drivers that only has a lock specifier length of 1.
>   *
> - * Returns a relative index of the lock within a specified bank on success,
> + * Returns: a relative index of the lock within a specified bank on success,
>   * or -EINVAL on invalid specifier cell count.
>   */
>  static inline int
> @@ -332,9 +334,10 @@ of_hwspin_lock_simple_xlate(const struct
>   * hwspinlock device, so that it can be requested using the normal
>   * hwspin_lock_request_specific() API.
>   *
> - * Returns the global lock id number on success, -EPROBE_DEFER if the 
> hwspinlock
> - * device is not yet registered, -EINVAL on invalid args specifier value or 
> an
> - * appropriate error as returned from the OF parsing of the DT client node.
> + * Returns: the global lock id number on success, -EPROBE_DEFER if the
> + * hwspinlock device is not yet registered, -EINVAL on invalid args
> + * specifier value or an appropriate error as returned from the OF parsing
> + * of the DT client node.
>   */
>  int of_hwspin_lock_get_id(struct device_node *np, int index)
>  {
> @@ -399,9 +402,10 @@ EXPORT_SYMBOL_GPL(of_hwspin_lock_get_id)
>   * the hwspinlock device, so that it can be requested using the normal
>   * hwspin_lock_request_specific() API.
>   *
> - * Returns the global lock id number on success, -EPROBE_DEFER if the 
> hwspinlock
> - * device is not yet registered, -EINVAL on invalid args specifier value or 
> an
> - * appropriate error as returned from the OF parsing of the DT client node.
> + * Returns: the global lock id number on success, -EPROBE_DEFER if the
> + * hwspinlock device is not yet registered, -EINVAL on invalid args
> + * specifier value or an appropriate error as returned from the OF parsing
> + * of the DT client node.
>   */
>  int of_hwspin_lock_get_id_byname(struct device_node *np, const char *name)
>  {
> @@ -481,7 +485,7 @@ out:
>   *
>   * Should be called from a process context (might sleep)
>   *
> - * Returns 0 on success, or an appropriate error code on failure
> + * Returns: %0 on success, or an appropriate error code on failure
>   */
>  int hwspin_lock_register(struct hwspinlock_device *bank, struct device *dev,
>   const struct hwspinlock_ops *ops, int base_id, int num_locks)
> @@ -529,7 +533,7 @@ EXPORT_SYMBOL_GPL(hwspin_lock_register);
>   *
>   * Should be called from a process context (might sleep)
>   *
> - * Returns 0 on success, or an appropriate error code on failure
> + * Returns: %0 on success, or an appropriate error code on failure
>   */
>  int hwspin_lock_unregister(struct hwspinlock_device *bank)
>  {
> @@ -578,7 +582,7 @@ static int devm_hwspin_lock_device_match
>   *
>   * Should be called from a process context (might sleep)
>   *
> - * Returns 0 on success, 

Re: [PATCH v5 4/4] vduse: Add LSM hook to check Virtio device type

2023-12-15 Thread Serge E. Hallyn
On Tue, Dec 12, 2023 at 02:55:33PM -0800, Casey Schaufler wrote:
> On 12/12/2023 9:59 AM, Michael S. Tsirkin wrote:
> > On Tue, Dec 12, 2023 at 08:33:39AM -0800, Casey Schaufler wrote:
> >> On 12/12/2023 5:17 AM, Maxime Coquelin wrote:
> >>> This patch introduces a LSM hook for devices creation,
> >>> destruction (ioctl()) and opening (open()) operations,
> >>> checking the application is allowed to perform these
> >>> operations for the Virtio device type.
> >> My earlier comments on a vduse specific LSM hook still hold.
> >> I would much prefer to see a device permissions hook(s) that
> >> are useful for devices in general. Not just vduse devices.
> >> I know that there are already some very special purpose LSM
> >> hooks, but the experience with maintaining them is why I don't
> >> want more of them. 
> > What exactly does this mean?
> 
> You have proposed an LSM hook that is only useful for vduse.
> You want to implement a set of controls that only apply to vduse.
> I can't help but think that if someone (i.e. you) wants to control
> device creation for vduse that there could well be a use case for
> control over device creation for some other set of devices. It is
> quite possible that someone out there is desperately trying to
> solve the same problem you have, but with a different device.
> 
> I have no desire to have to deal with
>   security_vduse_perm_check()
>   security_odddev_perm_check()
>   ...
>   security_evendev_perm_check()
> 
> when we should be able to have
>   security_device_perm_check()
> 
> that can service them all.
> 
> 
> >  Devices like tap etc? How do we
> > find them all though?
> 
> I'm not suggesting you find them all. I'm suggesting that you provide
> an interface that someone could use if they wanted to. I think you
> will be surprised how many will appear (with complaints about the
> interface you propose, of course) if you implement a generally useful
> LSM hook.

Right now you have create, destroy, and open.  Are you expecting to add
other perms?  These sound generic enough that it definitely seems worth
doing as Casey suggests.  On the other hand, if this could become a
gateway to lsm device access hooks basically becoming ioctl, we might
want to consider that.
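
For concreteness, the kind of generic hook Casey is describing could look
roughly like this (a sketch only; none of these names exist today, and the
device-class argument and permission enum are illustrative, not a proposed
ABI):

	/* sketch: a device-type-agnostic permission hook */
	enum device_perm {
		DEVICE_PERM_CREATE,
		DEVICE_PERM_DESTROY,
		DEVICE_PERM_OPEN,
	};

	int security_device_perm_check(int device_class, u32 device_id,
				       enum device_perm perm);

	/* vduse (or tap, or any other device provider) would then call, e.g.: */
	if (security_device_perm_check(DEVICE_CLASS_VDUSE, config->device_id,
				       DEVICE_PERM_CREATE))
		return -EPERM;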

> >>> Signed-off-by: Maxime Coquelin 
> >>> ---
> >>>  MAINTAINERS |  1 +
> >>>  drivers/vdpa/vdpa_user/vduse_dev.c  | 13 
> >>>  include/linux/lsm_hook_defs.h   |  2 ++
> >>>  include/linux/security.h|  6 ++
> >>>  include/linux/vduse.h   | 14 +
> >>>  security/security.c | 15 ++
> >>>  security/selinux/hooks.c| 32 +
> >>>  security/selinux/include/classmap.h |  2 ++
> >>>  8 files changed, 85 insertions(+)
> >>>  create mode 100644 include/linux/vduse.h
> >>>
> >>> diff --git a/MAINTAINERS b/MAINTAINERS
> >>> index a0fb0df07b43..4e83b14358d2 100644
> >>> --- a/MAINTAINERS
> >>> +++ b/MAINTAINERS
> >>> @@ -23040,6 +23040,7 @@ F:drivers/net/virtio_net.c
> >>>  F:   drivers/vdpa/
> >>>  F:   drivers/virtio/
> >>>  F:   include/linux/vdpa.h
> >>> +F:   include/linux/vduse.h
> >>>  F:   include/linux/virtio*.h
> >>>  F:   include/linux/vringh.h
> >>>  F:   include/uapi/linux/virtio_*.h
> >>> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> >>> b/drivers/vdpa/vdpa_user/vduse_dev.c
> >>> index fa62825be378..59ab7eb62e20 100644
> >>> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> >>> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> >>> @@ -8,6 +8,7 @@
> >>>   *
> >>>   */
> >>>  
> >>> +#include "linux/security.h"
> >>>  #include 
> >>>  #include 
> >>>  #include 
> >>> @@ -30,6 +31,7 @@
> >>>  #include 
> >>>  #include 
> >>>  #include 
> >>> +#include 
> >>>  
> >>>  #include "iova_domain.h"
> >>>  
> >>> @@ -1442,6 +1444,10 @@ static int vduse_dev_open(struct inode *inode, 
> >>> struct file *file)
> >>>   if (dev->connected)
> >>>   goto unlock;
> >>>  
> >>> + ret = -EPERM;
> >>> + if (security_vduse_perm_check(VDUSE_PERM_OPEN, dev->device_id))
> >>> + goto unlock;
> >>> +
> >>>   ret = 0;
> >>>   dev->connected = true;
> >>>   file->private_data = dev;
> >>> @@ -1664,6 +1670,9 @@ static int vduse_destroy_dev(char *name)
> >>>   if (!dev)
> >>>   return -EINVAL;
> >>>  
> >>> + if (security_vduse_perm_check(VDUSE_PERM_DESTROY, dev->device_id))
> >>> + return -EPERM;
> >>> +
> >>>   mutex_lock(&dev->lock);
> >>>   if (dev->vdev || dev->connected) {
> >>>   mutex_unlock(&dev->lock);
> >>> @@ -1828,6 +1837,10 @@ static int vduse_create_dev(struct 
> >>> vduse_dev_config *config,
> >>>   int ret;
> >>>   struct vduse_dev *dev;
> >>>  
> >>> + ret = -EPERM;
> >>> + if (security_vduse_perm_check(VDUSE_PERM_CREATE, config->device_id))
> >>> + goto err;
> >>> +
> >>>   ret = -EEXIST;
> >>>   if (vduse_find_dev(config->name))
> >>>   goto err;
> >>> diff --git 

[PATCH v8 3/3] remoteproc: zynqmp: parse TCM from device tree

2023-12-15 Thread Tanmay Shah
ZynqMP TCM information was hardcoded in the driver. Now the ZynqMP TCM
information is available in the device-tree. Parse TCM information in
the driver as per the new bindings.

Signed-off-by: Tanmay Shah 
---

Changes in v8:
  - parse power-domains property from device-tree and use EEMI calls
to power on/off TCM instead of using pm domains framework
  - Remove checking of pm_domain_id validation to power on/off tcm
  - Remove spurious change

Changes in v7:
  - move checking of pm_domain_id from previous patch
  - fix mem_bank_data memory allocation

 drivers/remoteproc/xlnx_r5_remoteproc.c | 154 +++-
 1 file changed, 148 insertions(+), 6 deletions(-)

diff --git a/drivers/remoteproc/xlnx_r5_remoteproc.c 
b/drivers/remoteproc/xlnx_r5_remoteproc.c
index 4395edea9a64..36d73dcd93f0 100644
--- a/drivers/remoteproc/xlnx_r5_remoteproc.c
+++ b/drivers/remoteproc/xlnx_r5_remoteproc.c
@@ -74,8 +74,8 @@ struct mbox_info {
 };
 
 /*
- * Hardcoded TCM bank values. This will be removed once TCM bindings are
- * accepted for system-dt specifications and upstreamed in linux kernel
+ * Hardcoded TCM bank values. This will stay in driver to maintain backward
+ * compatibility with device-tree that does not have TCM information.
  */
 static const struct mem_bank_data zynqmp_tcm_banks_split[] = {
{0xffe0UL, 0x0, 0x1UL, PD_R5_0_ATCM, "atcm0"}, /* TCM 64KB each 
*/
@@ -878,6 +878,139 @@ static struct zynqmp_r5_core 
*zynqmp_r5_add_rproc_core(struct device *cdev)
return ERR_PTR(ret);
 }
 
+static int zynqmp_r5_get_tcm_node_from_dt(struct zynqmp_r5_cluster *cluster)
+{
+   struct of_phandle_args out_args;
+   int tcm_reg_per_r5, tcm_pd_idx;
+   struct zynqmp_r5_core *r5_core;
+   int i, j, tcm_bank_count, ret;
+   struct platform_device *cpdev;
+   struct mem_bank_data *tcm;
+   struct device_node *np;
+   struct resource *res;
+   u64 abs_addr, size;
+   struct device *dev;
+
+   for (i = 0; i < cluster->core_count; i++) {
+   r5_core = cluster->r5_cores[i];
+   dev = r5_core->dev;
+   np = of_node_get(dev_of_node(dev));
+   tcm_pd_idx = 1;
+
+   /* we have address cell 2 and size cell as 2 */
+   tcm_reg_per_r5 = of_property_count_elems_of_size(np, "reg",
+4 * sizeof(u32));
+   if (tcm_reg_per_r5 <= 0) {
+   dev_err(dev, "can't get reg property err %d\n", tcm_reg_per_r5);
+   return -EINVAL;
+   }
+
+   /*
+* In lockstep mode, r5 core 0 will use r5 core 1 TCM
+* power domains as well. so allocate twice of per core TCM
+*/
+   if (cluster->mode == LOCKSTEP_MODE)
+   tcm_bank_count = tcm_reg_per_r5 * 2;
+   else
+   tcm_bank_count = tcm_reg_per_r5;
+
+   r5_core->tcm_banks = devm_kcalloc(dev, tcm_bank_count,
+ sizeof(struct mem_bank_data *),
+ GFP_KERNEL);
+   if (!r5_core->tcm_banks)
+   ret = -ENOMEM;
+
+   r5_core->tcm_bank_count = tcm_bank_count;
+   for (j = 0; j < tcm_bank_count; j++) {
+   tcm = devm_kzalloc(dev, sizeof(struct mem_bank_data),
+  GFP_KERNEL);
+   if (!tcm)
+   return -ENOMEM;
+
+   r5_core->tcm_banks[j] = tcm;
+
+   /*
+* In lockstep mode, get second core's TCM power domains id
+* after first core TCM parsing is done as
+*/
+   if (j == tcm_reg_per_r5) {
+   /* dec first core node */
+   of_node_put(np);
+
+   /* get second core node */
+   np = of_get_next_child(cluster->dev->of_node, np);
+
+   /*
+* reset index of power-domains property list
+* for second core
+*/
+   tcm_pd_idx = 1;
+   }
+
+   /* get power-domains id of tcm */
+   ret = of_parse_phandle_with_args(np, "power-domains",
+"#power-domain-cells",
+tcm_pd_idx,
+&out_args);
+   if (ret) {
+   dev_err(r5_core->dev,
+   "failed to get tcm %d pm domain, ret %d\n",
+

[PATCH v8 1/3] dt-bindings: remoteproc: add Tightly Coupled Memory (TCM) bindings

2023-12-15 Thread Tanmay Shah
From: Radhey Shyam Pandey 

Introduce bindings for the TCM memory address space on the AMD-Xilinx Zynq
UltraScale+ platform. It will help in defining TCM in the device-tree
and make its access platform agnostic and data-driven.

Tightly-coupled memories (TCMs) are low-latency memory that provides
predictable instruction execution and predictable data load/store
timing. Each Cortex-R5F processor contains two 64-bit wide 64 KB memory
banks on the ATCM and BTCM ports, for a total of 128 KB of memory.

The TCM resources (reg, reg-names and power-domain) are documented for
each TCM in the R5 node. The reg and reg-names are made required
properties as we don't want to hardcode TCM addresses for future
platforms, and for the zu+ legacy implementation this will ensure that
old dts without reg/reg-names still work and a stable ABI is maintained.

It also extends the examples for TCM split and lockstep modes.

Signed-off-by: Radhey Shyam Pandey 
Signed-off-by: Tanmay Shah 
Acked-by: Rob Herring 
---
 .../remoteproc/xlnx,zynqmp-r5fss.yaml | 131 +++---
 1 file changed, 113 insertions(+), 18 deletions(-)

diff --git 
a/Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml 
b/Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml
index 78aac69f1060..9ecd63ea1b38 100644
--- a/Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml
+++ b/Documentation/devicetree/bindings/remoteproc/xlnx,zynqmp-r5fss.yaml
@@ -20,6 +20,17 @@ properties:
   compatible:
 const: xlnx,zynqmp-r5fss
 
+  "#address-cells":
+const: 2
+
+  "#size-cells":
+const: 2
+
+  ranges:
+description: |
+  Standard ranges definition providing address translations for
+  local R5F TCM address spaces to bus addresses.
+
   xlnx,cluster-mode:
 $ref: /schemas/types.yaml#/definitions/uint32
 enum: [0, 1, 2]
@@ -37,7 +48,7 @@ properties:
   2: single cpu mode
 
 patternProperties:
-  "^r5f-[a-f0-9]+$":
+  "^r5f@[0-9a-f]+$":
 type: object
 description: |
   The RPU is located in the Low Power Domain of the Processor Subsystem.
@@ -54,8 +65,19 @@ patternProperties:
   compatible:
 const: xlnx,zynqmp-r5f
 
+  reg:
+items:
+  - description: ATCM internal memory region
+  - description: BTCM internal memory region
+
+  reg-names:
+items:
+  - const: atcm
+  - const: btcm
+
   power-domains:
-maxItems: 1
+minItems: 1
+maxItems: 3
 
   mboxes:
 minItems: 1
@@ -102,34 +124,107 @@ patternProperties:
 required:
   - compatible
   - power-domains
+  - reg
+  - reg-names
 
 unevaluatedProperties: false
 
 required:
   - compatible
+  - "#address-cells"
+  - "#size-cells"
+  - ranges
 
 additionalProperties: false
 
 examples:
   - |
-remoteproc {
-compatible = "xlnx,zynqmp-r5fss";
-xlnx,cluster-mode = <1>;
-
-r5f-0 {
-compatible = "xlnx,zynqmp-r5f";
-power-domains = <_firmware 0x7>;
-memory-region = <_0_fw_image>, <>, 
<>, <>;
-mboxes = <_mailbox_rpu0 0>, <_mailbox_rpu0 1>;
-mbox-names = "tx", "rx";
+#include 
+
+//Split mode configuration
+soc {
+#address-cells = <2>;
+#size-cells = <2>;
+
+remoteproc@ffe0 {
+compatible = "xlnx,zynqmp-r5fss";
+xlnx,cluster-mode = <0>;
+
+#address-cells = <2>;
+#size-cells = <2>;
+ranges = <0x0 0x0 0x0 0xffe0 0x0 0x1>,
+ <0x0 0x2 0x0 0xffe2 0x0 0x1>,
+ <0x1 0x0 0x0 0xffe9 0x0 0x1>,
+ <0x1 0x2 0x0 0xffeb 0x0 0x1>;
+
+r5f@0 {
+compatible = "xlnx,zynqmp-r5f";
+reg = <0x0 0x0 0x0 0x1>, <0x0 0x2 0x0 0x1>;
+reg-names = "atcm", "btcm";
+power-domains = <_firmware PD_RPU_0>,
+<_firmware PD_R5_0_ATCM>,
+<_firmware PD_R5_0_BTCM>;
+memory-region = <_0_fw_image>, <>,
+<>, <>;
+mboxes = <_mailbox_rpu0 0>, <_mailbox_rpu0 1>;
+mbox-names = "tx", "rx";
+};
+
+r5f@1 {
+compatible = "xlnx,zynqmp-r5f";
+reg = <0x1 0x0 0x0 0x1>, <0x1 0x2 0x0 0x1>;
+reg-names = "atcm", "btcm";
+power-domains = <_firmware PD_RPU_1>,
+<_firmware PD_R5_1_ATCM>,
+<_firmware PD_R5_1_BTCM>;
+memory-region = <_1_fw_image>, <>,
+<>, <>;
+mboxes = <_mailbox_rpu1 0>, <_mailbox_rpu1 1>;
+mbox-names = "tx", "rx";
+};
 };
+};
 
-r5f-1 {
-compatible = "xlnx,zynqmp-r5f";
-

[PATCH v8 0/3] add zynqmp TCM bindings

2023-12-15 Thread Tanmay Shah
Tightly-Coupled Memories (TCMs) are low-latency memory that provides
predictable instruction execution and predictable data load/store
timing. Each Cortex-R5F processor exclusively contains two 64 KB memory
banks on the ATCM and BTCM ports, for a total of 128 KB of memory.
In lockstep mode, both processors' 128 KB memories are accessible to the cluster.

As per the ZynqMP UltraScale+ Technical Reference Manual UG1085, the following
is the address space of the TCM memory. The bindings in this patch series
introduce properties to accommodate the following address space, with
address translation between the Linux and Cortex-R5 views.

| *Mode*             | *R5 View*    | *Linux view* | Notes                |
| ------------------ | ------------ | ------------ | -------------------- |
| *Split Mode*       | *start addr* | *start addr* |                      |
| R5_0 ATCM (64 KB)  | 0x0000_0000  | 0xFFE0_0000  |                      |
| R5_0 BTCM (64 KB)  | 0x0002_0000  | 0xFFE2_0000  |                      |
| R5_1 ATCM (64 KB)  | 0x0000_0000  | 0xFFE9_0000  | alias of 0xFFE1_0000 |
| R5_1 BTCM (64 KB)  | 0x0002_0000  | 0xFFEB_0000  | alias of 0xFFE3_0000 |
| *Lockstep Mode*    |              |              |                      |
| R5_0 ATCM (128 KB) | 0x0000_0000  | 0xFFE0_0000  |                      |
| R5_0 BTCM (128 KB) | 0x0002_0000  | 0xFFE2_0000  |                      |

References:
UG1085 TCM address space:
https://docs.xilinx.com/r/en-US/ug1085-zynq-ultrascale-trm/Tightly-Coupled-Memory-Address-Map

Changes in v8:
  - Remove use of pm_domains framework
  - Remove checking of pm_domain_id validation to power on/off tcm
  - Remove spurious change
  - parse power-domains property from device-tree and use EEMI calls
to power on/off TCM instead of using pm domains framework

Changes in v7:
  - %s/pm_dev1/pm_dev_core0/r
  - %s/pm_dev_link1/pm_dev_core0_link/r
  - %s/pm_dev2/pm_dev_core1/r
  - %s/pm_dev_link2/pm_dev_core1_link/r
  - remove pm_domain_id check to move next patch
  - add comment about how 1st entry in pm domain list is used
  - fix loop when jump to fail_add_pm_domains loop
  - move checking of pm_domain_id from previous patch
  - fix mem_bank_data memory allocation

Changes in v6:
  - Introduce new node entry for r5f cluster split mode dts and
keep it disabled by default.
  - Keep remoteproc lockstep mode enabled by default to maintian
back compatibility.
  - Enable split mode only for zcu102 board to demo split mode use
  - Remove spurious change
  - Handle errors in add_pm_domains function
  - Remove redundant code to handle errors from remove_pm_domains
  - Missing . at the end of the commit message
  - remove redundant initialization of variables
  - remove fail_tcm label and relevant code to free memory
acquired using devm_* API. As this will be freed when device free it
  - add extra check to see if "reg" property is supported or not

Changes in v5:
  - maintain Rob's Ack on bindings patch as no changes in bindings
  - split previous patch into multiple patches
  - Use pm domain framework to turn on/off TCM
  - Add support of parsing TCM information from device-tree
  - maintain backward compatibility with previous bindings without
TCM information available in device-tree

This patch series continues previous effort to upstream ZynqMP
TCM bindings:
Previous v4 version link:
https://lore.kernel.org/all/20230829181900.2561194-1-tanmay.s...@amd.com/

Previous v3 version link:
https://lore.kernel.org/all/1689964908-22371-1-git-send-email-radhey.shyam.pan...@amd.com/

Radhey Shyam Pandey (1):
  dt-bindings: remoteproc: add Tightly Coupled Memory (TCM) bindings

Tanmay Shah (2):
  dts: zynqmp: add properties for TCM in remoteproc
  remoteproc: zynqmp: parse TCM from device tree

 .../remoteproc/xlnx,zynqmp-r5fss.yaml | 131 +--
 .../boot/dts/xilinx/zynqmp-zcu102-rev1.0.dts  |   8 +
 arch/arm64/boot/dts/xilinx/zynqmp.dtsi|  60 ++-
 drivers/remoteproc/xlnx_r5_remoteproc.c   | 154 +-
 4 files changed, 324 insertions(+), 29 deletions(-)


base-commit: 7641890179f913ce73d9dae490b8ab74970fc552
-- 
2.25.1




[PATCH v8 2/3] dts: zynqmp: add properties for TCM in remoteproc

2023-12-15 Thread Tanmay Shah
Add properties as per new bindings in zynqmp remoteproc node
to represent TCM address and size.

This patch also adds an alternative remoteproc node to represent the
remoteproc cluster in split mode. By default lockstep mode is
enabled and users should disable it before using the split mode
dts. Both device-tree nodes can't be used simultaneously; one
of them must be disabled. For the zcu102-1.0 and zcu102-1.1 boards the
remoteproc split mode dts node is enabled and the lockstep mode
dts is disabled.

Signed-off-by: Tanmay Shah 
---
 .../boot/dts/xilinx/zynqmp-zcu102-rev1.0.dts  |  8 +++
 arch/arm64/boot/dts/xilinx/zynqmp.dtsi| 60 +--
 2 files changed, 63 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/boot/dts/xilinx/zynqmp-zcu102-rev1.0.dts 
b/arch/arm64/boot/dts/xilinx/zynqmp-zcu102-rev1.0.dts
index c8f71a1aec89..495ca94b45db 100644
--- a/arch/arm64/boot/dts/xilinx/zynqmp-zcu102-rev1.0.dts
+++ b/arch/arm64/boot/dts/xilinx/zynqmp-zcu102-rev1.0.dts
@@ -14,6 +14,14 @@ / {
compatible = "xlnx,zynqmp-zcu102-rev1.0", "xlnx,zynqmp-zcu102", 
"xlnx,zynqmp";
 };
 
+&rproc_split {
+   status = "okay";
+};
+
+&rproc_lockstep {
+   status = "disabled";
+};
+
  {
#address-cells = <1>;
#size-cells = <1>;
diff --git a/arch/arm64/boot/dts/xilinx/zynqmp.dtsi 
b/arch/arm64/boot/dts/xilinx/zynqmp.dtsi
index b61fc99cd911..602e6aba7ac5 100644
--- a/arch/arm64/boot/dts/xilinx/zynqmp.dtsi
+++ b/arch/arm64/boot/dts/xilinx/zynqmp.dtsi
@@ -247,19 +247,69 @@ fpga_full: fpga-full {
ranges;
};
 
-   remoteproc {
+   rproc_lockstep: remoteproc@ffe0 {
compatible = "xlnx,zynqmp-r5fss";
xlnx,cluster-mode = <1>;
 
-   r5f-0 {
+   #address-cells = <2>;
+   #size-cells = <2>;
+
+   ranges = <0x0 0x0 0x0 0xffe0 0x0 0x2>,
+<0x0 0x2 0x0 0xffe2 0x0 0x2>,
+<0x1 0x0 0x0 0xffe9 0x0 0x1>,
+<0x1 0x2 0x0 0xffeb 0x0 0x1>;
+
+   r5f@0 {
+   compatible = "xlnx,zynqmp-r5f";
+   reg = <0x0 0x0 0x0 0x2>, <0x0 0x2 0x0 0x2>;
+   reg-names = "atcm", "btcm";
+   power-domains = <_firmware PD_RPU_0>,
+   <_firmware PD_R5_0_ATCM>,
+   <_firmware PD_R5_0_BTCM>;
+   memory-region = <_0_fw_image>;
+   };
+
+   r5f@1 {
+   compatible = "xlnx,zynqmp-r5f";
+   reg = <0x1 0x0 0x0 0x1>, <0x1 0x2 0x0 0x1>;
+   reg-names = "atcm", "btcm";
+   power-domains = <_firmware PD_RPU_1>,
+   <_firmware PD_R5_1_ATCM>,
+   <_firmware PD_R5_1_BTCM>;
+   memory-region = <_1_fw_image>;
+   };
+   };
+
+   rproc_split: remoteproc-split@ffe0 {
+   status = "disabled";
+   compatible = "xlnx,zynqmp-r5fss";
+   xlnx,cluster-mode = <0>;
+
+   #address-cells = <2>;
+   #size-cells = <2>;
+
+   ranges = <0x0 0x0 0x0 0xffe0 0x0 0x1>,
+<0x0 0x2 0x0 0xffe2 0x0 0x1>,
+<0x1 0x0 0x0 0xffe9 0x0 0x1>,
+<0x1 0x2 0x0 0xffeb 0x0 0x1>;
+
+   r5f@0 {
compatible = "xlnx,zynqmp-r5f";
-   power-domains = <_firmware PD_RPU_0>;
+   reg = <0x0 0x0 0x0 0x1>, <0x0 0x2 0x0 0x1>;
+   reg-names = "atcm", "btcm";
+   power-domains = <_firmware PD_RPU_0>,
+   <_firmware PD_R5_0_ATCM>,
+   <_firmware PD_R5_0_BTCM>;
memory-region = <_0_fw_image>;
};
 
-   r5f-1 {
+   r5f@1 {
compatible = "xlnx,zynqmp-r5f";
-   power-domains = <_firmware PD_RPU_1>;
+   reg = <0x1 0x0 0x0 0x1>, <0x1 0x2 0x0 0x1>;
+   reg-names = "atcm", "btcm";
+   power-domains = <_firmware PD_RPU_1>,
+   <_firmware PD_R5_1_ATCM>,
+   <_firmware PD_R5_1_BTCM>;
memory-region = <_1_fw_image>;
};
};
-- 
2.25.1




Re: [PATCH v2 2/2] ARM: dts: qcom: msm8926-motorola-peregrine: Add initial device tree

2023-12-15 Thread André Apitzsch
Am Freitag, dem 15.12.2023 um 00:37 +0100 schrieb Konrad Dybcio:
> 
> 
> On 12/14/23 21:59, André Apitzsch wrote:
> > This dts adds support for Motorola Moto G 4G released in 2013.
> > 
> > Add a device tree with initial support for:
> > 
> > - GPIO keys
> > - Hall sensor
> > - SDHCI
> > - Vibrator
> > 
> > Signed-off-by: André Apitzsch 
> > ---
> Excellent, thanks!
> 
> Reviewed-by: Konrad Dybcio 
> 
> [...]
> 
> > +   pm8226_lvs1: lvs1 {
> > +   /* Pull-up for I2C lines */
> > +   regulator-always-on;
> > +   };
> just one q: is this intended to stay a-on, or will it be bound
> to some i2c host (presumably camera)?
> 
> Konrad

It's intended to stay always on. At least that's what downstream [1]
did.

André

[1] 
https://github.com/LineageOS/android_kernel_motorola_msm8226/commit/bac97acacb0a868fa2cf96e53e18d6653c409a1b



Re: [PATCH v6 09/12] x86/sgx: Restructure top-level EPC reclaim function

2023-12-15 Thread Haitao Huang

On Wed, 13 Dec 2023 05:17:11 -0600, Huang, Kai  wrote:


On Mon, 2023-12-11 at 22:04 -0600, Haitao Huang wrote:

Hi Kai

On Mon, 27 Nov 2023 03:57:03 -0600, Huang, Kai   
wrote:


> On Mon, 2023-11-27 at 00:27 +0800, Haitao Huang wrote:
> > On Mon, 20 Nov 2023 11:45:46 +0800, Huang, Kai 
> > wrote:
> >
> > > On Mon, 2023-10-30 at 11:20 -0700, Haitao Huang wrote:
> > > > From: Sean Christopherson 
> > > >
> > > > To prepare for per-cgroup reclamation, separate the top-level reclaim
> > > > function, sgx_reclaim_epc_pages(), into two separate functions:
> > > >
> > > > - sgx_isolate_epc_pages() scans and isolates reclaimable pages from a
> > > > given LRU list.
> > > > - sgx_do_epc_reclamation() performs the real reclamation for the
> > > > already isolated pages.
> > > >
> > > > Create a new function, sgx_reclaim_epc_pages_global(), calling those two
> > > > in succession, to replace the original sgx_reclaim_epc_pages().  The
> > > > above two functions will serve as building blocks for the reclamation
> > > > flows in later EPC cgroup implementation.
> > > >
> > > > sgx_do_epc_reclamation() returns the number of reclaimed pages.  The EPC
> > > > cgroup will use the result to track reclaiming progress.
> > > >
> > > > sgx_isolate_epc_pages() returns the additional number of pages to scan
> > > > for current epoch of reclamation. The EPC cgroup will use the result to
> > > > determine if more scanning to be done in LRUs in its children groups.
> > >
> > > This changelog says nothing about "why", but only mentions the
> > > "implementation".
> > >
> > > For instance, assuming we need to reclaim @npages_to_reclaim from the
> > > @epc_cgrp_to_reclaim and its descendants, why cannot we do:
> > >
> > > 	for_each_cgroup_and_descendants(&epc_cgrp_to_reclaim, &epc_cgrp) {
> > > 		if (npages_to_reclaim <= 0)
> > > 			return;
> > >
> > > 		npages_to_reclaim -= sgx_reclaim_pages_lru(&epc_cgrp->lru,
> > > 							   npages_to_reclaim);
> > > 	}
> > >
> > > Is there any difference to have "isolate" + "reclaim"?
> > >
> >
> > This is to optimize "reclaim". See how etrack was done in sgx_encl_ewb.
> > Here we just follow the same design as ksgxd for each reclamation cycle.

>
> I don't see how did you "follow" ksgxd.  If I am guessing correctly, you are
> afraid of there might be less than 16 pages in a given EPC cgroup, thus w/o
> splitting into "isolate" + "reclaim" you might feed the "reclaim" less than 16
> pages, which might cause some performance degrade?
>
> But is this a common case?  Should we even worry about this?
>
> I suppose for such new feature we should bring functionality first and then
> optimization if you have real performance data to show.
>
The concern is not about "reclaim less than 16".
I mean this is just refactoring with exactly the same design of ksgxd
preserved,


I literally have no idea what you are talking about here.  ksgxd() just calls
sgx_reclaim_pages(), which tries to reclaim 16 pages at once.


in that we first isolate as many candidate EPC pages (up to
16, ignore the unneeded SGX_NR_TO_SCAN_MAX for now), then does the ewb in
one shot without anything else done in between.


Assuming you are referring the implementation of sgx_reclaim_pages(), and
assuming the "isolate" you mean removing EPC pages from the list (which is
exactly what the sgx_isolate_epc_pages() in this patch does), what happens to
the loops of "backing store allocation" and "EBLOCK", before the loop of
EWB?  Eaten by you?




I skipped those as what really matters is to keep ewb loop separate and  
run in one shot for each reclaiming cycle, not dependent on number of  
LRUs.  All those loops in original sgx_reclaim_pages() except the  
"isolate" loop are not dealing with multiple LRUs of cgroups later. That's  
the reason to refactor out only the "isolate" part and loop it through  
cgroup LRUs in later patches.
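
Put differently, the end state I am aiming for is roughly the following
(illustrative pseudocode only; the iteration helper and the cgroup structure
are made-up names, and the real signatures in these patches differ):

	/* phase 1: cheap list manipulation, done once per LRU in the subtree */
	for_each_epc_cgroup_and_descendants(root, epc_cgrp) {
		if (!nr_to_scan)
			break;
		nr_to_scan -= sgx_isolate_epc_pages(&epc_cgrp->lru, nr_to_scan,
						    &isolated);
	}

	/* phase 2: one backing-store/EBLOCK/ETRACK/EWB pass over all isolated pages */
	reclaimed = sgx_do_epc_reclamation(&isolated);

so the expensive EWB batch still runs exactly once per reclamation cycle no
matter how many cgroup LRUs were scanned.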





As described in original
comments for the function sgx_reclaim_pages and sgx_encl_ewb, this is to
finish all ewb quickly while minimizing impact of IPI.

The way you proposed will work but alters the current design and behavior
if cgroups is enabled and EPCs of an enclave are tracked across multiple
LRUs within the descendant cgroups, in that you will have isolation loop,
backing store allocation loop, eblock loop interleaved with the ewb loop.




I have no idea what you are talking about.

The point is, with or w/o this patch, you can only reclaim 16 EPC pages in one
function call (as you have said you are going to remove SGX_NR_TO_SCAN_MAX,
which is a cipher to both of us).  The only difference I can see is, with this
patch, you can have multiple calls of "isolate" and then call the "do_reclaim"
once.

But what's the good of having the "isolate" if the "do_reclaim" can only
reclaim 16 pages anyway?

Back to my last reply, are you afraid of any LRU has less than 16 pages  

Re: [PATCH] tracing: Add disable-filter-buf option

2023-12-15 Thread Mathieu Desnoyers

On 2023-12-15 13:43, Steven Rostedt wrote:

On Fri, 15 Dec 2023 13:25:07 -0500
Mathieu Desnoyers  wrote:


I am not against exposing an ABI that allows userspace to alter the
filter behavior. I disagree on the way you plan to expose the ABI.


These are no different than the knobs for sched_debug


These are very much different. The sched features are enabled at
build-time by modifying kernel/sched/features.h. There is no userspace
ABI involved.





Exposing this option as an ABI in this way exposes too much internal
ring buffer implementation details to userspace.


There's already lots of exposure using options. The options are just
knobs, nothing more.



It exposes the following details which IMHO should be hidden or
configurable in a way that allows moving to a whole new mechanism
which will have significantly different characteristics in the
future:

It exposes that:

- filtering uses a copy to a temporary buffer, and
- that this copy is enabled by default.

Once exposed, those design constraints become immutable due to ABI.


No it is not. There is no such thing as "immutable ABI". The rule is
"don't break user space" If this functionality in the kernel goes away,
the knob could become a nop, and I doubt any user space will break
because of it.

That is, the only burden is keeping this option exposed. But it could
be just like that light switch that has nothing connected to it. It's
still there, but does nothing if you switch it. This knob can act the
same way. This does not in anyway prevent future innovation.


I am not comfortable with exposing internal ring buffer implementation
details to userspace which may or may not be deprecated as no-ops
in the future. This will lead to useless clutter.

One approach that might be somewhat better would be to only expose
those files when a new CONFIG_TRACING_DEBUG option is enabled. This
way userspace cannot rely on having those files present as part
of the ABI, but you would still be free to use them for selftests
by skipping the specific selftests if the config option is not
enabled.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [PATCH] tracing: Add disable-filter-buf option

2023-12-15 Thread Steven Rostedt
On Fri, 15 Dec 2023 13:25:07 -0500
Mathieu Desnoyers  wrote:
> 
> I am not against exposing an ABI that allows userspace to alter the
> filter behavior. I disagree on the way you plan to expose the ABI.

These are no different than the knobs for sched_debug

> 
> Exposing this option as an ABI in this way exposes too much internal
> ring buffer implementation details to userspace.

There's already lots of exposure using options. The options are just
knobs, nothing more.

> 
> It exposes the following details which IMHO should be hidden or
> configurable in a way that allows moving to a whole new mechanism
> which will have significantly different characteristics in the
> future:
> 
> It exposes that:
> 
> - filtering uses a copy to a temporary buffer, and
> - that this copy is enabled by default.
> 
> Once exposed, those design constraints become immutable due to ABI.

No it is not. There is no such thing as "immutable ABI". The rule is
"don't break user space" If this functionality in the kernel goes away,
the knob could become a nop, and I doubt any user space will break
because of it.

That is, the only burden is keeping this option exposed. But it could
be just like that light switch that has nothing connected to it. It's
still there, but does nothing if you switch it. This knob can act the
same way. This does not in anyway prevent future innovation.

> 
> > 
> > The option is just a way to say "I don't want to do the copy into the
> > buffer, I want to go directly into it"  
> 
> My main concern is how this concept, once exposed to userspace,
> becomes not only an internal implementation detail, but a fundamental
> part of the design which cannot ever go away without breaking the ABI
> or making parts of the ABI irrelevant.

Again, it's not about breaking ABI. Linus himself stated that Linux
constantly breaks ABI. As long as breaking ABI doesn't break user
space, it's OK. I've broken tracing ABI several times over the last
decade, and nobody complained. It's *how" you break it. It's similar to
the tree falling in the forest. If you break ABI and no user space
application notices, did you really break ABI?

> 
> I can make a parallel with the scheduler: this is as if the sched
> feature knobs (which are there only for development/debugging purposes)
> would all be exposed as userspace ABI. This would seriously
> limit the evolution of the scheduler design in the future. I see this
> as the same problem at the ring buffer level.

Heh, I mentioned sched before reading this. Again, it's whether or not
userspace can notice if behavior changes or not. If it can't notice, it
didn't break.

> 
> >   
> >>
> >> I don't care about the implementation, I care about the ABI, and
> >> I feel that your reply is not addressing my concern at all.  
> > 
> > Maybe I don't understand your concern.
> > 
> > It's an on/off switch (like all options are). This is just a way to say
> > "I want to indirect copying of the event before filtering or not".  
> 
> Not all tracefs options are booleans. The "current_tracer" file ABI

"current_trace" is not an option. Look in /sys/kernel/tracing/options.
They are all boolean.

> exposes a string input/output parameter. I would recommend the same
> for the equivalent of a "current_filter" file.

I still do not see the benefit. The "current_filter" would expose just
as much implementation that I can see. This option is just a knob to
turn it on or off. Most options, nobody knows about, and are seldom
used by anyone but me ;-)

Once you put a file in the main directory, it become much more likely
to be used.

> 
> > 
> > The "input-argument" part above may never happen. What's different
> > between tracefs and LTTng, is that all events are created by the
> > subsystem not by me. You don't use the TRACE_EVENT() macro, but you
> > need to manually create each event you care about yourself. It's more
> > of a burden but you also then have the luxury of hooking to the input
> > portion. That is not exposed, and that is by design. As that could
> > possibly make all tracepoints an ABI, and you'll have people like
> > peterz nacking any new tracepoint in the scheduler. He doesn't allow
> > trace events anymore because of that exposure.  
> 
> I'm not arguing for moving to the input-argument scheme, I just used
> this as an hypothetical example to show why we should not expose
> internal implementation details to userspace which will prevent future
> evolution only for the sake of having debugging knobs.

Again, this is just a knob in the options directory, and not a full-fledged
feature. Options are not always guaranteed to be there,
depending on the config. There's some options that are even created by
what tracer is in "current_tracer".

I added this mainly to be able to add a kselftest to stress test the
discard code. If it would make you feel better, I could even add a
config around it to have it only exposed if the config is enabled.

-- Steve



Re: [PATCH] tracing: Add disable-filter-buf option

2023-12-15 Thread Mathieu Desnoyers

On 2023-12-15 12:34, Steven Rostedt wrote:

On Fri, 15 Dec 2023 12:24:14 -0500
Mathieu Desnoyers  wrote:


On 2023-12-15 12:04, Steven Rostedt wrote:

On Fri, 15 Dec 2023 10:53:39 -0500
Mathieu Desnoyers  wrote:

[...]


So rather than stacking tons of "on/off" switches for filter
features, how about you let users express the mechanism they
want to use for filtering with a string instead ? e.g.

filter-method="temp-buffer"
filter-method="ring-buffer"
filter-method="input-arguments"


If I add other ways to filter, it will be a separate file to control
that, but all options are on/off switches. Even if I add other
functionality to the way buffers are created, this will still have the
same functionality to turn the entire thing on or off.


I'll be clearer then: I think this is a bad ABI. In your reply, you justify
this bad ABI by implementation motivations.


What's wrong with a way to stop the copying ?


I am not against exposing an ABI that allows userspace to alter the
filter behavior. I disagree on the way you plan to expose the ABI.

Exposing this option as an ABI in this way exposes too much internal
ring buffer implementation details to userspace.

It exposes the following details which IMHO should be hidden or
configurable in a way that allows moving to a whole new mechanism
which will have significantly different characteristics in the
future:

It exposes that:

- filtering uses a copy to a temporary buffer, and
- that this copy is enabled by default.

Once exposed, those design constraints become immutable due to ABI.



The option is just a way to say "I don't want to do the copy into the
buffer, I want to go directly into it"


My main concern is how this concept, once exposed to userspace,
becomes not only an internal implementation detail, but a fundamental
part of the design which cannot ever go away without breaking the ABI
or making parts of the ABI irrelevant.

I can make a parallel with the scheduler: this is as if the sched
feature knobs (which are there only for development/debugging purposes)
would all be exposed as userspace ABI. This would seriously
limit the evolution of the scheduler design in the future. I see this
as the same problem at the ring buffer level.





I don't care about the implementation, I care about the ABI, and
I feel that your reply is not addressing my concern at all.


Maybe I don't understand your concern.

It's an on/off switch (like all options are). This is just a way to say
"I want to indirect copying of the event before filtering or not".


Not all tracefs options are booleans. The "current_tracer" file ABI
exposes a string input/output parameter. I would recommend the same
for the equivalent of a "current_filter" file.



The "input-argument" part above may never happen. What's different
between tracefs and LTTng, is that all events are created by the
subsystem not by me. You don't use the TRACE_EVENT() macro, but you
need to manually create each event you care about yourself. It's more
of a burden but you also then have the luxury of hooking to the input
portion. That is not exposed, and that is by design. As that could
possibly make all tracepoints an ABI, and you'll have people like
peterz nacking any new tracepoint in the scheduler. He doesn't allow
trace events anymore because of that exposure.


I'm not arguing for moving to the input-argument scheme, I just used
this as an hypothetical example to show why we should not expose
internal implementation details to userspace which will prevent future
evolution only for the sake of having debugging knobs.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




[PATCH v2] tracing: Add filter-buffer option

2023-12-15 Thread Steven Rostedt
From 62a1de0f0f9d942934565e625a7880fd85ae216a Mon Sep 17 00:00:00 2001
From: "Steven Rostedt (Google)" 
Date: Fri, 15 Dec 2023 10:26:33 -0500
Subject: [PATCH] tracing: Add filter-buffer option

Normally, when the filter is enabled, a temporary buffer is created to
copy the event data into it to perform the filtering logic. If the filter
passes and the event should be recorded, then the event is copied from the
temporary buffer into the ring buffer. If the event is to be discarded
then it is simply dropped. If another event comes in via an interrupt, it
will not use the temporary buffer as it is busy and will write directly
into the ring buffer.

The filter-buffer option will allow the user to disable this feature. By
default, it is enabled. When disabled, it disables the temporary buffer
and always writes into the ring buffer. This will avoid the copy when the
event is to be recorded, but also adds a bit more overhead on the discard,
and if another event were to interrupt the event that is to be discarded,
then the event will not be removed from the ring buffer but instead
converted to padding that will not be read by the reader. Padding will
still take up space on the ring buffer.

This option can be beneficial if most events are recorded and not
discarded, or simply for debugging the discard functionality of the ring
buffer.
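
For example, with this patch applied the behavior can be toggled at run time
from the tracefs options directory (usage sketch):

  # write directly into the ring buffer, skipping the temporary filter buffer
  echo 0 > /sys/kernel/tracing/options/filter-buffer

  # restore the default copy-then-filter behavior
  echo 1 > /sys/kernel/tracing/options/filter-buffer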

Also fix some whitespace (that was fixed by editing this in vscode).

Signed-off-by: Steven Rostedt (Google) 
---
Changes since v1: 
https://lore.kernel.org/linux-trace-kernel/20231215102633.7a24c...@rorschach.local.home

- Renamed "disable-filter-buffer" to "filter-buffer" and made it
  default enabled, where the user needs to disable it. (Mathieu Desnoyers)


 Documentation/trace/ftrace.rst | 23 
 kernel/trace/trace.c   | 39 --
 kernel/trace/trace.h   |  1 +
 3 files changed, 47 insertions(+), 16 deletions(-)

diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
index 23572f6697c0..7ec26eb814e9 100644
--- a/Documentation/trace/ftrace.rst
+++ b/Documentation/trace/ftrace.rst
@@ -1239,6 +1239,29 @@ Here are the available options:
When the free_buffer is closed, tracing will
stop (tracing_on set to 0).
 
+  filter-buffer
+   Normally, when the filter is enabled, a temporary buffer is
+   created to copy the event data into it to perform the
+   filtering logic. If the filter passes and the event should
+   be recorded, then the event is copied from the temporary
+   buffer into the ring buffer. If the event is to be discarded
+   then it is simply dropped. If another event comes in via
+   an interrupt, it will not use the temporary buffer as it is
+   busy and will write directly into the ring buffer.
+
+   This option, when cleared, will disable the temporary buffer and always
+   write into the ring buffer. This will avoid the copy when
+   the event is to be recorded, but also adds a bit more
+   overhead on the discard, and if another event were to interrupt
+   the event that is to be discarded, then the event will not
+   be removed from the ring buffer but instead converted to
+   padding that will not be read by the reader. Padding will
+   still take up space on the ring buffer.
+
+   This option can be beneficial if most events are recorded and
+   not discarded, or simply for debugging the discard functionality
+   of the ring buffer.
+
   irq-info
Shows the interrupt, preempt count, need resched data.
When disabled, the trace looks like::
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 55dabee4c78b..e18c83104e24 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -466,7 +466,7 @@ EXPORT_SYMBOL_GPL(unregister_ftrace_export);
 TRACE_ITER_ANNOTATE | TRACE_ITER_CONTEXT_INFO |\
 TRACE_ITER_RECORD_CMD | TRACE_ITER_OVERWRITE | \
 TRACE_ITER_IRQ_INFO | TRACE_ITER_MARKERS | \
-TRACE_ITER_HASH_PTR)
+TRACE_ITER_HASH_PTR | TRACE_ITER_FILTER_BUF)
 
 /* trace_options that are only supported by global_trace */
 #define TOP_LEVEL_TRACE_FLAGS (TRACE_ITER_PRINTK | \
@@ -5398,6 +5398,8 @@ int trace_keep_overwrite(struct tracer *tracer, u32 mask, 
int set)
return 0;
 }
 
+static int __tracing_set_filter_buffering(struct trace_array *tr, bool set);
+
 int set_tracer_flag(struct trace_array *tr, unsigned int mask, int enabled)
 {
int *map;
@@ -5451,6 +5453,9 @@ int set_tracer_flag(struct trace_array *tr, unsigned int 
mask, int enabled)
if (mask == TRACE_ITER_FUNC_FORK)
ftrace_pid_follow_fork(tr, enabled);
 
+   if (mask == TRACE_ITER_FILTER_BUF)
+   __tracing_set_filter_buffering(tr, !enabled);
+
if (mask == TRACE_ITER_OVERWRITE) {

Re: [PATCH vhost v2 4/8] vdpa/mlx5: Mark vq addrs for modification in hw vq

2023-12-15 Thread Eugenio Perez Martin
On Fri, Dec 15, 2023 at 3:13 PM Dragos Tatulea  wrote:
>
> On Fri, 2023-12-15 at 12:35 +, Dragos Tatulea wrote:
> > On Thu, 2023-12-14 at 19:30 +0100, Eugenio Perez Martin wrote:
> > > On Thu, Dec 14, 2023 at 4:51 PM Dragos Tatulea  
> > > wrote:
> > > >
> > > > On Thu, 2023-12-14 at 08:45 -0500, Michael S. Tsirkin wrote:
> > > > > On Thu, Dec 14, 2023 at 01:39:55PM +, Dragos Tatulea wrote:
> > > > > > On Tue, 2023-12-12 at 15:44 -0800, Si-Wei Liu wrote:
> > > > > > >
> > > > > > > On 12/12/2023 11:21 AM, Eugenio Perez Martin wrote:
> > > > > > > > On Tue, Dec 5, 2023 at 11:46 AM Dragos Tatulea 
> > > > > > > >  wrote:
> > > > > > > > > Addresses get set by .set_vq_address. hw vq addresses will be 
> > > > > > > > > updated on
> > > > > > > > > next modify_virtqueue.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Dragos Tatulea 
> > > > > > > > > Reviewed-by: Gal Pressman 
> > > > > > > > > Acked-by: Eugenio Pérez 
> > > > > > > > I'm kind of ok with this patch and the next one about state, 
> > > > > > > > but I
> > > > > > > > didn't ack them in the previous series.
> > > > > > > >
> > > > > > > > My main concern is that it is not valid to change the vq 
> > > > > > > > address after
> > > > > > > > DRIVER_OK in VirtIO, which vDPA follows. Only memory maps are 
> > > > > > > > ok to
> > > > > > > > change at this moment. I'm not sure about vq state in vDPA, but 
> > > > > > > > vhost
> > > > > > > > forbids changing it with an active backend.
> > > > > > > >
> > > > > > > > Suspend is not defined in VirtIO at this moment though, so 
> > > > > > > > maybe it is
> > > > > > > > ok to decide that all of these parameters may change during 
> > > > > > > > suspend.
> > > > > > > > Maybe the best thing is to protect this with a vDPA feature 
> > > > > > > > flag.
> > > > > > > I think protect with vDPA feature flag could work, while on the 
> > > > > > > other
> > > > > > > hand vDPA means vendor specific optimization is possible around 
> > > > > > > suspend
> > > > > > > and resume (in case it helps performance), which doesn't have to 
> > > > > > > be
> > > > > > > backed by virtio spec. Same applies to vhost user backend 
> > > > > > > features,
> > > > > > > variations there were not backed by spec either. Of course, we 
> > > > > > > should
> > > > > > > try best to make the default behavior backward compatible with
> > > > > > > virtio-based backend, but that circles back to no suspend 
> > > > > > > definition in
> > > > > > > the current virtio spec, for which I hope we don't cease 
> > > > > > > development on
> > > > > > > vDPA indefinitely. After all, the virtio based vdap backend can 
> > > > > > > well
> > > > > > > define its own feature flag to describe (minor difference in) the
> > > > > > > suspend behavior based on the later spec once it is formed in 
> > > > > > > future.
> > > > > > >
> > > > > > So what is the way forward here? From what I understand the options 
> > > > > > are:
> > > > > >
> > > > > > 1) Add a vdpa feature flag for changing device properties while 
> > > > > > suspended.
> > > > > >
> > > > > > 2) Drop these 2 patches from the series for now. Not sure if this 
> > > > > > makes sense as
> > > > > > this. But then Si-Wei's qemu device suspend/resume poc [0] that 
> > > > > > exercises this
> > > > > > code won't work anymore. This means the series would be less well 
> > > > > > tested.
> > > > > >
> > > > > > Are there other possible options? What do you think?
> > > > > >
> > > > > > [0] https://github.com/siwliu-kernel/qemu/tree/svq-resume-wip
> > > > >
> > > > > I am fine with either of these.
> > > > >
> > > > How about allowing the change only under the following conditions:
> > > >   vhost_vdpa_can_suspend && vhost_vdpa_can_resume &&
> > > > VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK is set
> > > >
> > > > ?
> > >
> > > I think the best option by far is 1, as there is no hint in the
> > > combination of these 3 indicating that you can change device
> > > properties in the suspended state.
> > >
> > Sure. Will respin a v3 without these two patches.
> >
> > Another series can implement option 2 and add these 2 patches on top.
> Hmm...I misunderstood your statement and sent an erroneous v3. You said that
> having a feature flag is the best option.
>
> Will add a feature flag in v4: is this similar to the
> VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK flag?
>

Right, it should be easy to return it from .get_backend_features op if
the FW returns that capability, isn't it?
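
Something along these lines, I mean (sketch only; the capability check and
the new feature bit name below are made up for illustration and would need to
be defined in the vhost uapi first):

	static u64 mlx5_vdpa_get_backend_features(const struct vdpa_device *vdpa)
	{
		u64 features = BIT_ULL(VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK);

		/* hypothetical FW capability check and flag name */
		if (mlx5_vdpa_fw_can_modify_vq_in_suspend(vdpa))
			features |= BIT_ULL(VHOST_BACKEND_F_CHANGEABLE_VQ_STATE_IN_SUSPEND);

		return features;
	}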

> Thanks,
> Dragos
>
> > > > Thanks,
> > > > Dragos
> > > >
> > > > > > Thanks,
> > > > > > Dragos
> > > > > >
> > > > > > > Regards,
> > > > > > > -Siwei
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > Jason, what do you think?
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > >
> > > > > > > > > ---
> > > > > > > > >   drivers/vdpa/mlx5/net/mlx5_vnet.c  | 9 +
> > > > > > > > >   include/linux/mlx5/mlx5_ifc_vdpa.h | 1 +
> > > > > > > > >   2 files changed, 10 

[PATCH v4 15/15] tracing: Update subbuffer with kilobytes not page order

2023-12-15 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Using page order for deciding what the size of the ring buffer sub buffers
are is exposing a bit too much of the implementation. Although the sub
buffers are only allocated in orders of pages, allow the user to specify
the minimum size of each sub-buffer via kilobytes like they can with the
buffer size itself.

If the user specifies 3 via:

  echo 3 > buffer_subbuf_size_kb

Then the sub-buffer size will round up to 4kb (on a 4kb page size system).

If they specify:

  echo 6 > buffer_subbuf_size_kb

The sub-buffer size will become 8kb.

and so on.
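
Internally the kilobyte value just gets rounded up to a whole number of pages
and then to a power-of-two page order, roughly like this (sketch, not the
exact code in this patch; assumes a 4K PAGE_SIZE):

	pages = DIV_ROUND_UP(size_kb * 1024, PAGE_SIZE);
	order = pages > 1 ? fls(pages - 1) : 0;
	/* e.g. 3 KB -> 1 page -> order 0 (4 KB), 6 KB -> 2 pages -> order 1 (8 KB) */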

Signed-off-by: Steven Rostedt (Google) 
---
 Documentation/trace/ftrace.rst| 46 ---
 kernel/trace/trace.c  | 38 +--
 ...fer_order.tc => ringbuffer_subbuf_size.tc} | 18 
 3 files changed, 54 insertions(+), 48 deletions(-)
 rename tools/testing/selftests/ftrace/test.d/00basic/{ringbuffer_order.tc => 
ringbuffer_subbuf_size.tc} (85%)

diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
index 765d2ebff991..e4f0c7853449 100644
--- a/Documentation/trace/ftrace.rst
+++ b/Documentation/trace/ftrace.rst
@@ -203,32 +203,26 @@ of ftrace. Here is a list of some of the key files:
 
This displays the total combined size of all the trace buffers.
 
-  buffer_subbuf_order:
-
-   This sets or displays the sub buffer page size order. The ring buffer
-   is broken up into several same size "sub buffers". An event can not be
-   bigger than the size of the sub buffer. Normally, the sub buffer is
-   the size of the architecture's page (4K on x86). The sub buffer also
-   contains meta data at the start which also limits the size of an event.
-   That means when the sub buffer is a page size, no event can be larger
-   than the page size minus the sub buffer meta data.
-
-   The buffer_subbuf_order allows the user to change the size of the sub
-   buffer. As the sub buffer is a set of pages by the power of 2, thus
-   the sub buffer total size is defined by the order:
-
-   order   size
-   
-   0   PAGE_SIZE
-   1   PAGE_SIZE * 2
-   2   PAGE_SIZE * 4
-   3   PAGE_SIZE * 8
-
-   Changing the order will change the sub buffer size allowing for events
-   to be larger than the page size.
-
-   Note: When changing the order, tracing is stopped and any data in the
-   ring buffer and the snapshot buffer will be discarded.
+  buffer_subbuf_size_kb:
+
+   This sets or displays the sub buffer size. The ring buffer is broken up
+   into several same size "sub buffers". An event can not be bigger than
+   the size of the sub buffer. Normally, the sub buffer is the size of the
+   architecture's page (4K on x86). The sub buffer also contains meta data
+   at the start which also limits the size of an event.  That means when
+   the sub buffer is a page size, no event can be larger than the page
+   size minus the sub buffer meta data.
+
+   Note, the buffer_subbuf_size_kb is a way for the user to specify the
+   minimum size of the subbuffer. The kernel may make it bigger due to the
+   implementation details, or simply fail the operation if the kernel can
+   not handle the request.
+
+   Changing the sub buffer size allows for events to be larger than the
+   page size.
+
+   Note: When changing the sub-buffer size, tracing is stopped and any
+   data in the ring buffer and the snapshot buffer will be discarded.
 
   free_buffer:
 
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index b4626c9223a5..52310e3c251f 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -9391,42 +9391,54 @@ static const struct file_operations buffer_percent_fops 
= {
 };
 
 static ssize_t
-buffer_order_read(struct file *filp, char __user *ubuf, size_t cnt, loff_t 
*ppos)
+buffer_subbuf_size_read(struct file *filp, char __user *ubuf, size_t cnt, 
loff_t *ppos)
 {
struct trace_array *tr = filp->private_data;
+   size_t size;
char buf[64];
+   int order;
int r;
 
-   r = sprintf(buf, "%d\n", 
ring_buffer_subbuf_order_get(tr->array_buffer.buffer));
+   order = ring_buffer_subbuf_order_get(tr->array_buffer.buffer);
+   size = (PAGE_SIZE << order) / 1024;
+
+   r = sprintf(buf, "%zd\n", size);
 
return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
 }
 
 static ssize_t
-buffer_order_write(struct file *filp, const char __user *ubuf,
-  size_t cnt, loff_t *ppos)
+buffer_subbuf_size_write(struct file *filp, const char __user *ubuf,
+size_t cnt, loff_t *ppos)
 {
struct trace_array *tr = filp->private_data;
unsigned long val;
int old_order;
+   int order;
+   int pages;
int ret;
 
ret = 

[PATCH v4 14/15] ringbuffer/selftest: Add basic selftest to test changing subbuf order

2023-12-15 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Add a self test that will write into the trace buffer with differ trace
sub buffer order sizes.

Signed-off-by: Steven Rostedt (Google) 
---
 .../ftrace/test.d/00basic/ringbuffer_order.tc | 95 +++
 1 file changed, 95 insertions(+)
 create mode 100644 
tools/testing/selftests/ftrace/test.d/00basic/ringbuffer_order.tc

diff --git a/tools/testing/selftests/ftrace/test.d/00basic/ringbuffer_order.tc 
b/tools/testing/selftests/ftrace/test.d/00basic/ringbuffer_order.tc
new file mode 100644
index ..ecbcc810e6c1
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/00basic/ringbuffer_order.tc
@@ -0,0 +1,95 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Change the ringbuffer sub-buffer order
+# requires: buffer_subbuf_order
+# flags: instance
+
+get_buffer_data_size() {
+   sed -ne 's/^.*data.*size:\([0-9][0-9]*\).*/\1/p' events/header_page
+}
+
+get_buffer_data_offset() {
+   sed -ne 's/^.*data.*offset:\([0-9][0-9]*\).*/\1/p' events/header_page
+}
+
+get_event_header_size() {
+   type_len=`sed -ne 's/^.*type_len.*:[^0-9]*\([0-9][0-9]*\).*/\1/p' 
events/header_event`
+   time_len=`sed -ne 's/^.*time_delta.*:[^0-9]*\([0-9][0-9]*\).*/\1/p' 
events/header_event`
+   array_len=`sed -ne 's/^.*array.*:[^0-9]*\([0-9][0-9]*\).*/\1/p' 
events/header_event`
+   total_bits=$((type_len+time_len+array_len))
+   total_bits=$((total_bits+7))
+   echo $((total_bits/8))
+}
+
+get_print_event_buf_offset() {
+   sed -ne 's/^.*buf.*offset:\([0-9][0-9]*\).*/\1/p' 
events/ftrace/print/format
+}
+
+event_header_size=`get_event_header_size`
+print_header_size=`get_print_event_buf_offset`
+
+data_offset=`get_buffer_data_offset`
+
+marker_meta=$((event_header_size+print_header_size))
+
+make_str() {
+cnt=$1
+   printf -- 'X%.0s' $(seq $cnt)
+}
+
+write_buffer() {
+   size=$1
+
+   str=`make_str $size`
+
+   # clear the buffer
+   echo > trace
+
+   # write the string into the marker
+   echo $str > trace_marker
+
+   echo $str
+}
+
+test_buffer() {
+   orde=$1
+   page_size=$((4096< buffer_subbuf_order
+   test_buffer $a
+done
+
+echo $ORIG > buffer_subbuf_order
+
-- 
2.42.0





[PATCH v4 13/15] ring-buffer: Add documentation on the buffer_subbuf_order file

2023-12-15 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Add documentation on how to use the buffer_subbuf_order file to change
the sub buffer size and how it affects what events can be added to the
ring buffer.

Signed-off-by: Steven Rostedt (Google) 
---
 Documentation/trace/ftrace.rst | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
index 7fe96da34962..765d2ebff991 100644
--- a/Documentation/trace/ftrace.rst
+++ b/Documentation/trace/ftrace.rst
@@ -203,6 +203,33 @@ of ftrace. Here is a list of some of the key files:
 
This displays the total combined size of all the trace buffers.
 
+  buffer_subbuf_order:
+
+   This sets or displays the sub buffer page size order. The ring buffer
+   is broken up into several same size "sub buffers". An event can not be
+   bigger than the size of the sub buffer. Normally, the sub buffer is
+   the size of the architecture's page (4K on x86). The sub buffer also
+   contains meta data at the start which also limits the size of an event.
+   That means when the sub buffer is a page size, no event can be larger
+   than the page size minus the sub buffer meta data.
+
+   The buffer_subbuf_order file allows the user to change the size of the
+   sub buffer. As the sub buffer is a power-of-two number of pages, the
+   total sub buffer size is defined by the order:
+
+   order   size
+   
+   0   PAGE_SIZE
+   1   PAGE_SIZE * 2
+   2   PAGE_SIZE * 4
+   3   PAGE_SIZE * 8
+
+   Changing the order will change the sub buffer size allowing for events
+   to be larger than the page size.
+
+   Note: When changing the order, tracing is stopped and any data in the
+   ring buffer and the snapshot buffer will be discarded.
+
   free_buffer:
 
If a process is performing tracing, and the ring buffer should be
-- 
2.42.0





[PATCH v4 12/15] ring-buffer: Just update the subbuffers when changing their allocation order

2023-12-15 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The ring_buffer_subbuf_order_set() was creating ring_buffer_per_cpu
cpu_buffers with new sub buffers using the updated order, and if they were
all successfully created, the ring_buffer's per_cpu buffers would be freed
and replaced by them.

The problem is that the freed per_cpu buffers contain state that would be
lost. Running the following commands:

1. # echo 3 > /sys/kernel/tracing/buffer_subbuf_order
2. # echo 0 > /sys/kernel/tracing/tracing_cpumask
3. # echo 1 > /sys/kernel/tracing/snapshot
4. # echo ff > /sys/kernel/tracing/tracing_cpumask
5. # echo test > /sys/kernel/tracing/trace_marker

Would result in:

 -bash: echo: write error: Bad file descriptor

That's because the state of the per_cpu buffers of the snapshot buffer is
lost when the order is changed (the order of a freed snapshot buffer goes
to 0 to save memory, and when the snapshot buffer is allocated again, it
goes back to what the main buffer is).

In operation 2, the snapshot buffers were set to "disable" (as all the
ring buffers CPUs were disabled).

In operation 3, the snapshot is allocated and a call to
ring_buffer_subbuf_order_set() replaced the per_cpu buffers losing the
"record_disable" count.

When it was enabled again, the atomic_dec(&cpu_buffer->record_disable) was
decrementing a zero, setting it to -1. Writing 1 into the snapshot would
swap the snapshot buffer with the main buffer, so now the main buffer is
"disabled", and nothing can write to the ring buffer anymore.

Instead of creating new per_cpu buffers and losing the state of the old
buffers, basically do what the resize does: just allocate new subbuf pages
into the new_pages linked list of the per_cpu buffer and, if they all
succeed, replace the old sub buffers with the new ones. This keeps the
per_cpu buffer descriptor intact and, by doing so, keeps its state.

Fixes: TBD ("ring-buffer: Set new size of the ring buffer sub page")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 88 ++
 1 file changed, 71 insertions(+), 17 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index a93dbbcea83d..9b95297339b6 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5789,11 +5789,11 @@ EXPORT_SYMBOL_GPL(ring_buffer_subbuf_order_get);
  */
 int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order)
 {
-   struct ring_buffer_per_cpu **cpu_buffers;
+   struct ring_buffer_per_cpu *cpu_buffer;
+   struct buffer_page *bpage, *tmp;
int old_order, old_size;
int nr_pages;
int psize;
-   int bsize;
int err;
int cpu;
 
@@ -5807,11 +5807,6 @@ int ring_buffer_subbuf_order_set(struct trace_buffer 
*buffer, int order)
if (psize <= BUF_PAGE_HDR_SIZE)
return -EINVAL;
 
-   bsize = sizeof(void *) * buffer->cpus;
-   cpu_buffers = kzalloc(bsize, GFP_KERNEL);
-   if (!cpu_buffers)
-   return -ENOMEM;
-
old_order = buffer->subbuf_order;
old_size = buffer->subbuf_size;
 
@@ -5827,33 +5822,88 @@ int ring_buffer_subbuf_order_set(struct trace_buffer 
*buffer, int order)
 
/* Make sure all new buffers are allocated, before deleting the old 
ones */
for_each_buffer_cpu(buffer, cpu) {
+
if (!cpumask_test_cpu(cpu, buffer->cpumask))
continue;
 
+   cpu_buffer = buffer->buffers[cpu];
+
/* Update the number of pages to match the new size */
nr_pages = old_size * buffer->buffers[cpu]->nr_pages;
nr_pages = DIV_ROUND_UP(nr_pages, buffer->subbuf_size);
 
-   cpu_buffers[cpu] = rb_allocate_cpu_buffer(buffer, nr_pages, 
cpu);
-   if (!cpu_buffers[cpu]) {
+   /* we need a minimum of two pages */
+   if (nr_pages < 2)
+   nr_pages = 2;
+
+   cpu_buffer->nr_pages_to_update = nr_pages;
+
+   /* Include the reader page */
+   nr_pages++;
+
+   /* Allocate the new size buffer */
+   INIT_LIST_HEAD(_buffer->new_pages);
+   if (__rb_allocate_pages(cpu_buffer, nr_pages,
+   _buffer->new_pages)) {
+   /* not enough memory for new pages */
err = -ENOMEM;
goto error;
}
}
 
for_each_buffer_cpu(buffer, cpu) {
+
if (!cpumask_test_cpu(cpu, buffer->cpumask))
continue;
 
-   rb_free_cpu_buffer(buffer->buffers[cpu]);
-   buffer->buffers[cpu] = cpu_buffers[cpu];
+   cpu_buffer = buffer->buffers[cpu];
+
+   /* Clear the head bit to make the link list normal to read */
+   rb_head_page_deactivate(cpu_buffer);
+
+   /* Now walk the list and free all the 

[PATCH v4 11/15] ring-buffer: Keep the same size when updating the order

2023-12-15 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The function ring_buffer_subbuf_order_set() just updated the sub-buffers
to the new size, but in doing so it also changed the total size of the
buffer, as that size is determined by nr_pages * subbuf_size. If the
subbuf_size is increased without decreasing nr_pages, the total size of
the buffer increases.

This broke the latency tracers as the snapshot needs to be the same size
as the main buffer. The size of the snapshot buffer is only expanded when
needed, and because the order is still the same, the size becomes out of
sync with the main buffer, as the main buffer increased in size without
the tracing system knowing.

Calculate the nr_pages to allocate with the new subbuf_size to be
buffer_size / new_subbuf_size.
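
For example, a per-CPU buffer with 1024 sub buffers at order 0 (4K sub
buffers, 4MB total) that is switched to order 3 (32K sub buffers) should
end up with roughly 4096 * 1024 / 32768 = 128 sub buffers, keeping the
total size at 4MB (the exact calculation uses the sub buffer data size,
which excludes the page header).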

Fixes: TBD ("ring-buffer: Set new size of the ring buffer sub page")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 695b64fbc1cb..a93dbbcea83d 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5830,7 +5830,10 @@ int ring_buffer_subbuf_order_set(struct trace_buffer 
*buffer, int order)
if (!cpumask_test_cpu(cpu, buffer->cpumask))
continue;
 
-   nr_pages = buffer->buffers[cpu]->nr_pages;
+   /* Update the number of pages to match the new size */
+   nr_pages = old_size * buffer->buffers[cpu]->nr_pages;
+   nr_pages = DIV_ROUND_UP(nr_pages, buffer->subbuf_size);
+
cpu_buffers[cpu] = rb_allocate_cpu_buffer(buffer, nr_pages, 
cpu);
if (!cpu_buffers[cpu]) {
err = -ENOMEM;
-- 
2.42.0





[PATCH v4 09/15] tracing: Update snapshot order along with main buffer order

2023-12-15 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

When updating the order of the sub buffers for the main buffer, make sure
that if the snapshot buffer exists, it gets its order updated as well.
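
For example, this is meant to keep sequences like the following working
(a sketch; paths are relative to /sys/kernel/tracing and it assumes a
latency tracer such as wakeup is available):

  echo 3 > buffer_subbuf_order   # main buffer order becomes 3
  echo 1 > snapshot              # snapshot buffer is allocated with order 3 too
  echo wakeup > current_tracer   # the tracer can now swap the two buffers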

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace.c | 45 ++--
 1 file changed, 43 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index c148797afe6d..836514e566c7 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1263,10 +1263,17 @@ static void set_buffer_entries(struct array_buffer 
*buf, unsigned long val);
 
 int tracing_alloc_snapshot_instance(struct trace_array *tr)
 {
+   int order;
int ret;
 
if (!tr->allocated_snapshot) {
 
+   /* Make the snapshot buffer have the same order as main buffer 
*/
+   order = ring_buffer_subbuf_order_get(tr->array_buffer.buffer);
+   ret = ring_buffer_subbuf_order_set(tr->max_buffer.buffer, 
order);
+   if (ret < 0)
+   return ret;
+
/* allocate spare buffer */
ret = resize_buffer_duplicate_size(>max_buffer,
   >array_buffer, RING_BUFFER_ALL_CPUS);
@@ -1286,6 +1293,7 @@ static void free_snapshot(struct trace_array *tr)
 * The max_tr ring buffer has some state (e.g. ring->clock) and
 * we want preserve it.
 */
+   ring_buffer_subbuf_order_set(tr->max_buffer.buffer, 0);
ring_buffer_resize(tr->max_buffer.buffer, 1, RING_BUFFER_ALL_CPUS);
set_buffer_entries(>max_buffer, 1);
tracing_reset_online_cpus(>max_buffer);
@@ -9400,6 +9408,7 @@ buffer_order_write(struct file *filp, const char __user 
*ubuf,
 {
struct trace_array *tr = filp->private_data;
unsigned long val;
+   int old_order;
int ret;
 
ret = kstrtoul_from_user(ubuf, cnt, 10, );
@@ -9410,12 +9419,44 @@ buffer_order_write(struct file *filp, const char __user 
*ubuf,
if (val < 0 || val > 7)
return -EINVAL;
 
+   old_order = ring_buffer_subbuf_order_get(tr->array_buffer.buffer);
+   if (old_order == val)
+   return 0;
+
ret = ring_buffer_subbuf_order_set(tr->array_buffer.buffer, val);
if (ret)
-   return ret;
+   return 0;
 
-   (*ppos)++;
+#ifdef CONFIG_TRACER_MAX_TRACE
+
+   if (!tr->allocated_snapshot)
+   goto out_max;
 
+   ret = ring_buffer_subbuf_order_set(tr->max_buffer.buffer, val);
+   if (ret) {
+   /* Put back the old order */
+   cnt = ring_buffer_subbuf_order_set(tr->array_buffer.buffer, 
old_order);
+   if (WARN_ON_ONCE(cnt)) {
+   /*
+* AARGH! We are left with different orders!
+* The max buffer is our "snapshot" buffer.
+* When a tracer needs a snapshot (one of the
+* latency tracers), it swaps the max buffer
+* with the saved snap shot. We succeeded to
+* update the order of the main buffer, but failed to
+* update the order of the max buffer. But when we tried
+* to reset the main buffer to the original size, we
+* failed there too. This is very unlikely to
+* happen, but if it does, warn and kill all
+* tracing.
+*/
+   tracing_disabled = 1;
+   }
+   return ret;
+   }
+ out_max:
+#endif
+   (*ppos)++;
return cnt;
 }
 
-- 
2.42.0





[PATCH v4 10/15] tracing: Stop the tracing while changing the ring buffer subbuf size

2023-12-15 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The main buffer and the snapshot buffer need to have the same sub buffer
size for some tracers, otherwise the update will fail and disable all
tracing. Stop the tracers while updating the sub buffer sizes so that they
always see the main and snapshot buffers with the same sub buffer size.

Fixes: TBD ("ring-buffer: Add interface for configuring trace sub buffer size")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 836514e566c7..b4626c9223a5 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -9419,13 +9419,16 @@ buffer_order_write(struct file *filp, const char __user 
*ubuf,
if (val < 0 || val > 7)
return -EINVAL;
 
+   /* Do not allow tracing while changing the order of the ring buffer */
+   tracing_stop_tr(tr);
+
old_order = ring_buffer_subbuf_order_get(tr->array_buffer.buffer);
if (old_order == val)
-   return 0;
+   goto out;
 
ret = ring_buffer_subbuf_order_set(tr->array_buffer.buffer, val);
if (ret)
-   return 0;
+   goto out;
 
 #ifdef CONFIG_TRACER_MAX_TRACE
 
@@ -9452,11 +9455,15 @@ buffer_order_write(struct file *filp, const char __user 
*ubuf,
 */
tracing_disabled = 1;
}
-   return ret;
+   goto out;
}
  out_max:
 #endif
(*ppos)++;
+ out:
+   if (ret)
+   cnt = ret;
+   tracing_start_tr(tr);
return cnt;
 }
 
-- 
2.42.0





[PATCH v4 08/15] ring-buffer: Make sure the spare sub buffer used for reads has same size

2023-12-15 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Now that the ring buffer specifies the size of its sub buffers, they all
need to be the same size. When doing a read, a swap is done with a spare
page. Make sure the spare page is the same size as the sub buffers before
doing the swap, otherwise the read will fail.
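
Roughly the situation this guards against (a sketch, run from
/sys/kernel/tracing):

  cat per_cpu/cpu0/trace_pipe_raw > /dev/null &  # reader allocates a spare sub buffer
  echo 3 > buffer_subbuf_order                   # sub buffer size changes underneath it

After the order change the old spare no longer matches the sub buffer
size, so it is freed and a new one is allocated on the next read.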

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index d4d4e7c7357f..c148797afe6d 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -7589,6 +7589,7 @@ struct ftrace_buffer_info {
struct trace_iterator   iter;
void*spare;
unsigned intspare_cpu;
+   unsigned intspare_size;
unsigned intread;
 };
 
@@ -8308,6 +8309,15 @@ tracing_buffers_read(struct file *filp, char __user 
*ubuf,
 
page_size = ring_buffer_subbuf_size_get(iter->array_buffer->buffer);
 
+   /* Make sure the spare matches the current sub buffer size */
+   if (info->spare) {
+   if (page_size != info->spare_size) {
+   ring_buffer_free_read_page(iter->array_buffer->buffer,
+  info->spare_cpu, 
info->spare);
+   info->spare = NULL;
+   }
+   }
+
if (!info->spare) {
info->spare = 
ring_buffer_alloc_read_page(iter->array_buffer->buffer,
  iter->cpu_file);
@@ -8316,6 +8326,7 @@ tracing_buffers_read(struct file *filp, char __user *ubuf,
info->spare = NULL;
} else {
info->spare_cpu = iter->cpu_file;
+   info->spare_size = page_size;
}
}
if (!info->spare)
-- 
2.42.0





[PATCH v4 06/15] ring-buffer: Clear pages on error in ring_buffer_subbuf_order_set() failure

2023-12-15 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

On failure to allocate ring buffer pages, the pointer to the CPU buffer
pages is freed, but the pages that were allocated previously are not.
Make sure they are freed too.

Fixes: TBD ("tracing: Set new size of the ring buffer sub page")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 57b9d3f5f32e..dd03887a4737 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5860,6 +5860,7 @@ int ring_buffer_subbuf_order_set(struct trace_buffer 
*buffer, int order)
for_each_buffer_cpu(buffer, cpu) {
if (!cpu_buffers[cpu])
continue;
+   rb_free_cpu_buffer(cpu_buffers[cpu]);
kfree(cpu_buffers[cpu]);
}
kfree(cpu_buffers);
-- 
2.42.0





[PATCH v4 07/15] ring-buffer: Do no swap cpu buffers if order is different

2023-12-15 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

As the subbuffer order (subbuffer size) must be the same throughout
the ring buffer, check the order of the buffers that are doing a CPU
buffer swap in ring_buffer_swap_cpu() to make sure they are the same.

If they are not the same, then fail to do the swap, otherwise the ring
buffer will think a CPU buffer has a specific subbuffer size when it
does not.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index dd03887a4737..695b64fbc1cb 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5350,6 +5350,9 @@ int ring_buffer_swap_cpu(struct trace_buffer *buffer_a,
if (cpu_buffer_a->nr_pages != cpu_buffer_b->nr_pages)
goto out;
 
+   if (buffer_a->subbuf_order != buffer_b->subbuf_order)
+   goto out;
+
ret = -EAGAIN;
 
if (atomic_read(_a->record_disabled))
-- 
2.42.0





[PATCH v4 05/15] ring-buffer: Read and write to ring buffers with custom sub buffer size

2023-12-15 Thread Steven Rostedt
From: "Tzvetomir Stoyanov (VMware)" 

As the size of the ring buffer sub page can be changed dynamically,
the logic that reads from and writes to the buffer should be fixed to
take that into account. Some internal ring buffer APIs are changed:
 ring_buffer_alloc_read_page()
 ring_buffer_free_read_page()
 ring_buffer_read_page()
A new API is introduced:
 ring_buffer_read_page_data()

Link: 
https://lore.kernel.org/linux-trace-devel/20211213094825.61876-6-tz.stoya...@gmail.com

Signed-off-by: Tzvetomir Stoyanov (VMware) 
Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/ring_buffer.h  | 11 ++--
 kernel/trace/ring_buffer.c   | 75 
 kernel/trace/ring_buffer_benchmark.c | 10 ++--
 kernel/trace/trace.c | 34 +++--
 4 files changed, 89 insertions(+), 41 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 12573306b889..fa802db216f9 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -192,10 +192,15 @@ bool ring_buffer_time_stamp_abs(struct trace_buffer 
*buffer);
 size_t ring_buffer_nr_pages(struct trace_buffer *buffer, int cpu);
 size_t ring_buffer_nr_dirty_pages(struct trace_buffer *buffer, int cpu);
 
-void *ring_buffer_alloc_read_page(struct trace_buffer *buffer, int cpu);
-void ring_buffer_free_read_page(struct trace_buffer *buffer, int cpu, void 
*data);
-int ring_buffer_read_page(struct trace_buffer *buffer, void **data_page,
+struct buffer_data_read_page;
+struct buffer_data_read_page *
+ring_buffer_alloc_read_page(struct trace_buffer *buffer, int cpu);
+void ring_buffer_free_read_page(struct trace_buffer *buffer, int cpu,
+   struct buffer_data_read_page *page);
+int ring_buffer_read_page(struct trace_buffer *buffer,
+ struct buffer_data_read_page *data_page,
  size_t len, int cpu, int full);
+void *ring_buffer_read_page_data(struct buffer_data_read_page *page);
 
 struct trace_seq;
 
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 27e72cc87daa..57b9d3f5f32e 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -317,6 +317,11 @@ struct buffer_data_page {
unsigned chardata[] RB_ALIGN_DATA;  /* data of buffer page */
 };
 
+struct buffer_data_read_page {
+   unsignedorder;  /* order of the page */
+   struct buffer_data_page *data;  /* actual data, stored in this page */
+};
+
 /*
  * Note, the buffer_page list must be first. The buffer pages
  * are allocated in cache lines, which means that each buffer
@@ -5416,40 +5421,48 @@ EXPORT_SYMBOL_GPL(ring_buffer_swap_cpu);
  * Returns:
  *  The page allocated, or ERR_PTR
  */
-void *ring_buffer_alloc_read_page(struct trace_buffer *buffer, int cpu)
+struct buffer_data_read_page *
+ring_buffer_alloc_read_page(struct trace_buffer *buffer, int cpu)
 {
struct ring_buffer_per_cpu *cpu_buffer;
-   struct buffer_data_page *bpage = NULL;
+   struct buffer_data_read_page *bpage = NULL;
unsigned long flags;
struct page *page;
 
if (!cpumask_test_cpu(cpu, buffer->cpumask))
return ERR_PTR(-ENODEV);
 
+   bpage = kzalloc(sizeof(*bpage), GFP_KERNEL);
+   if (!bpage)
+   return ERR_PTR(-ENOMEM);
+
+   bpage->order = buffer->subbuf_order;
cpu_buffer = buffer->buffers[cpu];
local_irq_save(flags);
arch_spin_lock(_buffer->lock);
 
if (cpu_buffer->free_page) {
-   bpage = cpu_buffer->free_page;
+   bpage->data = cpu_buffer->free_page;
cpu_buffer->free_page = NULL;
}
 
arch_spin_unlock(_buffer->lock);
local_irq_restore(flags);
 
-   if (bpage)
+   if (bpage->data)
goto out;
 
page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_NORETRY,
cpu_buffer->buffer->subbuf_order);
-   if (!page)
+   if (!page) {
+   kfree(bpage);
return ERR_PTR(-ENOMEM);
+   }
 
-   bpage = page_address(page);
+   bpage->data = page_address(page);
 
  out:
-   rb_init_page(bpage);
+   rb_init_page(bpage->data);
 
return bpage;
 }
@@ -5459,14 +5472,15 @@ EXPORT_SYMBOL_GPL(ring_buffer_alloc_read_page);
  * ring_buffer_free_read_page - free an allocated read page
  * @buffer: the buffer the page was allocate for
  * @cpu: the cpu buffer the page came from
- * @data: the page to free
+ * @page: the page to free
  *
  * Free a page allocated from ring_buffer_alloc_read_page.
  */
-void ring_buffer_free_read_page(struct trace_buffer *buffer, int cpu, void 
*data)
+void ring_buffer_free_read_page(struct trace_buffer *buffer, int cpu,
+   struct buffer_data_read_page *data_page)
 {
struct ring_buffer_per_cpu *cpu_buffer;
-   struct buffer_data_page *bpage = data;
+   struct 

[PATCH v4 04/15] ring-buffer: Set new size of the ring buffer sub page

2023-12-15 Thread Steven Rostedt
From: "Tzvetomir Stoyanov (VMware)" 

There are two approaches when changing the size of the ring buffer
sub page:
 1. Destroying all pages and allocating new pages with the new size.
 2. Allocating new pages, copying the content of the old pages before
destroying them.
The first approach is easier; it is the one selected in the proposed
implementation. Changing the ring buffer sub page size is not supposed
to happen frequently. Usually, that size should be set only once, when
the buffer is not yet in use and is supposed to be empty.
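
A side effect of this approach is that any data already in the buffer is
lost when the sub page size is changed, for example (a sketch, run from
/sys/kernel/tracing):

  echo hello > trace_marker
  echo 1 > buffer_subbuf_order   # buffer is re-created, existing data is discarded
  cat trace                      # the "hello" marker is gone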

Link: 
https://lore.kernel.org/linux-trace-devel/20211213094825.61876-5-tz.stoya...@gmail.com

Signed-off-by: Tzvetomir Stoyanov (VMware) 
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 80 ++
 1 file changed, 73 insertions(+), 7 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 7774202d7212..27e72cc87daa 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -331,6 +331,7 @@ struct buffer_page {
unsigned read;  /* index for next read */
local_t  entries;   /* entries on this page */
unsigned longreal_end;  /* real end of data */
+   unsigned order; /* order of the page */
struct buffer_data_page *page;  /* Actual data page */
 };
 
@@ -361,7 +362,7 @@ static __always_inline unsigned int rb_page_commit(struct 
buffer_page *bpage)
 
 static void free_buffer_page(struct buffer_page *bpage)
 {
-   free_page((unsigned long)bpage->page);
+   free_pages((unsigned long)bpage->page, bpage->order);
kfree(bpage);
 }
 
@@ -1481,10 +1482,12 @@ static int __rb_allocate_pages(struct 
ring_buffer_per_cpu *cpu_buffer,
 
list_add(>list, pages);
 
-   page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu), mflags, 
0);
+   page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu), mflags,
+   cpu_buffer->buffer->subbuf_order);
if (!page)
goto free_pages;
bpage->page = page_address(page);
+   bpage->order = cpu_buffer->buffer->subbuf_order;
rb_init_page(bpage->page);
 
if (user_thread && fatal_signal_pending(current))
@@ -1563,7 +1566,8 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long 
nr_pages, int cpu)
rb_check_bpage(cpu_buffer, bpage);
 
cpu_buffer->reader_page = bpage;
-   page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, 0);
+
+   page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, 
cpu_buffer->buffer->subbuf_order);
if (!page)
goto fail_free_reader;
bpage->page = page_address(page);
@@ -1647,6 +1651,7 @@ struct trace_buffer *__ring_buffer_alloc(unsigned long 
size, unsigned flags,
goto fail_free_buffer;
 
/* Default buffer page size - one system page */
+   buffer->subbuf_order = 0;
buffer->subbuf_size = PAGE_SIZE - BUF_PAGE_HDR_SIZE;
 
/* Max payload is buffer page size - header (8bytes) */
@@ -5436,8 +5441,8 @@ void *ring_buffer_alloc_read_page(struct trace_buffer 
*buffer, int cpu)
if (bpage)
goto out;
 
-   page = alloc_pages_node(cpu_to_node(cpu),
-   GFP_KERNEL | __GFP_NORETRY, 0);
+   page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_NORETRY,
+   cpu_buffer->buffer->subbuf_order);
if (!page)
return ERR_PTR(-ENOMEM);
 
@@ -5486,7 +5491,7 @@ void ring_buffer_free_read_page(struct trace_buffer 
*buffer, int cpu, void *data
local_irq_restore(flags);
 
  out:
-   free_page((unsigned long)bpage);
+   free_pages((unsigned long)bpage, buffer->subbuf_order);
 }
 EXPORT_SYMBOL_GPL(ring_buffer_free_read_page);
 
@@ -5746,7 +5751,13 @@ EXPORT_SYMBOL_GPL(ring_buffer_subbuf_order_get);
  */
 int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order)
 {
+   struct ring_buffer_per_cpu **cpu_buffers;
+   int old_order, old_size;
+   int nr_pages;
int psize;
+   int bsize;
+   int err;
+   int cpu;
 
if (!buffer || order < 0)
return -EINVAL;
@@ -5758,12 +5769,67 @@ int ring_buffer_subbuf_order_set(struct trace_buffer 
*buffer, int order)
if (psize <= BUF_PAGE_HDR_SIZE)
return -EINVAL;
 
+   bsize = sizeof(void *) * buffer->cpus;
+   cpu_buffers = kzalloc(bsize, GFP_KERNEL);
+   if (!cpu_buffers)
+   return -ENOMEM;
+
+   old_order = buffer->subbuf_order;
+   old_size = buffer->subbuf_size;
+
+   /* prevent another thread from changing buffer sizes */
+   mutex_lock(>mutex);
+   atomic_inc(>record_disabled);
+
+   /* Make sure all commits have finished */
+   synchronize_rcu();
+
buffer->subbuf_order = order;
 

[PATCH v4 03/15] ring-buffer: Add interface for configuring trace sub buffer size

2023-12-15 Thread Steven Rostedt
From: "Tzvetomir Stoyanov (VMware)" 

The trace ring buffer sub page size can be configured per trace
instance. A new ftrace file "buffer_subbuf_order" is added to get and
set the size of the ring buffer sub page for the current trace instance.
The size must be a power-of-two multiple of the system page size, which
is why the new interface works with the system page order instead of an
absolute page size: 0 means the ring buffer sub page is equal to 1
system page, and so forth:
0 - 1 system page
1 - 2 system pages
2 - 4 system pages
...
The ring buffer sub page size is limited between 1 and 128 system
pages. The default value is 1 system page.
New ring buffer APIs are introduced:
 ring_buffer_subbuf_order_set()
 ring_buffer_subbuf_order_get()
 ring_buffer_subbuf_size_get()
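
For example (a sketch; paths are relative to /sys/kernel/tracing, and the
same file appears under instances/<name>/ for each instance):

  cat buffer_subbuf_order        # 0 by default: one system page per sub buffer
  echo 2 > buffer_subbuf_order   # make each sub buffer 4 system pages
  cat buffer_subbuf_order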

Link: 
https://lore.kernel.org/linux-trace-devel/20211213094825.61876-4-tz.stoya...@gmail.com

Signed-off-by: Tzvetomir Stoyanov (VMware) 
Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/ring_buffer.h |  4 ++
 kernel/trace/ring_buffer.c  | 73 +
 kernel/trace/trace.c| 48 
 3 files changed, 125 insertions(+)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index ce46218ce46d..12573306b889 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -202,6 +202,10 @@ struct trace_seq;
 int ring_buffer_print_entry_header(struct trace_seq *s);
 int ring_buffer_print_page_header(struct trace_buffer *buffer, struct 
trace_seq *s);
 
+int ring_buffer_subbuf_order_get(struct trace_buffer *buffer);
+int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order);
+int ring_buffer_subbuf_size_get(struct trace_buffer *buffer);
+
 enum ring_buffer_flags {
RB_FL_OVERWRITE = 1 << 0,
 };
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index b4b13e3244a9..7774202d7212 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -506,6 +506,7 @@ struct trace_buffer {
booltime_stamp_abs;
 
unsigned intsubbuf_size;
+   unsigned intsubbuf_order;
unsigned intmax_data_size;
 };
 
@@ -5694,6 +5695,78 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 }
 EXPORT_SYMBOL_GPL(ring_buffer_read_page);
 
+/**
+ * ring_buffer_subbuf_size_get - get size of the sub buffer.
+ * @buffer: the buffer to get the sub buffer size from
+ *
+ * Returns size of the sub buffer, in bytes.
+ */
+int ring_buffer_subbuf_size_get(struct trace_buffer *buffer)
+{
+   return buffer->subbuf_size + BUF_PAGE_HDR_SIZE;
+}
+EXPORT_SYMBOL_GPL(ring_buffer_subbuf_size_get);
+
+/**
+ * ring_buffer_subbuf_order_get - get order of system sub pages in one buffer 
page.
+ * @buffer: The ring_buffer to get the system sub page order from
+ *
+ * By default, one ring buffer sub page equals to one system page. This 
parameter
+ * is configurable, per ring buffer. The size of the ring buffer sub page can 
be
+ * extended, but must be an order of system page size.
+ *
+ * Returns the order of buffer sub page size, in system pages:
+ * 0 means the sub buffer size is 1 system page and so forth.
+ * In case of an error < 0 is returned.
+ */
+int ring_buffer_subbuf_order_get(struct trace_buffer *buffer)
+{
+   if (!buffer)
+   return -EINVAL;
+
+   return buffer->subbuf_order;
+}
+EXPORT_SYMBOL_GPL(ring_buffer_subbuf_order_get);
+
+/**
+ * ring_buffer_subbuf_order_set - set the size of ring buffer sub page.
+ * @buffer: The ring_buffer to set the new page size.
+ * @order: Order of the system pages in one sub buffer page
+ *
+ * By default, one ring buffer pages equals to one system page. This API can be
+ * used to set new size of the ring buffer page. The size must be order of
+ * system page size, that's why the input parameter @order is the order of
+ * system pages that are allocated for one ring buffer page:
+ *  0 - 1 system page
+ *  1 - 2 system pages
+ *  3 - 4 system pages
+ *  ...
+ *
+ * Returns 0 on success or < 0 in case of an error.
+ */
+int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order)
+{
+   int psize;
+
+   if (!buffer || order < 0)
+   return -EINVAL;
+
+   if (buffer->subbuf_order == order)
+   return 0;
+
+   psize = (1 << order) * PAGE_SIZE;
+   if (psize <= BUF_PAGE_HDR_SIZE)
+   return -EINVAL;
+
+   buffer->subbuf_order = order;
+   buffer->subbuf_size = psize - BUF_PAGE_HDR_SIZE;
+
+   /* Todo: reset the buffer with the new page size */
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(ring_buffer_subbuf_order_set);
+
 /*
  * We only allocate new buffers, never free them if the CPU goes down.
  * If we were to free the buffer, then the user would lose any trace that was 
in
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 45ed9cb5399b..88d7feca0a25 100644
--- a/kernel/trace/trace.c
+++ 

[PATCH v4 02/15] ring-buffer: Page size per ring buffer

2023-12-15 Thread Steven Rostedt
From: "Tzvetomir Stoyanov (VMware)" 

Currently the size of one sub buffer page is global for all buffers and
it is hard coded to one system page. In order to introduce configurable
ring buffer sub page size, the internal logic should be refactored to
work with sub page size per ring buffer.

Link: 
https://lore.kernel.org/linux-trace-devel/20211213094825.61876-3-tz.stoya...@gmail.com

Signed-off-by: Tzvetomir Stoyanov (VMware) 
Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/ring_buffer.h |  2 +-
 kernel/trace/ring_buffer.c  | 68 +
 kernel/trace/trace.c|  2 +-
 kernel/trace/trace.h|  1 +
 kernel/trace/trace_events.c | 59 
 5 files changed, 86 insertions(+), 46 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index b1b03b2c0f08..ce46218ce46d 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -200,7 +200,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer, void 
**data_page,
 struct trace_seq;
 
 int ring_buffer_print_entry_header(struct trace_seq *s);
-int ring_buffer_print_page_header(struct trace_seq *s);
+int ring_buffer_print_page_header(struct trace_buffer *buffer, struct 
trace_seq *s);
 
 enum ring_buffer_flags {
RB_FL_OVERWRITE = 1 << 0,
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 196e3f284723..b4b13e3244a9 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -373,11 +373,6 @@ static inline bool test_time_stamp(u64 delta)
return !!(delta & TS_DELTA_TEST);
 }
 
-#define BUF_PAGE_SIZE (PAGE_SIZE - BUF_PAGE_HDR_SIZE)
-
-/* Max payload is BUF_PAGE_SIZE - header (8bytes) */
-#define BUF_MAX_DATA_SIZE (BUF_PAGE_SIZE - (sizeof(u32) * 2))
-
 struct rb_irq_work {
struct irq_work work;
wait_queue_head_t   waiters;
@@ -509,6 +504,9 @@ struct trace_buffer {
 
struct rb_irq_work  irq_work;
booltime_stamp_abs;
+
+   unsigned intsubbuf_size;
+   unsigned intmax_data_size;
 };
 
 struct ring_buffer_iter {
@@ -522,10 +520,11 @@ struct ring_buffer_iter {
u64 read_stamp;
u64 page_stamp;
struct ring_buffer_event*event;
+   size_t  event_size;
int missed_events;
 };
 
-int ring_buffer_print_page_header(struct trace_seq *s)
+int ring_buffer_print_page_header(struct trace_buffer *buffer, struct 
trace_seq *s)
 {
struct buffer_data_page field;
 
@@ -549,7 +548,7 @@ int ring_buffer_print_page_header(struct trace_seq *s)
trace_seq_printf(s, "\tfield: char data;\t"
 "offset:%u;\tsize:%u;\tsigned:%u;\n",
 (unsigned int)offsetof(typeof(field), data),
-(unsigned int)BUF_PAGE_SIZE,
+(unsigned int)buffer->subbuf_size,
 (unsigned int)is_signed_type(char));
 
return !trace_seq_has_overflowed(s);
@@ -1646,7 +1645,13 @@ struct trace_buffer *__ring_buffer_alloc(unsigned long 
size, unsigned flags,
if (!zalloc_cpumask_var(>cpumask, GFP_KERNEL))
goto fail_free_buffer;
 
-   nr_pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE);
+   /* Default buffer page size - one system page */
+   buffer->subbuf_size = PAGE_SIZE - BUF_PAGE_HDR_SIZE;
+
+   /* Max payload is buffer page size - header (8bytes) */
+   buffer->max_data_size = buffer->subbuf_size - (sizeof(u32) * 2);
+
+   nr_pages = DIV_ROUND_UP(size, buffer->subbuf_size);
buffer->flags = flags;
buffer->clock = trace_clock_local;
buffer->reader_lock_key = key;
@@ -1965,7 +1970,7 @@ static void update_pages_handler(struct work_struct *work)
  * @size: the new size.
  * @cpu_id: the cpu buffer to resize
  *
- * Minimum size is 2 * BUF_PAGE_SIZE.
+ * Minimum size is 2 * buffer->subbuf_size.
  *
  * Returns 0 on success and < 0 on failure.
  */
@@ -1987,7 +1992,7 @@ int ring_buffer_resize(struct trace_buffer *buffer, 
unsigned long size,
!cpumask_test_cpu(cpu_id, buffer->cpumask))
return 0;
 
-   nr_pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE);
+   nr_pages = DIV_ROUND_UP(size, buffer->subbuf_size);
 
/* we need a minimum of two pages */
if (nr_pages < 2)
@@ -2234,7 +2239,7 @@ rb_iter_head_event(struct ring_buffer_iter *iter)
 */
barrier();
 
-   if ((iter->head + length) > commit || length > BUF_PAGE_SIZE)
+   if ((iter->head + length) > commit || length > iter->event_size)
/* Writer corrupted the read? */
goto reset;
 
@@ -2467,6 +2472,7 @@ static inline void
 rb_reset_tail(struct ring_buffer_per_cpu *cpu_buffer,
  unsigned long tail, 

[PATCH v4 01/15] ring-buffer: Refactor ring buffer implementation

2023-12-15 Thread Steven Rostedt
From: "Tzvetomir Stoyanov (VMware)" 

In order to introduce sub-buffer size per ring buffer, some internal
refactoring is needed. As ring_buffer_print_page_header() will depend on
the trace_buffer structure, it is moved after the structure definition.

Link: 
https://lore.kernel.org/linux-trace-devel/20211213094825.61876-2-tz.stoya...@gmail.com

Signed-off-by: Tzvetomir Stoyanov (VMware) 
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 60 +++---
 1 file changed, 30 insertions(+), 30 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 47f9eda99769..196e3f284723 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -378,36 +378,6 @@ static inline bool test_time_stamp(u64 delta)
 /* Max payload is BUF_PAGE_SIZE - header (8bytes) */
 #define BUF_MAX_DATA_SIZE (BUF_PAGE_SIZE - (sizeof(u32) * 2))
 
-int ring_buffer_print_page_header(struct trace_seq *s)
-{
-   struct buffer_data_page field;
-
-   trace_seq_printf(s, "\tfield: u64 timestamp;\t"
-"offset:0;\tsize:%u;\tsigned:%u;\n",
-(unsigned int)sizeof(field.time_stamp),
-(unsigned int)is_signed_type(u64));
-
-   trace_seq_printf(s, "\tfield: local_t commit;\t"
-"offset:%u;\tsize:%u;\tsigned:%u;\n",
-(unsigned int)offsetof(typeof(field), commit),
-(unsigned int)sizeof(field.commit),
-(unsigned int)is_signed_type(long));
-
-   trace_seq_printf(s, "\tfield: int overwrite;\t"
-"offset:%u;\tsize:%u;\tsigned:%u;\n",
-(unsigned int)offsetof(typeof(field), commit),
-1,
-(unsigned int)is_signed_type(long));
-
-   trace_seq_printf(s, "\tfield: char data;\t"
-"offset:%u;\tsize:%u;\tsigned:%u;\n",
-(unsigned int)offsetof(typeof(field), data),
-(unsigned int)BUF_PAGE_SIZE,
-(unsigned int)is_signed_type(char));
-
-   return !trace_seq_has_overflowed(s);
-}
-
 struct rb_irq_work {
struct irq_work work;
wait_queue_head_t   waiters;
@@ -555,6 +525,36 @@ struct ring_buffer_iter {
int missed_events;
 };
 
+int ring_buffer_print_page_header(struct trace_seq *s)
+{
+   struct buffer_data_page field;
+
+   trace_seq_printf(s, "\tfield: u64 timestamp;\t"
+"offset:0;\tsize:%u;\tsigned:%u;\n",
+(unsigned int)sizeof(field.time_stamp),
+(unsigned int)is_signed_type(u64));
+
+   trace_seq_printf(s, "\tfield: local_t commit;\t"
+"offset:%u;\tsize:%u;\tsigned:%u;\n",
+(unsigned int)offsetof(typeof(field), commit),
+(unsigned int)sizeof(field.commit),
+(unsigned int)is_signed_type(long));
+
+   trace_seq_printf(s, "\tfield: int overwrite;\t"
+"offset:%u;\tsize:%u;\tsigned:%u;\n",
+(unsigned int)offsetof(typeof(field), commit),
+1,
+(unsigned int)is_signed_type(long));
+
+   trace_seq_printf(s, "\tfield: char data;\t"
+"offset:%u;\tsize:%u;\tsigned:%u;\n",
+(unsigned int)offsetof(typeof(field), data),
+(unsigned int)BUF_PAGE_SIZE,
+(unsigned int)is_signed_type(char));
+
+   return !trace_seq_has_overflowed(s);
+}
+
 static inline void rb_time_read(rb_time_t *t, u64 *ret)
 {
*ret = local64_read(>time);
-- 
2.42.0





[PATCH v4 00/15] ring-buffer/tracing: Allow ring buffer to have bigger sub buffers

2023-12-15 Thread Steven Rostedt


[
  Nothing changed since v3, just a rebase on trace/core
]

Note, this has been on my todo list since the ring buffer was created back
in 2008.

Tzvetomir last worked on this in 2021 and I need to finally get it in.

His last series was:

  
https://lore.kernel.org/linux-trace-devel/20211213094825.61876-1-tz.stoya...@gmail.com/

With the description of:

   Currently the size of one sub buffer page is global for all buffers and
   it is hard coded to one system page. The patch set introduces configurable
   ring buffer sub page size, per ring buffer. A new user space interface is
   introduced, which allows to change the sub page size of the ftrace buffer,
   per ftrace instance.

I'm pulling in his patches mostly untouched, except that I had to tweak
a few things to forward port them.

The issues I found I added fixes for as the last 7 patches to the series,
then I added documentation and a selftest, and finally changed the UI file
from buffer_subbuf_order to buffer_subbuf_size_kb.

Basically, events to the tracing subsystem are limited to just under a
PAGE_SIZE, as the ring buffer is split into "sub buffers" of one page
size, and an event can not be bigger than a sub buffer. This allows users
to change the size of a sub buffer by the order:

  echo 3 > /sys/kernel/tracing/buffer_subbuf_order

[
  last patch updates this to:

  echo 32 > /sys/kernel/tracing/buffer_subbuf_size_kb

]
  

Will make each sub buffer a size of 8 pages (32KB), allowing events to be
almost as big as 8 pages in size (sub buffers do have meta data on them as
well, keeping an event from reaching the same size as a sub buffer).
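
For example, with 32KB sub buffers a trace_marker write larger than a page
can now fit in a single event (a sketch; with the default 4K sub buffers
the same write would not fit in one event):

  echo 32 > /sys/kernel/tracing/buffer_subbuf_size_kb
  printf 'X%.0s' $(seq 6000) > /sys/kernel/tracing/trace_marker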

Changes since v3: 
https://lore.kernel.org/all/20231213181737.791847...@goodmis.org/

- Rebase on trace/core:

  
https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/?h=trace/core=59c28cc95f0a9f5556561e2416af4ecf86184b71

Changes since v2: 
https://lore.kernel.org/all/20231213021914.361709...@goodmis.org/

- Fixed up the subbuf tests to ignore multiple spaces before the
  'buf' string (same fix as was done for the trace_marker test).

- This time include Cc'ing linux-trace-kernel :-p

Changes since v1: 
https://lore.kernel.org/all/20231210040448.569462...@goodmis.org/

- Add the last patch that changes the ABI from a file called:

  buffer_subbuf_order  to   buffer_subbuf_size_kb

  That is, I kept the old interface the same, but then added the last
  patch that converts the interface into the one that makes more sense.
  I like keeping this in the git history, especially because of the
  way the implementation is.

- I rebased on top of trace/core in the:

git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git

- I made the tests a bit more advanced. Still a smoke test, but it
  now checks if the string written is the same as the string read.


Steven Rostedt (Google) (10):
  ring-buffer: Clear pages on error in ring_buffer_subbuf_order_set() 
failure
  ring-buffer: Do no swap cpu buffers if order is different
  ring-buffer: Make sure the spare sub buffer used for reads has same size
  tracing: Update snapshot order along with main buffer order
  tracing: Stop the tracing while changing the ring buffer subbuf size
  ring-buffer: Keep the same size when updating the order
  ring-buffer: Just update the subbuffers when changing their allocation 
order
  ring-buffer: Add documentation on the buffer_subbuf_order file
  ringbuffer/selftest: Add basic selftest to test changing subbuf order
  tracing: Update subbuffer with kilobytes not page order

Tzvetomir Stoyanov (VMware) (5):
  ring-buffer: Refactor ring buffer implementation
  ring-buffer: Page size per ring buffer
  ring-buffer: Add interface for configuring trace sub buffer size
  ring-buffer: Set new size of the ring buffer sub page
  ring-buffer: Read and write to ring buffers with custom sub buffer size


 Documentation/trace/ftrace.rst |  21 ++
 include/linux/ring_buffer.h|  17 +-
 kernel/trace/ring_buffer.c | 409 -
 kernel/trace/ring_buffer_benchmark.c   |  10 +-
 kernel/trace/trace.c   | 155 +++-
 kernel/trace/trace.h   |   1 +
 kernel/trace/trace_events.c|  59 ++-
 .../test.d/00basic/ringbuffer_subbuf_size.tc   |  95 +
 8 files changed, 647 insertions(+), 120 deletions(-)
 create mode 100644 
tools/testing/selftests/ftrace/test.d/00basic/ringbuffer_subbuf_size.tc



Re: [PATCH v6 2/4] dax/bus: Use guard(device) in sysfs attribute helpers

2023-12-15 Thread gre...@linuxfoundation.org
On Fri, Dec 15, 2023 at 05:32:50PM +, Verma, Vishal L wrote:
> On Fri, 2023-12-15 at 09:15 -0800, Dan Williams wrote:
> > Greg Kroah-Hartman wrote:
> > > On Thu, Dec 14, 2023 at 10:25:27PM -0700, Vishal Verma wrote:
> > > > Use the guard(device) macro to lock a 'struct device', and unlock it
> > > > automatically when going out of scope using Scope Based Resource
> > > > Management semantics. A lot of the sysfs attribute writes in
> > > > drivers/dax/bus.c benefit from a cleanup using these, so change these
> > > > where applicable.
> > > 
> > > Wait, why are you needing to call device_lock() at all here?  Why is dax
> > > special in needing this when no other subsystem requires it?
> > > 
> > > > 
> > > > Cc: Joao Martins 
> > > > Cc: Dan Williams 
> > > > Signed-off-by: Vishal Verma 
> > > > ---
> > > >  drivers/dax/bus.c | 143 
> > > > ++
> > > >  1 file changed, 59 insertions(+), 84 deletions(-)
> > > > 
> > > > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> > > > index 1ff1ab5fa105..6226de131d17 100644
> > > > --- a/drivers/dax/bus.c
> > > > +++ b/drivers/dax/bus.c
> > > > @@ -294,13 +294,10 @@ static ssize_t available_size_show(struct device 
> > > > *dev,
> > > > struct device_attribute *attr, char *buf)
> > > >  {
> > > > struct dax_region *dax_region = dev_get_drvdata(dev);
> > > > -   unsigned long long size;
> > > >  
> > > > -   device_lock(dev);
> > > > -   size = dax_region_avail_size(dax_region);
> > > > -   device_unlock(dev);
> > > > +   guard(device)(dev);
> > > 
> > > You have a valid device here, why are you locking it?  How can it go
> > > away?  And if it can, shouldn't you have a local lock for it, and not
> > > abuse the driver core lock?
> > 
> > Yes, this is a driver-core lock abuse written by someone who should have
> > known better. And yes, a local lock to protect the dax_region resource
> > tree should replace this. A new rwsem to synchronize all list walks
> > seems appropriate.
> 
> I see why _a_ lock is needed both here and in size_show() - the size
> calculations do a walk over discontiguous ranges, and we don't want the
> device to get reconfigured in the middle of that. A different local
> lock seems reasonable - however can that go as a separate cleanup that
> stands on its own?

Sure, but do not add a conversion to use guard(device) here, as that
will be pointless as you will just use a real lock instead.

> For this series, I'll add a cleanup to replace the sprintfs with
> sysfs_emit().

Why not have that be the first patch in the series?  Then add your local
lock and convert everything to use it instead of the device lock?

thanks,

greg k-h



Re: [PATCH] tracing: Add disable-filter-buf option

2023-12-15 Thread Steven Rostedt
On Fri, 15 Dec 2023 12:24:14 -0500
Mathieu Desnoyers  wrote:

> On 2023-12-15 12:04, Steven Rostedt wrote:
> > On Fri, 15 Dec 2023 10:53:39 -0500
> > Mathieu Desnoyers  wrote:  
> [...]
> >>
> >> So rather than stacking tons of "on/off" switches for filter
> >> features, how about you let users express the mechanism they
> >> want to use for filtering with a string instead ? e.g.
> >>
> >> filter-method="temp-buffer"
> >> filter-method="ring-buffer"
> >> filter-method="input-arguments"  
> > 
> > If I add other ways to filter, it will be a separate file to control
> > that, but all options are on/off switches. Even if I add other
> > functionality to the way buffers are created, this will still have the
> > same functionality to turn the entire thing on or off.  
> 
> I'll be clearer then: I think this is a bad ABI. In your reply, you justify
> this bad ABI by implementation motivations.

What's wrong with a way to stop the copying ?

The option is just a way to say "I don't want to do the copy into the
buffer, I want to go directly into it"

> 
> I don't care about the implementation, I care about the ABI, and
> I feel that your reply is not addressing my concern at all.

Maybe I don't understand your concern.

It's an on/off switch (like all options are). This is just a way to say
"I want to indirect copying of the event before filtering or not".

The "input-argument" part above may never happen. What's different
between tracefs and LTTng, is that all events are created by the
subsystem not by me. You don't use the TRACE_EVENT() macro, but you
need to manually create each event you care about yourself. It's more
of a burden but you also then have the luxury of hooking to the input
portion. That is not exposed, and that is by design. As that could
possibly make all tracepoints an ABI, and you'll have people like
peterz nacking any new tracepoint in the scheduler. He doesn't allow
trace events anymore because of that exposure.

> 
> Moreover, double-negative boolean options (disable-X=false) are confusing.

I agree with this, and will rename it to "filter-buffer" instead. I
blame getting up at 3am for my 5:15am flight for not thinking clearly
on that ;-)

Thanks!

-- Steve



Re: [PATCH v6 2/4] dax/bus: Use guard(device) in sysfs attribute helpers

2023-12-15 Thread Verma, Vishal L
On Fri, 2023-12-15 at 09:15 -0800, Dan Williams wrote:
> Greg Kroah-Hartman wrote:
> > On Thu, Dec 14, 2023 at 10:25:27PM -0700, Vishal Verma wrote:
> > > Use the guard(device) macro to lock a 'struct device', and unlock it
> > > automatically when going out of scope using Scope Based Resource
> > > Management semantics. A lot of the sysfs attribute writes in
> > > drivers/dax/bus.c benefit from a cleanup using these, so change these
> > > where applicable.
> > 
> > Wait, why are you needing to call device_lock() at all here?  Why is dax
> > special in needing this when no other subsystem requires it?
> > 
> > > 
> > > Cc: Joao Martins 
> > > Cc: Dan Williams 
> > > Signed-off-by: Vishal Verma 
> > > ---
> > >  drivers/dax/bus.c | 143 
> > > ++
> > >  1 file changed, 59 insertions(+), 84 deletions(-)
> > > 
> > > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> > > index 1ff1ab5fa105..6226de131d17 100644
> > > --- a/drivers/dax/bus.c
> > > +++ b/drivers/dax/bus.c
> > > @@ -294,13 +294,10 @@ static ssize_t available_size_show(struct device 
> > > *dev,
> > > struct device_attribute *attr, char *buf)
> > >  {
> > > struct dax_region *dax_region = dev_get_drvdata(dev);
> > > -   unsigned long long size;
> > >  
> > > -   device_lock(dev);
> > > -   size = dax_region_avail_size(dax_region);
> > > -   device_unlock(dev);
> > > +   guard(device)(dev);
> > 
> > You have a valid device here, why are you locking it?  How can it go
> > away?  And if it can, shouldn't you have a local lock for it, and not
> > abuse the driver core lock?
> 
> Yes, this is a driver-core lock abuse written by someone who should have
> known better. And yes, a local lock to protect the dax_region resource
> tree should replace this. A new rwsem to synchronize all list walks
> seems appropriate.

I see why _a_ lock is needed both here and in size_show() - the size
calculations do a walk over discontiguous ranges, and we don't want the
device to get reconfigured in the middle of that. A different local
lock seems reasonable - however can that go as a separate cleanup that
stands on its own?

For this series, I'll add a cleanup to replace the sprintfs with
sysfs_emit().


Re: [PATCH] tracing: Add disable-filter-buf option

2023-12-15 Thread Mathieu Desnoyers

On 2023-12-15 12:04, Steven Rostedt wrote:

On Fri, 15 Dec 2023 10:53:39 -0500
Mathieu Desnoyers  wrote:

[...]


So rather than stacking tons of "on/off" switches for filter
features, how about you let users express the mechanism they
want to use for filtering with a string instead ? e.g.

filter-method="temp-buffer"
filter-method="ring-buffer"
filter-method="input-arguments"


If I add other ways to filter, it will be a separate file to control
that, but all options are on/off switches. Even if I add other
functionality to the way buffers are created, this will still have the
same functionality to turn the entire thing on or off.


I'll be clearer then: I think this is a bad ABI. In your reply, you justify
this bad ABI by implementation motivations.

I don't care about the implementation, I care about the ABI, and
I feel that your reply is not addressing my concern at all.

Moreover, double-negative boolean options (disable-X=false) are confusing.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [PATCH v6 2/4] dax/bus: Use guard(device) in sysfs attribute helpers

2023-12-15 Thread Dan Williams
Greg Kroah-Hartman wrote:
> On Thu, Dec 14, 2023 at 10:25:27PM -0700, Vishal Verma wrote:
> > Use the guard(device) macro to lock a 'struct device', and unlock it
> > automatically when going out of scope using Scope Based Resource
> > Management semantics. A lot of the sysfs attribute writes in
> > drivers/dax/bus.c benefit from a cleanup using these, so change these
> > where applicable.
> 
> Wait, why are you needing to call device_lock() at all here?  Why is dax
> special in needing this when no other subsystem requires it?
> 
> > 
> > Cc: Joao Martins 
> > Cc: Dan Williams 
> > Signed-off-by: Vishal Verma 
> > ---
> >  drivers/dax/bus.c | 143 
> > ++
> >  1 file changed, 59 insertions(+), 84 deletions(-)
> > 
> > diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> > index 1ff1ab5fa105..6226de131d17 100644
> > --- a/drivers/dax/bus.c
> > +++ b/drivers/dax/bus.c
> > @@ -294,13 +294,10 @@ static ssize_t available_size_show(struct device *dev,
> > struct device_attribute *attr, char *buf)
> >  {
> > struct dax_region *dax_region = dev_get_drvdata(dev);
> > -   unsigned long long size;
> >  
> > -   device_lock(dev);
> > -   size = dax_region_avail_size(dax_region);
> > -   device_unlock(dev);
> > +   guard(device)(dev);
> 
> You have a valid device here, why are you locking it?  How can it go
> away?  And if it can, shouldn't you have a local lock for it, and not
> abuse the driver core lock?

Yes, this is a driver-core lock abuse written by someone who should have
known better. And yes, a local lock to protect the dax_region resource
tree should replace this. A new rwsem to synchronize all list walks
seems appropriate.



[PATCH net-next 16/24] net: netkit, veth, tun, virt*: Use nested-BH locking for XDP redirect.

2023-12-15 Thread Sebastian Andrzej Siewior
The per-CPU variables used during bpf_prog_run_xdp() invocation and
later during xdp_do_redirect() rely on disabled BH for their protection.
Without locking in local_bh_disable() on PREEMPT_RT these data structures
require explicit locking.

This is a follow-up on the previous change which introduced
bpf_run_lock.redirect_lock and uses it now within drivers.

The simple way is to acquire the lock before bpf_prog_run_xdp() is
invoked and hold it until the end of function.
This does not always work because some drivers (cpsw, atlantic) invoke
xdp_do_flush() in the same context.
Acquiring the lock in bpf_prog_run_xdp() and dropping in
xdp_do_redirect() (without touching drivers) does not work because not
all driver, which use bpf_prog_run_xdp(), do support XDP_REDIRECT (and
invoke xdp_do_redirect()).

Ideally the minimal locking scope would be bpf_prog_run_xdp() +
xdp_do_redirect() and everything else (error recovery, DMA unmapping,
free/ alloc of memory, …) would happen outside of the locked section.

Cc: "K. Y. Srinivasan" 
Cc: "Michael S. Tsirkin" 
Cc: Alexei Starovoitov 
Cc: Andrii Nakryiko 
Cc: Dexuan Cui 
Cc: Haiyang Zhang 
Cc: Hao Luo 
Cc: Jesper Dangaard Brouer 
Cc: Jiri Olsa 
Cc: John Fastabend 
Cc: Juergen Gross 
Cc: KP Singh 
Cc: Martin KaFai Lau 
Cc: Nikolay Aleksandrov 
Cc: Song Liu 
Cc: Stanislav Fomichev 
Cc: Stefano Stabellini 
Cc: Wei Liu 
Cc: Willem de Bruijn 
Cc: Xuan Zhuo 
Cc: Yonghong Song 
Cc: b...@vger.kernel.org
Cc: virtualizat...@lists.linux.dev
Cc: xen-de...@lists.xenproject.org
Signed-off-by: Sebastian Andrzej Siewior 
---
 drivers/net/hyperv/netvsc_bpf.c |  1 +
 drivers/net/netkit.c| 13 +++
 drivers/net/tun.c   | 28 +--
 drivers/net/veth.c  | 40 -
 drivers/net/virtio_net.c|  1 +
 drivers/net/xen-netfront.c  |  1 +
 6 files changed, 52 insertions(+), 32 deletions(-)

diff --git a/drivers/net/hyperv/netvsc_bpf.c b/drivers/net/hyperv/netvsc_bpf.c
index 4a9522689fa4f..55f8ca92ca199 100644
--- a/drivers/net/hyperv/netvsc_bpf.c
+++ b/drivers/net/hyperv/netvsc_bpf.c
@@ -58,6 +58,7 @@ u32 netvsc_run_xdp(struct net_device *ndev, struct 
netvsc_channel *nvchan,
 
memcpy(xdp->data, data, len);
 
+   guard(local_lock_nested_bh)(_run_lock.redirect_lock);
act = bpf_prog_run_xdp(prog, xdp);
 
switch (act) {
diff --git a/drivers/net/netkit.c b/drivers/net/netkit.c
index 39171380ccf29..fbcf78477bda8 100644
--- a/drivers/net/netkit.c
+++ b/drivers/net/netkit.c
@@ -80,8 +80,15 @@ static netdev_tx_t netkit_xmit(struct sk_buff *skb, struct 
net_device *dev)
netkit_prep_forward(skb, !net_eq(dev_net(dev), dev_net(peer)));
skb->dev = peer;
entry = rcu_dereference(nk->active);
-   if (entry)
-   ret = netkit_run(entry, skb, ret);
+   if (entry) {
+   scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+   ret = netkit_run(entry, skb, ret);
+   if (ret == NETKIT_REDIRECT) {
+   dev_sw_netstats_tx_add(dev, 1, len);
+   skb_do_redirect(skb);
+   }
+   }
+   }
switch (ret) {
case NETKIT_NEXT:
case NETKIT_PASS:
@@ -95,8 +102,6 @@ static netdev_tx_t netkit_xmit(struct sk_buff *skb, struct 
net_device *dev)
}
break;
case NETKIT_REDIRECT:
-   dev_sw_netstats_tx_add(dev, 1, len);
-   skb_do_redirect(skb);
break;
case NETKIT_DROP:
default:
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index afa5497f7c35c..fe0d31f11e4b6 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1708,16 +1708,18 @@ static struct sk_buff *tun_build_skb(struct tun_struct 
*tun,
 xdp_init_buff(&xdp, buflen, &tfile->xdp_rxq);
 xdp_prepare_buff(&xdp, buf, pad, len, false);
 
-   act = bpf_prog_run_xdp(xdp_prog, &xdp);
-   if (act == XDP_REDIRECT || act == XDP_TX) {
-   get_page(alloc_frag->page);
-   alloc_frag->offset += buflen;
-   }
-   err = tun_xdp_act(tun, xdp_prog, &xdp, act);
-   if (err < 0) {
-   if (act == XDP_REDIRECT || act == XDP_TX)
-   put_page(alloc_frag->page);
-   goto out;
+   scoped_guard(local_lock_nested_bh, &bpf_run_lock.redirect_lock) {
+   act = bpf_prog_run_xdp(xdp_prog, &xdp);
+   if (act == XDP_REDIRECT || act == XDP_TX) {
+   get_page(alloc_frag->page);
+   alloc_frag->offset += buflen;
+   }
+   err = tun_xdp_act(tun, xdp_prog, &xdp, act);
+   if (err < 0) {
+   if (act == XDP_REDIRECT || act == XDP_TX)
+ 

Re: [PATCH] tracing: Add disable-filter-buf option

2023-12-15 Thread Steven Rostedt
On Fri, 15 Dec 2023 10:53:39 -0500
Mathieu Desnoyers  wrote:
> 
> 
> I'm not convinced that a boolean state is what you need here.

I will admit the biggest motivation for this was to allow for debugging ;-)

> 
> Yes, today you are in a position where you have two options:
> 
> a) use the filter buffer, which falls back on copy to ring buffer
> if nested,
> 
> b) disable the filter buffer, and thus always copy to ring buffer
> before filtering,
> 
> But I think it would not be far-fetched to improve the implementation
> of the filter-buffer to have one filter-buffer per nesting level
> (per execution context), or just implement the filter buffer as a
> per-cpu stack, which would remove the need to fall back on copy
> to ring buffer when nested. Or you could even do like LTTng and
> filter on the input arguments directly.

The filtering on the input arguments is much more difficult as they are
not exposed to user space. I plan on adding that feature, but it will
be more tied to the probe infrastructure, and be separate from this
type of filtering.

When looking at removing the discard, I found that the histogram logic
currently depends on writing to the ring buffer directly. That's
because it needs to know the timestamp, and may or may not discard it.
I'll have to look at this code more and see if I can get rid of that.
In fact, this patch taps into that logic (it's what added the
tracing_set_filter_buffering() function).

> 
> But each of these changes would add yet another boolean tunable,
> which can quickly make the mix hard to understand from a user
> perspective.

Honestly, if I do the stack filtering (it's already per_cpu), it will
replace this altogether, and this option will still be available. That
is, it will switch off buffering an event regardless of the
implementation.
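
To make the idea concrete, here is a rough sketch of what such a per-CPU
stack of filter buffers could look like (purely illustrative, none of these
names exist in the tracing code today):

	#define FILTER_BUF_NEST_MAX	4

	struct filter_buf_stack {
		int	nesting;
		char	buf[FILTER_BUF_NEST_MAX][PAGE_SIZE];
	};

	static DEFINE_PER_CPU(struct filter_buf_stack, filter_buf_stack);

	static void *get_filter_buf(void)
	{
		struct filter_buf_stack *s = this_cpu_ptr(&filter_buf_stack);

		if (s->nesting >= FILTER_BUF_NEST_MAX)
			return NULL;	/* fall back to the ring buffer */

		return s->buf[s->nesting++];
	}

An interrupting event would then simply get the next slot instead of
bypassing the temporary buffer entirely.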

> 
> So rather than stacking tons of "on/off" switches for filter
> features, how about you let users express the mechanism they
> want to use for filtering with a string instead ? e.g.
> 
> filter-method="temp-buffer"
> filter-method="ring-buffer"
> filter-method="input-arguments"

If I add other ways to filter, it will be a separate file to control
that, but all options are on/off switches. Even if I add other
functionality to the way buffers are created, this will still have the
same functionality to turn the entire thing on or off.

Thanks for the review.

-- Steve



Re: [PATCH V2] remoteproc: virtio: Fix wdg cannot recovery remote processor

2023-12-15 Thread Arnaud POULIQUEN
Hello  Joakim,

On 12/15/23 15:50, joakim.zh...@cixtech.com wrote:
> From: Joakim Zhang 
> 
> Recovery remote processor failed when wdg irq received:
> [0.842574] remoteproc remoteproc0: crash detected in cix-dsp-rproc: type 
> watchdog
> [0.842750] remoteproc remoteproc0: handling crash #1 in cix-dsp-rproc
> [0.842824] remoteproc remoteproc0: recovering cix-dsp-rproc
> [0.843342] remoteproc remoteproc0: stopped remote processor cix-dsp-rproc
> [0.847901] rproc-virtio rproc-virtio.0.auto: Failed to associate buffer
> [0.847979] remoteproc remoteproc0: failed to probe subdevices for 
> cix-dsp-rproc: -16
> 
> The reason is that dma coherent mem would not be released when
> recovering the remote processor, due to rproc_virtio_remove()
> would not be called, where the mem released. It will fail when
> it try to allocate and associate buffer again.
> 
> We can see that dma coherent mem allocated from rproc_add_virtio_dev(),
> so should release it from rproc_remove_virtio_dev(). These functions should
> appear symmetrically:
> -rproc_vdev_do_start()->rproc_add_virtio_dev()->dma_declare_coherent_memory()
> -rproc_vdev_do_stop()->rproc_remove_virtio_dev()->dma_release_coherent_memory()
> 
> The same for of_reserved_mem_device_init_by_idx() and 
> of_reserved_mem_device_release().
> 
> Fixes: 1d7b61c06dc3 ("remoteproc: virtio: Create platform device for the 
> remoteproc_virtio")
> Signed-off-by: Joakim Zhang 
> ---
> ChangeLogs:
> V1->V2:
> * the same for of_reserved_mem_device_release()
> ---
>  drivers/remoteproc/remoteproc_virtio.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/remoteproc/remoteproc_virtio.c 
> b/drivers/remoteproc/remoteproc_virtio.c
> index 83d76915a6ad..e877ee78740d 100644
> --- a/drivers/remoteproc/remoteproc_virtio.c
> +++ b/drivers/remoteproc/remoteproc_virtio.c
> @@ -465,8 +465,12 @@ static int rproc_add_virtio_dev(struct rproc_vdev 
> *rvdev, int id)
>  static int rproc_remove_virtio_dev(struct device *dev, void *data)
>  {
> struct virtio_device *vdev = dev_to_virtio(dev);
> +   struct rproc_vdev *rvdev = vdev_to_rvdev(vdev);
> 
> unregister_virtio_device(vdev);
> +   of_reserved_mem_device_release(&rvdev->pdev->dev);
> +   dma_release_coherent_memory(&rvdev->pdev->dev);
> +

At this step, the virtio device may not be released and may still be using the
memory.
Did you try moving this into rproc_virtio_dev_release()?

Regards,
Arnaud

> return 0;
>  }
> 
> @@ -584,9 +588,6 @@ static void rproc_virtio_remove(struct platform_device 
> *pdev)
> rproc_remove_subdev(rproc, &rvdev->subdev);
> rproc_remove_rvdev(rvdev);
> 
> -   of_reserved_mem_device_release(&pdev->dev);
> -   dma_release_coherent_memory(&pdev->dev);
> -
> put_device(>dev);
>  }
> 
> --
> 2.25.1
> 
> 
> 



[PATCH 1/2] ring-buffer: Replace rb_time_cmpxchg() with rb_time_cmp_and_update()

2023-12-15 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

There's only one place that performs a 64-bit cmpxchg for the timestamp
processing. The cmpxchg is only to set the write_stamp equal to the
before_stamp, and if it doesn't get set, then the next event will simply
be forced to add an absolute timestamp.

Given that 64-bit cmpxchg is expensive on 32-bit, and the current
workaround of 3 consecutive 32-bit cmpxchg operations doesn't make it any
faster, it's best to just not do the cmpxchg, as a simple compare works for the
accuracy of the timestamp. The only thing that will happen without the
cmpxchg is the prepended absolute timestamp on the next event which is not
that big of a deal as the path where this happens is seldom hit because it
requires an interrupt to happen between a few lines of code that also
writes an event into the same buffer.

With this change, the 32-bit rb_time_t workaround can be removed.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 23 ---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 1a26b3a1901f..c6842a4331a9 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -762,6 +762,15 @@ static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 
set)
 }
 #endif
 
+/*
+ * Returns true if t == expect, and if possible will update with set,
+ * but t is not guaranteed to be updated even if this returns true
+ */
+static bool rb_time_cmp_and_update(rb_time_t *t, u64 expect, u64 set)
+{
+   return rb_time_cmpxchg(t, expect, set);
+}
+
 /*
  * Enable this to make sure that the event passed to
  * ring_buffer_event_time_stamp() is not committed and also
@@ -3622,9 +3631,17 @@ __rb_reserve_next(struct ring_buffer_per_cpu *cpu_buffer,
barrier();
 /*E*/ if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) &&
info->after < ts &&
-   rb_time_cmpxchg(&cpu_buffer->write_stamp,
-   info->after, ts)) {
-   /* Nothing came after this event between C and E */
+   rb_time_cmp_and_update(&cpu_buffer->write_stamp,
+  info->after, ts)) {
+   /*
+* Nothing came after this event between C and E; it is
+* safe to use info->after for the delta.
+* The above rb_time_cmp_and_update() may or may not
+* have updated the write_stamp. If it did not then
+* the next event will simply add an absolute timestamp
+* as the write_stamp will be different than the
+* before_stamp.
+*/
info->delta = ts - info->after;
} else {
/*
-- 
2.42.0





[PATCH 2/2] ring-buffer: Remove 32bit timestamp logic

2023-12-15 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Each event has a 27 bit timestamp delta that is used to hold the delta
from the last event. If the time between events is greater than 2^27, then
a timestamp is added that holds a 59 bit absolute timestamp.
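
(As an illustration only, not the actual ring-buffer code, the choice boils
down to something like:

	delta = ts - last_write_stamp;
	if (delta >= (1ULL << 27))
		/* too big for the 27 bit field: emit an absolute timestamp */
		add_absolute_timestamp(ts);
	else
		event->time_delta = delta;

where add_absolute_timestamp() is just a stand-in for the real absolute
timestamp injection.)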

Until a389d86f7fd09 ("ring-buffer: Have nested events still record running
time stamp"), if an interrupt interrupted an event in progress, all the
events' deltas would be zero so as not to deal with the races that need to be
handled. Commit a389d86f7fd09 changed that to handle the races so that
all events, even those that preempt other events, still have an accurate
timestamp.

To handle those races requires performing 64-bit cmpxchg on the
timestamps. But doing 64-bit cmpxchg on 32-bit architectures is considered
very slow. To try to deal with this the timestamp logic was broken into
two and then three 32-bit cmpxchgs, with the thought that two (or three)
32-bit cmpxchgs are still faster than a single 64-bit cmpxchg on 32-bit
architectures.

Part of the problem with this is that I didn't have any 32-bit
architectures to test on. After hitting several subtle bugs in this code,
an effort was made to try and see if three 32-bit cmpxchgs are indeed
faster than a single 64-bit. After a few people brushed off the dust of
their old 32-bit machines, tests were done, and even though 32-bit cmpxchg
was faster than a single 64-bit, it was in the order of 50% at best, not
300%.

After some more refactoring of the code, 3 of the 4 64-bit cmpxchg were
removed:

 
https://lore.kernel.org/linux-trace-kernel/2023124420.36dde...@gandalf.local.home
 
https://lore.kernel.org/linux-trace-kernel/20231214222921.19303...@gandalf.local.home
 
https://lore.kernel.org/linux-trace-kernel/20231215081810.1f4f3...@rorschach.local.home

The last remaining 64-bit cmpxchg is only in a slow path and to keep the
next event from having to add an absolute timestamp. This is not worth the
trouble of a slow cmpxchg. Only do the cmpxchg on 64-bit architectures and
just have the 32-bit architectures insert an absolute timestamp for the
next event that comes after the slow path. The slow path only happens if
the event being recorded is interrupted between a few lines of code and
that interrupt writes an event into the same buffer. This seldom happens
and should not cause any issues by inserting an extra absolute timestamp
when it does happen.

Now the complex 32-bit workaround for 64-bit cmpxchg can be removed.

Link: https://lore.kernel.org/all/20231213214632.15047...@gandalf.local.home/

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 230 +++--
 1 file changed, 19 insertions(+), 211 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index c6842a4331a9..47f9eda99769 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -463,27 +463,9 @@ enum {
RB_CTX_MAX
 };
 
-#if BITS_PER_LONG == 32
-#define RB_TIME_32
-#endif
-
-/* To test on 64 bit machines */
-//#define RB_TIME_32
-
-#ifdef RB_TIME_32
-
-struct rb_time_struct {
-   local_t cnt;
-   local_t top;
-   local_t bottom;
-   local_t msb;
-};
-#else
-#include 
 struct rb_time_struct {
local64_t   time;
 };
-#endif
 typedef struct rb_time_struct rb_time_t;
 
 #define MAX_NEST   5
@@ -573,202 +555,35 @@ struct ring_buffer_iter {
int missed_events;
 };
 
-#ifdef RB_TIME_32
-
-/*
- * On 32 bit machines, local64_t is very expensive. As the ring
- * buffer doesn't need all the features of a true 64 bit atomic,
- * on 32 bit, it uses these functions (64 still uses local64_t).
- *
- * For the ring buffer, 64 bit required operations for the time is
- * the following:
- *
- *  - Reads may fail if it interrupted a modification of the time stamp.
- *  It will succeed if it did not interrupt another write even if
- *  the read itself is interrupted by a write.
- *  It returns whether it was successful or not.
- *
- *  - Writes always succeed and will overwrite other writes and writes
- *  that were done by events interrupting the current write.
- *
- *  - A write followed by a read of the same time stamp will always succeed,
- *  but may not contain the same value.
- *
- *  - A cmpxchg will fail if it interrupted another write or cmpxchg.
- *  Other than that, it acts like a normal cmpxchg.
- *
- * The 60 bit time stamp is broken up by 30 bits in a top and bottom half
- *  (bottom being the least significant 30 bits of the 60 bit time stamp).
- *
- * The two most significant bits of each half holds a 2 bit counter (0-3).
- * Each update will increment this counter by one.
- * When reading the top and bottom, if the two counter bits match then the
- *  top and bottom together make a valid 60 bit number.
- */
-#define RB_TIME_SHIFT  30
-#define RB_TIME_VAL_MASK ((1 << RB_TIME_SHIFT) - 1)
-#define RB_TIME_MSB_SHIFT   60
-
-static 

[PATCH 0/2] tracing: Replace final 64-bit cmpxchg with compare and update if available

2023-12-15 Thread Steven Rostedt


With the introduction of a389d86f7fd09 ("ring-buffer: Have nested events
still record running time stamp"), the timestamps required a 64-bit
cmpxchg. As some architectures do not even have a 64-bit cmpxchg, the code
for 32-bit architectures used 3 32-bit words that represented the 64-bit
timestamp and this also required performing 3 32-bit cmpxchg where a single
64-bit would do.

While debugging the ring-buffer, it was found that the locations of 3 of the
4 cmpxchg() were actually causing bugs, and the code was restructured to
remove them! See:

 
https://lore.kernel.org/linux-trace-kernel/2023124420.36dde...@gandalf.local.home
 
https://lore.kernel.org/linux-trace-kernel/20231214222921.19303...@gandalf.local.home
 
https://lore.kernel.org/linux-trace-kernel/20231215081810.1f4f3...@rorschach.local.home

The only remaining cmpxchg() is in a slow path that gets hit if an interrupt
were to come in during the allocation of an event and itself would write an
event to the same buffer. The interrupted event would detect this, and use
the cmpxchg for two purposes. One was to know if the interrupt happened
before it allocated its space (where it can use the timestamp that was
found), and the other is to set the write_stamp back to matching the
before_stamp, where the event after the interrupted event would not need to
add an absolute timestamp (it's 8 bytes added to the ring buffer).

The first purpose of the cmpxchg could also be done with a simple compare.
The second purpose (avoiding the addition of the absolute timestamp)
requires the cmpxchg. Considering the complexity of the 32-bit version of
the 64-bit cmpxchg, avoiding an added absolute timestamp doesn't justify it.

The first patch replaces the last 64-bit cmpxchg() with a
rb_time_cmp_and_update() that will return true if the timestamp matches the
expected result. It will optionally update the timestamp with the "set"
parameter if cmpxchg is available.

The second patch removes the 32-bit version of the 64-bit cmpxchg and simply
does the compare. This also removes the complex logic of that code. The
difference now is that 32-bit architectures will have to add an absolute
timestamp to an event that follows an event that suffered the race
condition.
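
In other words, a sketch of the end state (assuming the same BITS_PER_LONG
check the removed code used; see the individual patches for the actual diffs):

	static bool rb_time_cmp_and_update(rb_time_t *t, u64 expect, u64 set)
	{
	#if BITS_PER_LONG == 64
		/* try to also update, avoiding the absolute timestamp */
		return local64_cmpxchg(&t->time, expect, set) == expect;
	#else
		/* compare only; the next event just adds an absolute timestamp */
		return local64_read(&t->time) == expect;
	#endif
	}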


Steven Rostedt (Google) (2):
  ring-buffer: Replace rb_time_cmpxchg() with rb_time_cmp_and_update()
  ring-buffer: Remove 32bit timestamp logic


 kernel/trace/ring_buffer.c | 247 +++--
 1 file changed, 36 insertions(+), 211 deletions(-)



Re: [PATCH] [v2] nvdimm-btt: fix several memleaks

2023-12-15 Thread Ira Weiny
Ira Weiny wrote:
> Dinghao Liu wrote:

[snip]

> >  
> > -static int btt_maplocks_init(struct arena_info *arena)
> > +static int btt_maplocks_init(struct device *dev, struct arena_info *arena)
> >  {
> > u32 i;
> >  
> > -   arena->map_locks = kcalloc(arena->nfree, sizeof(struct aligned_lock),
> > +   arena->map_locks = devm_kcalloc(dev, arena->nfree, sizeof(struct 
> > aligned_lock),
> > GFP_KERNEL);
> > if (!arena->map_locks)
> > return -ENOMEM;
> > @@ -805,9 +805,6 @@ static void free_arenas(struct btt *btt)
> >  
> > list_for_each_entry_safe(arena, next, &btt->arena_list, list) {
> > list_del(&arena->list);
> > -   kfree(arena->rtt);
> > -   kfree(arena->map_locks);
> > -   kfree(arena->freelist);
> 
> This does not quite work.
> 
> free_arenas() is used in the error paths of create_arenas() and
> discover_arenas().  In those cases devm_kfree() is probably a better way
> to clean up this.
> 
> However...
> 
> > debugfs_remove_recursive(arena->debugfs_dir);
> > kfree(arena);
> 
> Why can't arena be allocated with devm_*?
> 
> We need to change this up a bit more to handle the error path vs regular
> device shutdown free (automatic devm frees).

We might want to look at using no_free_ptr() in this code.  See this
patch[1] for an example of how to inhibit the cleanup and pass the pointer
on when the function succeeds.

[1] 
https://lore.kernel.org/all/170261791914.1714654.6447680285357545638.st...@dwillia2-xfh.jf.intel.com/
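
For reference, the pattern looks roughly like this (a simplified sketch, not
the actual btt code):

	struct arena_info *arena __free(kfree) =
			kzalloc(sizeof(*arena), GFP_KERNEL);

	if (!arena)
		return NULL;

	/*
	 * ... setup that can fail just returns NULL and 'arena' is
	 * automatically kfree()d by the __free() cleanup ...
	 */

	return no_free_ptr(arena);	/* success: hand ownership to the caller */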

Ira



Re: [PATCH v6 2/4] dax/bus: Use guard(device) in sysfs attribute helpers

2023-12-15 Thread gre...@linuxfoundation.org
On Fri, Dec 15, 2023 at 06:33:58AM +, Verma, Vishal L wrote:
> On Fri, 2023-12-15 at 05:56 +, Matthew Wilcox wrote:
> > On Thu, Dec 14, 2023 at 10:25:27PM -0700, Vishal Verma wrote:
> > > @@ -294,13 +294,10 @@ static ssize_t available_size_show(struct device 
> > > *dev,
> > > struct device_attribute *attr, char *buf)
> > >  {
> > > struct dax_region *dax_region = dev_get_drvdata(dev);
> > > -   unsigned long long size;
> > >  
> > > -   device_lock(dev);
> > > -   size = dax_region_avail_size(dax_region);
> > > -   device_unlock(dev);
> > > +   guard(device)(dev);
> > >  
> > > -   return sprintf(buf, "%llu\n", size);
> > > +   return sprintf(buf, "%llu\n", dax_region_avail_size(dax_region));
> > >  }
> > 
> > Is this an appropriate use of guard()?  sprintf is not the fastest of
> > functions, so we will end up holding the device_lock for longer than
> > we used to.
> 
> Hi Matthew,
> 
> Agreed that we end up holding the lock for a bit longer in many of
> these. I'm inclined to say this is okay, since these are all user
> configuration paths through sysfs, not affecting any sort of runtime
> performance.

Why does the lock have to be taken at all?  You have a valid reference,
isn't that all you need?

thanks,

greg k-h



Re: [PATCH 2/4] sched: Take cpufreq feedback into account

2023-12-15 Thread Lukasz Luba




On 12/12/23 14:27, Vincent Guittot wrote:

Aggregate the different pressures applied on the capacity of CPUs and
create a new function that returns the actual capacity of the CPU:
   get_actual_cpu_capacity()

Signed-off-by: Vincent Guittot 
---
  kernel/sched/fair.c | 43 +++
  1 file changed, 23 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bcea3d55d95d..11d3be829302 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4932,12 +4932,20 @@ static inline void util_est_update(struct cfs_rq 
*cfs_rq,
trace_sched_util_est_se_tp(&p->se);
  }
  
+static inline unsigned long get_actual_cpu_capacity(int cpu)

+{
+   unsigned long capacity = arch_scale_cpu_capacity(cpu);
+
+   capacity -= max(thermal_load_avg(cpu_rq(cpu)), 
cpufreq_get_pressure(cpu));
+
+   return capacity;
+}
  static inline int util_fits_cpu(unsigned long util,
unsigned long uclamp_min,
unsigned long uclamp_max,
int cpu)
  {
-   unsigned long capacity_orig, capacity_orig_thermal;
+   unsigned long capacity_orig;
unsigned long capacity = capacity_of(cpu);
bool fits, uclamp_max_fits;
  
@@ -4970,7 +4978,6 @@ static inline int util_fits_cpu(unsigned long util,

 * goal is to cap the task. So it's okay if it's getting less.
 */
capacity_orig = arch_scale_cpu_capacity(cpu);
-   capacity_orig_thermal = capacity_orig - 
arch_scale_thermal_pressure(cpu);
  
  	/*

 * We want to force a task to fit a cpu as implied by uclamp_max.
@@ -5045,7 +5052,7 @@ static inline int util_fits_cpu(unsigned long util,
 * handle the case uclamp_min > uclamp_max.
 */
uclamp_min = min(uclamp_min, uclamp_max);
-   if (fits && (util < uclamp_min) && (uclamp_min > capacity_orig_thermal))
+   if (fits && (util < uclamp_min) && (uclamp_min > 
get_actual_cpu_capacity(cpu)))


That's quite a long line, I would break it into 2.
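
e.g. something like:

	if (fits && (util < uclamp_min) &&
	    (uclamp_min > get_actual_cpu_capacity(cpu)))
		return -1;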


return -1;
  
  	return fits;

@@ -7426,7 +7433,7 @@ select_idle_capacity(struct task_struct *p, struct 
sched_domain *sd, int target)
 * Look for the CPU with best capacity.
 */
else if (fits < 0)
-   cpu_cap = arch_scale_cpu_capacity(cpu) - 
thermal_load_avg(cpu_rq(cpu));
+   cpu_cap = get_actual_cpu_capacity(cpu);
  
  		/*

 * First, select CPU which fits better (-1 being better than 0).
@@ -7919,8 +7926,8 @@ static int find_energy_efficient_cpu(struct task_struct 
*p, int prev_cpu)
struct root_domain *rd = this_rq()->rd;
int cpu, best_energy_cpu, target = -1;
int prev_fits = -1, best_fits = -1;
-   unsigned long best_thermal_cap = 0;
-   unsigned long prev_thermal_cap = 0;
+   unsigned long best_actual_cap = 0;
+   unsigned long prev_actual_cap = 0;
struct sched_domain *sd;
struct perf_domain *pd;
struct energy_env eenv;
@@ -7950,7 +7957,7 @@ static int find_energy_efficient_cpu(struct task_struct 
*p, int prev_cpu)
  
  	for (; pd; pd = pd->next) {

unsigned long util_min = p_util_min, util_max = p_util_max;
-   unsigned long cpu_cap, cpu_thermal_cap, util;
+   unsigned long cpu_cap, cpu_actual_cap, util;
long prev_spare_cap = -1, max_spare_cap = -1;
unsigned long rq_util_min, rq_util_max;
unsigned long cur_delta, base_energy;
@@ -7962,18 +7969,17 @@ static int find_energy_efficient_cpu(struct task_struct 
*p, int prev_cpu)
if (cpumask_empty(cpus))
continue;
  
-		/* Account thermal pressure for the energy estimation */

+   /* Account external pressure for the energy estimation */
cpu = cpumask_first(cpus);
-   cpu_thermal_cap = arch_scale_cpu_capacity(cpu);
-   cpu_thermal_cap -= arch_scale_thermal_pressure(cpu);
+   cpu_actual_cap = get_actual_cpu_capacity(cpu);
  
-		eenv.cpu_cap = cpu_thermal_cap;

+   eenv.cpu_cap = cpu_actual_cap;
eenv.pd_cap = 0;
  
  		for_each_cpu(cpu, cpus) {

struct rq *rq = cpu_rq(cpu);
  
-			eenv.pd_cap += cpu_thermal_cap;

+   eenv.pd_cap += cpu_actual_cap;
  
  			if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))

continue;
@@ -8044,7 +8050,7 @@ static int find_energy_efficient_cpu(struct task_struct 
*p, int prev_cpu)
if (prev_delta < base_energy)
goto unlock;
prev_delta -= base_energy;
-   prev_thermal_cap = cpu_thermal_cap;
+   prev_actual_cap = cpu_actual_cap;
best_delta = min(best_delta, prev_delta);

Re: [PATCH] tracing: Add disable-filter-buf option

2023-12-15 Thread Mathieu Desnoyers

On 2023-12-15 10:26, Steven Rostedt wrote:

From: "Steven Rostedt (Google)" 

Normally, when the filter is enabled, a temporary buffer is created to
copy the event data into it to perform the filtering logic. If the filter
passes and the event should be recorded, then the event is copied from the
temporary buffer into the ring buffer. If the event is to be discarded
then it is simply dropped. If another event comes in via an interrupt, it
will not use the temporary buffer as it is busy and will write directly
into the ring buffer.

The disable-filter-buf option will disable the temporary buffer and always
write into the ring buffer. This will avoid the copy when the event is to
be recorded, but also adds a bit more overhead on the discard, and if
another event were to interrupt the event that is to be discarded, then
the event will not be removed from the ring buffer but instead converted
to padding that will not be read by the reader. Padding will still take up
space on the ring buffer.

This option can be beneficial if most events are recorded and not
discarded, or simply for debugging the discard functionality of the ring
buffer.

Also fix some whitespace (that was fixed by editing this in vscode).



I'm not convinced that a boolean state is what you need here.

Yes, today you are in a position where you have two options:

a) use the filter buffer, which falls back on copy to ring buffer
   if nested,

b) disable the filter buffer, and thus always copy to ring buffer
   before filtering,

But I think it would not be far-fetched to improve the implementation
of the filter-buffer to have one filter-buffer per nesting level
(per execution context), or just implement the filter buffer as a
per-cpu stack, which would remove the need to fall back on copy
to ring buffer when nested. Or you could even do like LTTng and
filter on the input arguments directly.

But each of these changes would add yet another boolean tunable,
which can quickly make the mix hard to understand from a user
perspective.

So rather than stacking tons of "on/off" switches for filter
features, how about you let users express the mechanism they
want to use for filtering with a string instead ? e.g.

filter-method="temp-buffer"
filter-method="ring-buffer"
filter-method="input-arguments"

?
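
A rough sketch of what parsing such a string option could look like on the
kernel side (all names here are made up for illustration):

	static const char * const filter_methods[] = {
		"temp-buffer", "ring-buffer", "input-arguments",
	};

	static int filter_method_set(struct trace_array *tr, const char *str)
	{
		int i = sysfs_match_string(filter_methods, str);

		if (i < 0)
			return i;

		/* switch the filtering implementation for 'tr' to method i */
		return 0;
	}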

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




Re: [PATCH 0/5] Rework system pressure interface to the scheduler

2023-12-15 Thread Lukasz Luba




On 12/14/23 08:32, Lukasz Luba wrote:



On 12/14/23 08:29, Vincent Guittot wrote:

On Thu, 14 Dec 2023 at 09:21, Lukasz Luba  wrote:


Hi Vincent,

I've been waiting for this feature, thanks!


On 12/12/23 14:27, Vincent Guittot wrote:
Following the consolidation and cleanup of CPU capacity in [1], this 
serie
reworks how the scheduler gets the pressures on CPUs. We need to 
take into
account all pressures applied by cpufreq on the compute capacity of 
a CPU

for dozens of ms or more and not only cpufreq cooling device or HW
mitigiations. we split the pressure applied on CPU's capacity in 2 
parts:

- one from cpufreq and freq_qos
- one from HW high freq mitigiation.

The next step will be to add a dedicated interface for long standing
capping of the CPU capacity (i.e. for seconds or more) like the
scaling_max_freq of cpufreq sysfs. The latter is already taken into
account by this serie but as a temporary pressure which is not 
always the

best choice when we know that it will happen for seconds or more.

[1] 
https://lore.kernel.org/lkml/20231211104855.558096-1-vincent.guit...@linaro.org/


Vincent Guittot (4):
    cpufreq: Add a cpufreq pressure feedback for the scheduler
    sched: Take cpufreq feedback into account
    thermal/cpufreq: Remove arch_update_thermal_pressure()
    sched: Rename arch_update_thermal_pressure into
  arch_update_hw_pressure

   arch/arm/include/asm/topology.h   |  6 +--
   arch/arm64/include/asm/topology.h |  6 +--
   drivers/base/arch_topology.c  | 26 -
   drivers/cpufreq/cpufreq.c | 48 +
   drivers/cpufreq/qcom-cpufreq-hw.c |  4 +-
   drivers/thermal/cpufreq_cooling.c |  3 --
   include/linux/arch_topology.h |  8 +--
   include/linux/cpufreq.h   | 10 
   include/linux/sched/topology.h    |  8 +--
   .../{thermal_pressure.h => hw_pressure.h} | 14 ++---
   include/trace/events/sched.h  |  2 +-
   init/Kconfig  | 12 ++---
   kernel/sched/core.c   |  8 +--
   kernel/sched/fair.c   | 53 
++-

   kernel/sched/pelt.c   | 18 +++
   kernel/sched/pelt.h   | 16 +++---
   kernel/sched/sched.h  |  4 +-
   17 files changed, 152 insertions(+), 94 deletions(-)
   rename include/trace/events/{thermal_pressure.h => hw_pressure.h} 
(55%)




I would like to test it, but something worries me. Why there is 0/5 in
this subject and only 4 patches?


I removed a patch from the series but copied/pasted the cover letter
subject without noticing the /5 instead of /4


OK





Could you tell me your base branch that I can apply this, please?


It applies on top of tip/sched/core + [1]
and you can find it here:
https://git.linaro.org/people/vincent.guittot/kernel.git/log/?h=sched/system-pressure


Thanks for the info and handy link.



I've tested your patches with: DTPM/PowerCap + thermal gov + cpufreq
sysfs scaling_max_freq. It works fine in all my cases (I couldn't cause
any issues). If you would like to test DTPM you will need 2 fixes pending
in Rafael's tree.

So, I'm looking forward to your v2 to continue reviewing it.



Re: [PATCH 3/4] thermal/cpufreq: Remove arch_update_thermal_pressure()

2023-12-15 Thread Lukasz Luba




On 12/12/23 14:27, Vincent Guittot wrote:

arch_update_thermal_pressure() aims to update fast changing signal which
should be averaged using PELT filtering before being provided to the
scheduler which can't make smart use of fast changing signal.
cpufreq now provides the maximum freq_qos pressure on the capacity to the
scheduler, which includes cpufreq cooling device. Remove the call to
arch_update_thermal_pressure() in cpufreq cooling device as this is
handled by cpufreq_get_pressure().

Signed-off-by: Vincent Guittot 
---
  drivers/thermal/cpufreq_cooling.c | 3 ---
  1 file changed, 3 deletions(-)

diff --git a/drivers/thermal/cpufreq_cooling.c 
b/drivers/thermal/cpufreq_cooling.c
index e2cc7bd30862..e77d3b44903e 100644
--- a/drivers/thermal/cpufreq_cooling.c
+++ b/drivers/thermal/cpufreq_cooling.c
@@ -448,7 +448,6 @@ static int cpufreq_set_cur_state(struct 
thermal_cooling_device *cdev,
 unsigned long state)
  {
struct cpufreq_cooling_device *cpufreq_cdev = cdev->devdata;
-   struct cpumask *cpus;
unsigned int frequency;
int ret;
  
@@ -465,8 +464,6 @@ static int cpufreq_set_cur_state(struct thermal_cooling_device *cdev,

ret = freq_qos_update_request(&cpufreq_cdev->qos_req, frequency);
if (ret >= 0) {
cpufreq_cdev->cpufreq_state = state;
-   cpus = cpufreq_cdev->policy->related_cpus;
-   arch_update_thermal_pressure(cpus, frequency);
ret = 0;
}
  


Reviewed-by: Lukasz Luba 



[PATCH] tracing: Add disable-filter-buf option

2023-12-15 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

Normally, when the filter is enabled, a temporary buffer is created to
copy the event data into it to perform the filtering logic. If the filter
passes and the event should be recorded, then the event is copied from the
temporary buffer into the ring buffer. If the event is to be discarded
then it is simply dropped. If another event comes in via an interrupt, it
will not use the temporary buffer as it is busy and will write directly
into the ring buffer.

The disable-filter-buf option will disable the temporary buffer and always
write into the ring buffer. This will avoid the copy when the event is to
be recorded, but also adds a bit more overhead on the discard, and if
another event were to interrupt the event that is to be discarded, then
the event will not be removed from the ring buffer but instead converted
to padding that will not be read by the reader. Padding will still take up
space on the ring buffer.

This option can be beneficial if most events are recorded and not
discarded, or simply for debugging the discard functionality of the ring
buffer.

Also fix some whitespace (that was fixed by editing this in vscode).

Signed-off-by: Steven Rostedt (Google) 
---
 Documentation/trace/ftrace.rst | 23 +
 kernel/trace/trace.c   | 37 --
 kernel/trace/trace.h   |  1 +
 3 files changed, 46 insertions(+), 15 deletions(-)

diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
index 23572f6697c0..7fe96da34962 100644
--- a/Documentation/trace/ftrace.rst
+++ b/Documentation/trace/ftrace.rst
@@ -1239,6 +1239,29 @@ Here are the available options:
When the free_buffer is closed, tracing will
stop (tracing_on set to 0).
 
+  disable-filter-buf
+   Normally, when the filter is enabled, a temporary buffer is
+   created to copy the event data into it to perform the
+   filtering logic. If the filter passes and the event should
+   be recorded, then the event is copied from the temporary
+   buffer into the ring buffer. If the event is to be discarded
+   then it is simply dropped. If another event comes in via
+   an interrupt, it will not use the temporary buffer as it is
+   busy and will write directly into the ring buffer.
+
+   This option will disable the temporary buffer and always
+   write into the ring buffer. This will avoid the copy when
+   the event is to be recorded, but also adds a bit more
+   overhead on the discard, and if another event were to interrupt
+   the event that is to be discarded, then the event will not
+   be removed from the ring buffer but instead converted to
+   padding that will not be read by the reader. Padding will
+   still take up space on the ring buffer.
+
+   This option can be beneficial if most events are recorded and
+   not discarded, or simply for debugging the discard functionality
+   of the ring buffer.
+
   irq-info
Shows the interrupt, preempt count, need resched data.
When disabled, the trace looks like::
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 199df497db07..41b674b1b809 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -5398,6 +5398,8 @@ int trace_keep_overwrite(struct tracer *tracer, u32 mask, 
int set)
return 0;
 }
 
+static int __tracing_set_filter_buffering(struct trace_array *tr, bool set);
+
 int set_tracer_flag(struct trace_array *tr, unsigned int mask, int enabled)
 {
int *map;
@@ -5451,6 +5453,9 @@ int set_tracer_flag(struct trace_array *tr, unsigned int 
mask, int enabled)
if (mask == TRACE_ITER_FUNC_FORK)
ftrace_pid_follow_fork(tr, enabled);
 
+   if (mask == TRACE_ITER_DISABLE_FILTER_BUF)
+   __tracing_set_filter_buffering(tr, enabled);
+
if (mask == TRACE_ITER_OVERWRITE) {
ring_buffer_change_overwrite(tr->array_buffer.buffer, enabled);
 #ifdef CONFIG_TRACER_MAX_TRACE
@@ -6464,7 +6469,7 @@ static void tracing_set_nop(struct trace_array *tr)
 {
if (tr->current_trace == _trace)
return;
-   
+
tr->current_trace->enabled--;
 
if (tr->current_trace->reset)
@@ -7534,27 +7539,29 @@ u64 tracing_event_time_stamp(struct trace_buffer 
*buffer, struct ring_buffer_eve
return ring_buffer_event_time_stamp(buffer, rbe);
 }
 
-/*
- * Set or disable using the per CPU trace_buffer_event when possible.
- */
-int tracing_set_filter_buffering(struct trace_array *tr, bool set)
+static int __tracing_set_filter_buffering(struct trace_array *tr, bool set)
 {
-   int ret = 0;
-
-   mutex_lock(_types_lock);
-
if (set && tr->no_filter_buffering_ref++)
-   goto out;
+   return 0;
 
if (!set) {
-   if (WARN_ON_ONCE(!tr->no_filter_buffering_ref)) {
-   ret = -EINVAL;
-   goto out;
- 

Re: ARM64 Livepatch based on SFrame

2023-12-15 Thread Madhavan T. Venkataraman



On 12/15/23 07:04, Mark Rutland wrote:
> On Thu, Dec 14, 2023 at 02:49:29PM -0600, Madhavan T. Venkataraman wrote:
>> Hi Mark,
> 
> Hi Madhavan,
> 
>> I attended your presentation in the LPC. You mentioned that you could use
>> some help with some pre-requisites for the Livepatch feature.
>> I would like to lend a hand.
> 
> Cool!
> 
> I've been meaning to send a mail round with a summary of the current state of
> things, and what needs to be done going forward, but I haven't had the time
> since LPC to put that together (as e.g. that requires learning some more about
> SFrame).  I'll be disappearing for the holiday shortly, and I intend to pick
> that up in the new year.
> 
>> What would you like me to implement?
> 
> I'm not currently sure exactly what we need/want to implement, and as above I
> think that needs to wait until the new year.
> 

OK.

> However, one thing that you can do that would be very useful is to write up 
> and
> report the GCC DWARF issues that you mentioned in:
> 
>   
> https://lore.kernel.org/linux-arm-kernel/20230202074036.507249-1-madve...@linux.microsoft.com/
> 
> ... as (depending on exactly what those are) those could also affect SFrame
> generation (and thus we'll need to get those fixed in GCC), and regardless it
> would be useful information to know.
> 
> I understood that you planned to do that from:
> 
>   
> https://lore.kernel.org/linux-arm-kernel/054ce0d6-70f0-b834-d4e5-1049c8df7...@linux.microsoft.com/
> 
> ... but I couldn't spot any relevant mails or issues in the GCC bugzilla, so
> either I'm failing to search hard enough, or did that get forgotten about?
> 

Yeah. I had notes on that. But I seem to have lost them. I need to reproduce the
problems and analyze them again which is not trivial. So, I have been 
procrastinating.

I am also disappearing for the rest of this year. I will try to look at it in 
the
new year.

>> I would also like to implement Unwind Hints for the feature. If you want a
>> specific format for the hints, let me know.
> 
> I will get back to you on that in the new year; I think the specifics we want
> are going to depend on other details above we need to analyse first.
> 

OK.

For now, I will implement something and send it out just for reference. We can 
revisit
this topic next year sometime.

Thanks.

Madhavan



[PATCH V2] remoteproc: virtio: Fix wdg cannot recovery remote processor

2023-12-15 Thread joakim . zhang
From: Joakim Zhang 

Recovering the remote processor failed when a wdg irq was received:
[0.842574] remoteproc remoteproc0: crash detected in cix-dsp-rproc: type 
watchdog
[0.842750] remoteproc remoteproc0: handling crash #1 in cix-dsp-rproc
[0.842824] remoteproc remoteproc0: recovering cix-dsp-rproc
[0.843342] remoteproc remoteproc0: stopped remote processor cix-dsp-rproc
[0.847901] rproc-virtio rproc-virtio.0.auto: Failed to associate buffer
[0.847979] remoteproc remoteproc0: failed to probe subdevices for 
cix-dsp-rproc: -16

The reason is that the dma coherent mem is not released when
recovering the remote processor, because rproc_virtio_remove(),
where the mem is released, is not called. The next attempt to
allocate and associate a buffer then fails.

We can see that the dma coherent mem is allocated from rproc_add_virtio_dev(),
so it should be released from rproc_remove_virtio_dev(). These functions should
appear symmetrically:
-rproc_vdev_do_start()->rproc_add_virtio_dev()->dma_declare_coherent_memory()
-rproc_vdev_do_stop()->rproc_remove_virtio_dev()->dma_release_coherent_memory()

The same for of_reserved_mem_device_init_by_idx() and 
of_reserved_mem_device_release().

Fixes: 1d7b61c06dc3 ("remoteproc: virtio: Create platform device for the 
remoteproc_virtio")
Signed-off-by: Joakim Zhang 
---
ChangeLogs:
V1->V2:
* the same for of_reserved_mem_device_release()
---
 drivers/remoteproc/remoteproc_virtio.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/remoteproc/remoteproc_virtio.c 
b/drivers/remoteproc/remoteproc_virtio.c
index 83d76915a6ad..e877ee78740d 100644
--- a/drivers/remoteproc/remoteproc_virtio.c
+++ b/drivers/remoteproc/remoteproc_virtio.c
@@ -465,8 +465,12 @@ static int rproc_add_virtio_dev(struct rproc_vdev *rvdev, 
int id)
 static int rproc_remove_virtio_dev(struct device *dev, void *data)
 {
struct virtio_device *vdev = dev_to_virtio(dev);
+   struct rproc_vdev *rvdev = vdev_to_rvdev(vdev);
 
unregister_virtio_device(vdev);
+   of_reserved_mem_device_release(&rvdev->pdev->dev);
+   dma_release_coherent_memory(&rvdev->pdev->dev);
+
return 0;
 }
 
@@ -584,9 +588,6 @@ static void rproc_virtio_remove(struct platform_device 
*pdev)
rproc_remove_subdev(rproc, &rvdev->subdev);
rproc_remove_rvdev(rvdev);
 
-   of_reserved_mem_device_release(&pdev->dev);
-   dma_release_coherent_memory(&pdev->dev);
-
put_device(>dev);
 }
 
-- 
2.25.1




Re: [PATCH vhost v2 4/8] vdpa/mlx5: Mark vq addrs for modification in hw vq

2023-12-15 Thread Dragos Tatulea
On Fri, 2023-12-15 at 12:35 +, Dragos Tatulea wrote:
> On Thu, 2023-12-14 at 19:30 +0100, Eugenio Perez Martin wrote:
> > On Thu, Dec 14, 2023 at 4:51 PM Dragos Tatulea  wrote:
> > > 
> > > On Thu, 2023-12-14 at 08:45 -0500, Michael S. Tsirkin wrote:
> > > > On Thu, Dec 14, 2023 at 01:39:55PM +, Dragos Tatulea wrote:
> > > > > On Tue, 2023-12-12 at 15:44 -0800, Si-Wei Liu wrote:
> > > > > > 
> > > > > > On 12/12/2023 11:21 AM, Eugenio Perez Martin wrote:
> > > > > > > On Tue, Dec 5, 2023 at 11:46 AM Dragos Tatulea 
> > > > > > >  wrote:
> > > > > > > > Addresses get set by .set_vq_address. hw vq addresses will be 
> > > > > > > > updated on
> > > > > > > > next modify_virtqueue.
> > > > > > > > 
> > > > > > > > Signed-off-by: Dragos Tatulea 
> > > > > > > > Reviewed-by: Gal Pressman 
> > > > > > > > Acked-by: Eugenio Pérez 
> > > > > > > I'm kind of ok with this patch and the next one about state, but I
> > > > > > > didn't ack them in the previous series.
> > > > > > > 
> > > > > > > My main concern is that it is not valid to change the vq address 
> > > > > > > after
> > > > > > > DRIVER_OK in VirtIO, which vDPA follows. Only memory maps are ok 
> > > > > > > to
> > > > > > > change at this moment. I'm not sure about vq state in vDPA, but 
> > > > > > > vhost
> > > > > > > forbids changing it with an active backend.
> > > > > > > 
> > > > > > > Suspend is not defined in VirtIO at this moment though, so maybe 
> > > > > > > it is
> > > > > > > ok to decide that all of these parameters may change during 
> > > > > > > suspend.
> > > > > > > Maybe the best thing is to protect this with a vDPA feature flag.
> > > > > > I think protect with vDPA feature flag could work, while on the 
> > > > > > other
> > > > > > hand vDPA means vendor specific optimization is possible around 
> > > > > > suspend
> > > > > > and resume (in case it helps performance), which doesn't have to be
> > > > > > backed by virtio spec. Same applies to vhost user backend features,
> > > > > > variations there were not backed by spec either. Of course, we 
> > > > > > should
> > > > > > try best to make the default behavior backward compatible with
> > > > > > virtio-based backend, but that circles back to no suspend 
> > > > > > definition in
> > > > > > the current virtio spec, for which I hope we don't cease 
> > > > > > development on
> > > > > > vDPA indefinitely. After all, the virtio based vdap backend can well
> > > > > > define its own feature flag to describe (minor difference in) the
> > > > > > suspend behavior based on the later spec once it is formed in 
> > > > > > future.
> > > > > > 
> > > > > So what is the way forward here? From what I understand the options 
> > > > > are:
> > > > > 
> > > > > 1) Add a vdpa feature flag for changing device properties while 
> > > > > suspended.
> > > > > 
> > > > > 2) Drop these 2 patches from the series for now. Not sure if this 
> > > > > makes sense as
> > > > > this. But then Si-Wei's qemu device suspend/resume poc [0] that 
> > > > > exercises this
> > > > > code won't work anymore. This means the series would be less well 
> > > > > tested.
> > > > > 
> > > > > Are there other possible options? What do you think?
> > > > > 
> > > > > [0] https://github.com/siwliu-kernel/qemu/tree/svq-resume-wip
> > > > 
> > > > I am fine with either of these.
> > > > 
> > > How about allowing the change only under the following conditions:
> > >   vhost_vdpa_can_suspend && vhost_vdpa_can_resume &&
> > > VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK is set
> > > 
> > > ?
> > 
> > I think the best option by far is 1, as there is no hint in the
> > combination of these 3 indicating that you can change device
> > properties in the suspended state.
> > 
> Sure. Will respin a v3 without these two patches.
> 
> Another series can implement option 2 and add these 2 patches on top.
Hmm... I misunderstood your statement and sent an erroneous v3. You said that
having a feature flag is the best option.

Will add a feature flag in v4: is this similar to the
VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK flag?

Thanks,
Dragos 

> > > Thanks,
> > > Dragos
> > > 
> > > > > Thanks,
> > > > > Dragos
> > > > > 
> > > > > > Regards,
> > > > > > -Siwei
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > Jason, what do you think?
> > > > > > > 
> > > > > > > Thanks!
> > > > > > > 
> > > > > > > > ---
> > > > > > > >   drivers/vdpa/mlx5/net/mlx5_vnet.c  | 9 +
> > > > > > > >   include/linux/mlx5/mlx5_ifc_vdpa.h | 1 +
> > > > > > > >   2 files changed, 10 insertions(+)
> > > > > > > > 
> > > > > > > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> > > > > > > > b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > > > > > > > index f8f088cced50..80e066de0866 100644
> > > > > > > > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > > > > > > > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > > > > > > > @@ -1209,6 +1209,7 @@ static int modify_virtqueue(struct 
> > > > > > > > mlx5_vdpa_net *ndev,
> > > > > > > >  bool 

Re: [PATCH vhost v3 0/6] vdpa/mlx5: Add support for resumable vqs

2023-12-15 Thread Dragos Tatulea
On Fri, 2023-12-15 at 16:01 +0200, Dragos Tatulea wrote:
> Add support for resumable vqs in the driver. This is a firmware feature
> that can be used for the following benefits:
> - Full device .suspend/.resume.
> - .set_map doesn't need to destroy and create new vqs anymore just to
>   update the map. When resumable vqs are supported it is enough to
>   suspend the vqs, set the new maps, and then resume the vqs.
> 
> The first patch exposes the relevant bits in mlx5_ifc.h. That means it
> needs to be applied to the mlx5-vhost tree [0] first. Once applied
> there, the change has to be pulled from mlx5-vhost into the vhost tree
> and only then the remaining patches can be applied. Same flow as the vq
> descriptor mappings patchset [1].
> 
> The second part adds support for resumable vqs in the form of a device .resume
> operation but also for the .set_map call (suspend/resume device instead
> of re-creating vqs with new mappings).
> 
> The last part of the series introduces reference counting for mrs which
> is necessary to avoid freeing mkeys too early or leaking them.
> 
> * Changes in v3:
> - Dropped patches that allowed vq modification of state and addresses
>   when state is DRIVER_OK. This is not allowed by the standard.
>   Should be re-added under a vdpa feature flag.
> 
> * Changes in v2:
> - Added mr refcounting patches.
> - Deleted unnecessary patch: "vdpa/mlx5: Split function into locked and
>   unlocked variants"
> - Small print improvement in "Introduce per vq and device resume"
>   patch.
> - Patch 1/7 has been applied to mlx5-vhost branch.
> 
> [0] 
> https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux.git/log/?h=mlx5-vhost
> [1] 
> https://lore.kernel.org/virtualization/20231018171456.1624030-2-dtatu...@nvidia.com/
> 
> 
> Dragos Tatulea (6):
>   vdpa/mlx5: Expose resumable vq capability
>   vdpa/mlx5: Allow modifying multiple vq fields in one modify command
>   vdpa/mlx5: Introduce per vq and device resume
>   vdpa/mlx5: Use vq suspend/resume during .set_map
>   vdpa/mlx5: Introduce reference counting to mrs
>   vdpa/mlx5: Add mkey leak detection
> 
>  drivers/vdpa/mlx5/core/mlx5_vdpa.h |  10 +-
>  drivers/vdpa/mlx5/core/mr.c|  69 +++---
>  drivers/vdpa/mlx5/net/mlx5_vnet.c  | 194 +
>  include/linux/mlx5/mlx5_ifc.h  |   3 +-
>  include/linux/mlx5/mlx5_ifc_vdpa.h |   1 +
>  5 files changed, 239 insertions(+), 38 deletions(-)
> 

Please disregard this version. I will send a v4. Sorry about the noise.


[PATCH vhost v3 6/6] vdpa/mlx5: Add mkey leak detection

2023-12-15 Thread Dragos Tatulea
Track allocated mrs in a list and show warning when leaks are detected
on device free or reset.

Reviewed-by: Gal Pressman 
Acked-by: Eugenio Pérez 
Signed-off-by: Dragos Tatulea 
---
 drivers/vdpa/mlx5/core/mlx5_vdpa.h |  2 ++
 drivers/vdpa/mlx5/core/mr.c| 23 +++
 drivers/vdpa/mlx5/net/mlx5_vnet.c  |  2 ++
 3 files changed, 27 insertions(+)

diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h 
b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
index 1a0d27b6e09a..50aac8fe57ef 100644
--- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h
+++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
@@ -37,6 +37,7 @@ struct mlx5_vdpa_mr {
bool user_mr;
 
refcount_t refcount;
+   struct list_head mr_list;
 };
 
 struct mlx5_vdpa_resources {
@@ -95,6 +96,7 @@ struct mlx5_vdpa_dev {
u32 generation;
 
struct mlx5_vdpa_mr *mr[MLX5_VDPA_NUM_AS];
+   struct list_head mr_list_head;
/* serialize mr access */
struct mutex mr_mtx;
struct mlx5_control_vq cvq;
diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c
index c7dc8914354a..4758914ccf86 100644
--- a/drivers/vdpa/mlx5/core/mr.c
+++ b/drivers/vdpa/mlx5/core/mr.c
@@ -508,6 +508,8 @@ static void _mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev 
*mvdev, struct mlx5_vdpa_
 
vhost_iotlb_free(mr->iotlb);
 
+   list_del(&mr->mr_list);
+
kfree(mr);
 }
 
@@ -560,12 +562,31 @@ void mlx5_vdpa_update_mr(struct mlx5_vdpa_dev *mvdev,
mutex_unlock(&mvdev->mr_mtx);
 }
 
+static void mlx5_vdpa_show_mr_leaks(struct mlx5_vdpa_dev *mvdev)
+{
+   struct mlx5_vdpa_mr *mr;
+
+   mutex_lock(&mvdev->mr_mtx);
+
+   list_for_each_entry(mr, &mvdev->mr_list_head, mr_list) {
+
+   mlx5_vdpa_warn(mvdev, "mkey still alive after resource delete: "
+ "mr: %p, mkey: 0x%x, refcount: %u\n",
+  mr, mr->mkey, refcount_read(&mr->refcount));
+   }
+
+   mutex_unlock(&mvdev->mr_mtx);
+
+}
+
 void mlx5_vdpa_destroy_mr_resources(struct mlx5_vdpa_dev *mvdev)
 {
for (int i = 0; i < MLX5_VDPA_NUM_AS; i++)
mlx5_vdpa_update_mr(mvdev, NULL, i);
 
prune_iotlb(mvdev->cvq.iotlb);
+
+   mlx5_vdpa_show_mr_leaks(mvdev);
 }
 
 static int _mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev,
@@ -592,6 +613,8 @@ static int _mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev,
if (err)
goto err_iotlb;
 
+   list_add_tail(&mr->mr_list, &mvdev->mr_list_head);
+
return 0;
 
 err_iotlb:
diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 0df82e4d13f4..88b633682e18 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -3707,6 +3707,8 @@ static int mlx5_vdpa_dev_add(struct vdpa_mgmt_dev 
*v_mdev, const char *name,
if (err)
goto err_mpfs;
 
+   INIT_LIST_HEAD(&mvdev->mr_list_head);
+
if (MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) {
err = mlx5_vdpa_create_dma_mr(mvdev);
if (err)
-- 
2.43.0




[PATCH vhost v3 5/6] vdpa/mlx5: Introduce reference counting to mrs

2023-12-15 Thread Dragos Tatulea
Deleting the old mr during mr update (.set_map) and then modifying the
vqs with the new mr is not a good flow for firmware. The firmware
expects that mkeys are deleted after there are no more vqs referencing
them.

Introduce reference counting for mrs to fix this. It is the only way to
make sure that mkeys are not in use by vqs.

An mr reference is taken when the mr is associated to the mr asid table
and when the mr is linked to the vq on create/modify. The reference is
released when the mkey is unlinked from the vq (through modify/destroy)
and from the mr asid table.

To make things consistent, get rid of mlx5_vdpa_destroy_mr and use
get/put semantics everywhere.

Reviewed-by: Gal Pressman 
Acked-by: Eugenio Pérez 
Signed-off-by: Dragos Tatulea 
---
 drivers/vdpa/mlx5/core/mlx5_vdpa.h |  8 +++--
 drivers/vdpa/mlx5/core/mr.c| 50 --
 drivers/vdpa/mlx5/net/mlx5_vnet.c  | 45 ++-
 3 files changed, 78 insertions(+), 25 deletions(-)

diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h 
b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
index 84547d998bcf..1a0d27b6e09a 100644
--- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h
+++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
@@ -35,6 +35,8 @@ struct mlx5_vdpa_mr {
struct vhost_iotlb *iotlb;
 
bool user_mr;
+
+   refcount_t refcount;
 };
 
 struct mlx5_vdpa_resources {
@@ -118,8 +120,10 @@ int mlx5_vdpa_destroy_mkey(struct mlx5_vdpa_dev *mvdev, 
u32 mkey);
 struct mlx5_vdpa_mr *mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev,
 struct vhost_iotlb *iotlb);
 void mlx5_vdpa_destroy_mr_resources(struct mlx5_vdpa_dev *mvdev);
-void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev,
- struct mlx5_vdpa_mr *mr);
+void mlx5_vdpa_get_mr(struct mlx5_vdpa_dev *mvdev,
+ struct mlx5_vdpa_mr *mr);
+void mlx5_vdpa_put_mr(struct mlx5_vdpa_dev *mvdev,
+ struct mlx5_vdpa_mr *mr);
 void mlx5_vdpa_update_mr(struct mlx5_vdpa_dev *mvdev,
 struct mlx5_vdpa_mr *mr,
 unsigned int asid);
diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c
index 2197c46e563a..c7dc8914354a 100644
--- a/drivers/vdpa/mlx5/core/mr.c
+++ b/drivers/vdpa/mlx5/core/mr.c
@@ -498,32 +498,52 @@ static void destroy_user_mr(struct mlx5_vdpa_dev *mvdev, 
struct mlx5_vdpa_mr *mr
 
 static void _mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev, struct 
mlx5_vdpa_mr *mr)
 {
+   if (WARN_ON(!mr))
+   return;
+
if (mr->user_mr)
destroy_user_mr(mvdev, mr);
else
destroy_dma_mr(mvdev, mr);
 
vhost_iotlb_free(mr->iotlb);
+
+   kfree(mr);
 }
 
-void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev,
- struct mlx5_vdpa_mr *mr)
+static void _mlx5_vdpa_put_mr(struct mlx5_vdpa_dev *mvdev,
+ struct mlx5_vdpa_mr *mr)
 {
if (!mr)
return;
 
+   if (refcount_dec_and_test(&mr->refcount))
+   _mlx5_vdpa_destroy_mr(mvdev, mr);
+}
+
+void mlx5_vdpa_put_mr(struct mlx5_vdpa_dev *mvdev,
+ struct mlx5_vdpa_mr *mr)
+{
mutex_lock(&mvdev->mr_mtx);
+   _mlx5_vdpa_put_mr(mvdev, mr);
+   mutex_unlock(&mvdev->mr_mtx);
+}
 
-   _mlx5_vdpa_destroy_mr(mvdev, mr);
+static void _mlx5_vdpa_get_mr(struct mlx5_vdpa_dev *mvdev,
+ struct mlx5_vdpa_mr *mr)
+{
+   if (!mr)
+   return;
 
-   for (int i = 0; i < MLX5_VDPA_NUM_AS; i++) {
-   if (mvdev->mr[i] == mr)
-   mvdev->mr[i] = NULL;
-   }
+   refcount_inc(&mr->refcount);
+}
 
+void mlx5_vdpa_get_mr(struct mlx5_vdpa_dev *mvdev,
+ struct mlx5_vdpa_mr *mr)
+{
+   mutex_lock(&mvdev->mr_mtx);
+   _mlx5_vdpa_get_mr(mvdev, mr);
+   mutex_unlock(&mvdev->mr_mtx);
-
-   kfree(mr);
 }
 
 void mlx5_vdpa_update_mr(struct mlx5_vdpa_dev *mvdev,
@@ -534,20 +554,16 @@ void mlx5_vdpa_update_mr(struct mlx5_vdpa_dev *mvdev,
 
mutex_lock(&mvdev->mr_mtx);
 
+   _mlx5_vdpa_put_mr(mvdev, old_mr);
mvdev->mr[asid] = new_mr;
-   if (old_mr) {
-   _mlx5_vdpa_destroy_mr(mvdev, old_mr);
-   kfree(old_mr);
-   }
 
mutex_unlock(&mvdev->mr_mtx);
-
 }
 
 void mlx5_vdpa_destroy_mr_resources(struct mlx5_vdpa_dev *mvdev)
 {
for (int i = 0; i < MLX5_VDPA_NUM_AS; i++)
-   mlx5_vdpa_destroy_mr(mvdev, mvdev->mr[i]);
+   mlx5_vdpa_update_mr(mvdev, NULL, i);
 
prune_iotlb(mvdev->cvq.iotlb);
 }
@@ -607,6 +623,8 @@ struct mlx5_vdpa_mr *mlx5_vdpa_create_mr(struct 
mlx5_vdpa_dev *mvdev,
if (err)
goto out_err;
 
+   refcount_set(&mr->refcount, 1);
+
return mr;
 
 out_err:
@@ -651,7 +669,7 @@ int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, 
unsigned int asid)
if (asid >= MLX5_VDPA_NUM_AS)
return -EINVAL;
 
-   

[PATCH vhost v3 4/6] vdpa/mlx5: Use vq suspend/resume during .set_map

2023-12-15 Thread Dragos Tatulea
Instead of tearing down and setting up vq resources, use vq
suspend/resume during .set_map to speed things up a bit.

The vq mr is updated with the new mapping while the vqs are suspended.

If the device doesn't support resumable vqs, do the old teardown and
setup dance.

Reviewed-by: Gal Pressman 
Acked-by: Eugenio Pérez 
Signed-off-by: Dragos Tatulea 
---
 drivers/vdpa/mlx5/net/mlx5_vnet.c  | 48 +-
 include/linux/mlx5/mlx5_ifc_vdpa.h |  1 +
 2 files changed, 41 insertions(+), 8 deletions(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index f8f088cced50..d5883c554333 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1206,9 +1206,11 @@ static int modify_virtqueue(struct mlx5_vdpa_net *ndev,
 {
int inlen = MLX5_ST_SZ_BYTES(modify_virtio_net_q_in);
u32 out[MLX5_ST_SZ_DW(modify_virtio_net_q_out)] = {};
+   struct mlx5_vdpa_dev *mvdev = &ndev->mvdev;
bool state_change = false;
void *obj_context;
void *cmd_hdr;
+   void *vq_ctx;
void *in;
int err;
 
@@ -1230,6 +1232,7 @@ static int modify_virtqueue(struct mlx5_vdpa_net *ndev,
MLX5_SET(general_obj_in_cmd_hdr, cmd_hdr, uid, ndev->mvdev.res.uid);
 
obj_context = MLX5_ADDR_OF(modify_virtio_net_q_in, in, obj_context);
+   vq_ctx = MLX5_ADDR_OF(virtio_net_q_object, obj_context, 
virtio_q_context);
 
if (mvq->modified_fields & MLX5_VIRTQ_MODIFY_MASK_STATE) {
if (!is_valid_state_change(mvq->fw_state, state, 
is_resumable(ndev))) {
@@ -1241,6 +1244,24 @@ static int modify_virtqueue(struct mlx5_vdpa_net *ndev,
state_change = true;
}
 
+   if (mvq->modified_fields & MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_MKEY) {
+   struct mlx5_vdpa_mr *mr = 
mvdev->mr[mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP]];
+
+   if (mr)
+   MLX5_SET(virtio_q, vq_ctx, virtio_q_mkey, mr->mkey);
+   else
+   mvq->modified_fields &= 
~MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_MKEY;
+   }
+
+   if (mvq->modified_fields & MLX5_VIRTQ_MODIFY_MASK_DESC_GROUP_MKEY) {
+   struct mlx5_vdpa_mr *mr = 
mvdev->mr[mvdev->group2asid[MLX5_VDPA_DATAVQ_DESC_GROUP]];
+
+   if (mr && MLX5_CAP_DEV_VDPA_EMULATION(mvdev->mdev, 
desc_group_mkey_supported))
+   MLX5_SET(virtio_q, vq_ctx, desc_group_mkey, mr->mkey);
+   else
+   mvq->modified_fields &= 
~MLX5_VIRTQ_MODIFY_MASK_DESC_GROUP_MKEY;
+   }
+
MLX5_SET64(virtio_net_q_object, obj_context, modify_field_select, 
mvq->modified_fields);
err = mlx5_cmd_exec(ndev->mvdev.mdev, in, inlen, out, sizeof(out));
if (err)
@@ -2767,24 +2788,35 @@ static int mlx5_vdpa_change_map(struct mlx5_vdpa_dev 
*mvdev,
unsigned int asid)
 {
struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev);
+   bool teardown = !is_resumable(ndev);
int err;
 
suspend_vqs(ndev);
-   err = save_channels_info(ndev);
-   if (err)
-   return err;
+   if (teardown) {
+   err = save_channels_info(ndev);
+   if (err)
+   return err;
 
-   teardown_driver(ndev);
+   teardown_driver(ndev);
+   }
 
mlx5_vdpa_update_mr(mvdev, new_mr, asid);
 
+   for (int i = 0; i < ndev->cur_num_vqs; i++)
+   ndev->vqs[i].modified_fields |= 
MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_MKEY |
+   
MLX5_VIRTQ_MODIFY_MASK_DESC_GROUP_MKEY;
+
if (!(mvdev->status & VIRTIO_CONFIG_S_DRIVER_OK) || mvdev->suspended)
return 0;
 
-   restore_channels_info(ndev);
-   err = setup_driver(mvdev);
-   if (err)
-   return err;
+   if (teardown) {
+   restore_channels_info(ndev);
+   err = setup_driver(mvdev);
+   if (err)
+   return err;
+   }
+
+   resume_vqs(ndev);
 
return 0;
 }
diff --git a/include/linux/mlx5/mlx5_ifc_vdpa.h 
b/include/linux/mlx5/mlx5_ifc_vdpa.h
index b86d51a855f6..8cbc2592d0f1 100644
--- a/include/linux/mlx5/mlx5_ifc_vdpa.h
+++ b/include/linux/mlx5/mlx5_ifc_vdpa.h
@@ -145,6 +145,7 @@ enum {
MLX5_VIRTQ_MODIFY_MASK_STATE= (u64)1 << 0,
MLX5_VIRTQ_MODIFY_MASK_DIRTY_BITMAP_PARAMS  = (u64)1 << 3,
MLX5_VIRTQ_MODIFY_MASK_DIRTY_BITMAP_DUMP_ENABLE = (u64)1 << 4,
+   MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_MKEY= (u64)1 << 11,
MLX5_VIRTQ_MODIFY_MASK_DESC_GROUP_MKEY  = (u64)1 << 14,
 };
 
-- 
2.43.0




[PATCH vhost v3 3/6] vdpa/mlx5: Introduce per vq and device resume

2023-12-15 Thread Dragos Tatulea
Implement vdpa vq and device resume if capability detected. Add support
for suspend -> ready state change.

Reviewed-by: Gal Pressman 
Acked-by: Eugenio Pérez 
Signed-off-by: Dragos Tatulea 
---
 drivers/vdpa/mlx5/net/mlx5_vnet.c | 69 +++
 1 file changed, 62 insertions(+), 7 deletions(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 1e08a8805640..f8f088cced50 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1170,7 +1170,12 @@ static int query_virtqueue(struct mlx5_vdpa_net *ndev, 
struct mlx5_vdpa_virtqueu
return err;
 }
 
-static bool is_valid_state_change(int oldstate, int newstate)
+static bool is_resumable(struct mlx5_vdpa_net *ndev)
+{
+   return ndev->mvdev.vdev.config->resume;
+}
+
+static bool is_valid_state_change(int oldstate, int newstate, bool resumable)
 {
switch (oldstate) {
case MLX5_VIRTIO_NET_Q_OBJECT_STATE_INIT:
@@ -1178,6 +1183,7 @@ static bool is_valid_state_change(int oldstate, int 
newstate)
case MLX5_VIRTIO_NET_Q_OBJECT_STATE_RDY:
return newstate == MLX5_VIRTIO_NET_Q_OBJECT_STATE_SUSPEND;
case MLX5_VIRTIO_NET_Q_OBJECT_STATE_SUSPEND:
+   return resumable ? newstate == 
MLX5_VIRTIO_NET_Q_OBJECT_STATE_RDY : false;
case MLX5_VIRTIO_NET_Q_OBJECT_STATE_ERR:
default:
return false;
@@ -1200,6 +1206,7 @@ static int modify_virtqueue(struct mlx5_vdpa_net *ndev,
 {
int inlen = MLX5_ST_SZ_BYTES(modify_virtio_net_q_in);
u32 out[MLX5_ST_SZ_DW(modify_virtio_net_q_out)] = {};
+   bool state_change = false;
void *obj_context;
void *cmd_hdr;
void *in;
@@ -1211,9 +1218,6 @@ static int modify_virtqueue(struct mlx5_vdpa_net *ndev,
if (!modifiable_virtqueue_fields(mvq))
return -EINVAL;
 
-   if (!is_valid_state_change(mvq->fw_state, state))
-   return -EINVAL;
-
in = kzalloc(inlen, GFP_KERNEL);
if (!in)
return -ENOMEM;
@@ -1226,17 +1230,29 @@ static int modify_virtqueue(struct mlx5_vdpa_net *ndev,
MLX5_SET(general_obj_in_cmd_hdr, cmd_hdr, uid, ndev->mvdev.res.uid);
 
obj_context = MLX5_ADDR_OF(modify_virtio_net_q_in, in, obj_context);
-   if (mvq->modified_fields & MLX5_VIRTQ_MODIFY_MASK_STATE)
+
+   if (mvq->modified_fields & MLX5_VIRTQ_MODIFY_MASK_STATE) {
+   if (!is_valid_state_change(mvq->fw_state, state, 
is_resumable(ndev))) {
+   err = -EINVAL;
+   goto done;
+   }
+
MLX5_SET(virtio_net_q_object, obj_context, state, state);
+   state_change = true;
+   }
 
MLX5_SET64(virtio_net_q_object, obj_context, modify_field_select, 
mvq->modified_fields);
err = mlx5_cmd_exec(ndev->mvdev.mdev, in, inlen, out, sizeof(out));
-   kfree(in);
-   if (!err)
+   if (err)
+   goto done;
+
+   if (state_change)
mvq->fw_state = state;
 
mvq->modified_fields = 0;
 
+done:
+   kfree(in);
return err;
 }
 
@@ -1430,6 +1446,24 @@ static void suspend_vqs(struct mlx5_vdpa_net *ndev)
suspend_vq(ndev, &ndev->vqs[i]);
 }
 
+static void resume_vq(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue 
*mvq)
+{
+   if (!mvq->initialized || !is_resumable(ndev))
+   return;
+
+   if (mvq->fw_state != MLX5_VIRTIO_NET_Q_OBJECT_STATE_SUSPEND)
+   return;
+
+   if (modify_virtqueue_state(ndev, mvq, 
MLX5_VIRTIO_NET_Q_OBJECT_STATE_RDY))
+   mlx5_vdpa_warn(&ndev->mvdev, "modify to resume failed for vq 
%u\n", mvq->index);
+}
+
+static void resume_vqs(struct mlx5_vdpa_net *ndev)
+{
+   for (int i = 0; i < ndev->mvdev.max_vqs; i++)
+   resume_vq(ndev, &ndev->vqs[i]);
+}
+
 static void teardown_vq(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue 
*mvq)
 {
if (!mvq->initialized)
@@ -3261,6 +3295,23 @@ static int mlx5_vdpa_suspend(struct vdpa_device *vdev)
return 0;
 }
 
+static int mlx5_vdpa_resume(struct vdpa_device *vdev)
+{
+   struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev);
+   struct mlx5_vdpa_net *ndev;
+
+   ndev = to_mlx5_vdpa_ndev(mvdev);
+
+   mlx5_vdpa_info(mvdev, "resuming device\n");
+
+   down_write(&ndev->reslock);
+   mvdev->suspended = false;
+   resume_vqs(ndev);
+   register_link_notifier(ndev);
+   up_write(&ndev->reslock);
+   return 0;
+}
+
 static int mlx5_set_group_asid(struct vdpa_device *vdev, u32 group,
   unsigned int asid)
 {
@@ -3317,6 +3368,7 @@ static const struct vdpa_config_ops mlx5_vdpa_ops = {
.get_vq_dma_dev = mlx5_get_vq_dma_dev,
.free = mlx5_vdpa_free,
.suspend = mlx5_vdpa_suspend,
+   .resume = mlx5_vdpa_resume, /* Op disabled if not supported. */
 };
 
 static int query_mtu(struct mlx5_core_dev *mdev, u16 *mtu)
@@ 

[PATCH vhost v3 2/6] vdpa/mlx5: Allow modifying multiple vq fields in one modify command

2023-12-15 Thread Dragos Tatulea
Add a bitmask variable that tracks hw vq field changes that
are supposed to be modified on next hw vq change command.

This will be useful to set multiple vq fields when resuming the vq.
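
Reduced to a minimal standalone model (the struct, field names and bit values
below are placeholders rather than the mlx5 ones), the idea is: accumulate
flags on the vq, then flush them all with a single modify command.

#include <stdint.h>
#include <stdio.h>

#define MODIFY_STATE   (1ull << 0)	/* placeholder bits */
#define MODIFY_MKEY    (1ull << 1)

struct vq {
	uint64_t modified_fields;	/* accumulated since the last modify */
};

/* One command carries every field flagged so far. */
static void modify_vq(struct vq *vq)
{
	printf("modify command, field_select=0x%llx\n",
	       (unsigned long long)vq->modified_fields);
	vq->modified_fields = 0;
}

int main(void)
{
	struct vq vq = { 0 };

	vq.modified_fields |= MODIFY_STATE;
	vq.modified_fields |= MODIFY_MKEY;
	modify_vq(&vq);		/* both fields changed with a single command */
	return 0;
}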

Reviewed-by: Gal Pressman 
Signed-off-by: Dragos Tatulea 
---
 drivers/vdpa/mlx5/net/mlx5_vnet.c | 48 +--
 1 file changed, 40 insertions(+), 8 deletions(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 26ba7da6b410..1e08a8805640 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -120,6 +120,9 @@ struct mlx5_vdpa_virtqueue {
u16 avail_idx;
u16 used_idx;
int fw_state;
+
+   u64 modified_fields;
+
struct msi_map map;
 
/* keep last in the struct */
@@ -1181,7 +1184,19 @@ static bool is_valid_state_change(int oldstate, int 
newstate)
}
 }
 
-static int modify_virtqueue(struct mlx5_vdpa_net *ndev, struct 
mlx5_vdpa_virtqueue *mvq, int state)
+static bool modifiable_virtqueue_fields(struct mlx5_vdpa_virtqueue *mvq)
+{
+   /* Only state is always modifiable */
+   if (mvq->modified_fields & ~MLX5_VIRTQ_MODIFY_MASK_STATE)
+   return mvq->fw_state == MLX5_VIRTIO_NET_Q_OBJECT_STATE_INIT ||
+  mvq->fw_state == MLX5_VIRTIO_NET_Q_OBJECT_STATE_SUSPEND;
+
+   return true;
+}
+
+static int modify_virtqueue(struct mlx5_vdpa_net *ndev,
+   struct mlx5_vdpa_virtqueue *mvq,
+   int state)
 {
int inlen = MLX5_ST_SZ_BYTES(modify_virtio_net_q_in);
u32 out[MLX5_ST_SZ_DW(modify_virtio_net_q_out)] = {};
@@ -1193,6 +1208,9 @@ static int modify_virtqueue(struct mlx5_vdpa_net *ndev, 
struct mlx5_vdpa_virtque
if (mvq->fw_state == MLX5_VIRTIO_NET_Q_OBJECT_NONE)
return 0;
 
+   if (!modifiable_virtqueue_fields(mvq))
+   return -EINVAL;
+
if (!is_valid_state_change(mvq->fw_state, state))
return -EINVAL;
 
@@ -1208,17 +1226,28 @@ static int modify_virtqueue(struct mlx5_vdpa_net *ndev, 
struct mlx5_vdpa_virtque
MLX5_SET(general_obj_in_cmd_hdr, cmd_hdr, uid, ndev->mvdev.res.uid);
 
obj_context = MLX5_ADDR_OF(modify_virtio_net_q_in, in, obj_context);
-   MLX5_SET64(virtio_net_q_object, obj_context, modify_field_select,
-  MLX5_VIRTQ_MODIFY_MASK_STATE);
-   MLX5_SET(virtio_net_q_object, obj_context, state, state);
+   if (mvq->modified_fields & MLX5_VIRTQ_MODIFY_MASK_STATE)
+   MLX5_SET(virtio_net_q_object, obj_context, state, state);
+
+   MLX5_SET64(virtio_net_q_object, obj_context, modify_field_select, 
mvq->modified_fields);
err = mlx5_cmd_exec(ndev->mvdev.mdev, in, inlen, out, sizeof(out));
kfree(in);
if (!err)
mvq->fw_state = state;
 
+   mvq->modified_fields = 0;
+
return err;
 }
 
+static int modify_virtqueue_state(struct mlx5_vdpa_net *ndev,
+ struct mlx5_vdpa_virtqueue *mvq,
+ unsigned int state)
+{
+   mvq->modified_fields |= MLX5_VIRTQ_MODIFY_MASK_STATE;
+   return modify_virtqueue(ndev, mvq, state);
+}
+
 static int counter_set_alloc(struct mlx5_vdpa_net *ndev, struct 
mlx5_vdpa_virtqueue *mvq)
 {
u32 in[MLX5_ST_SZ_DW(create_virtio_q_counters_in)] = {};
@@ -1347,7 +1376,7 @@ static int setup_vq(struct mlx5_vdpa_net *ndev, struct 
mlx5_vdpa_virtqueue *mvq)
goto err_vq;
 
if (mvq->ready) {
-   err = modify_virtqueue(ndev, mvq, 
MLX5_VIRTIO_NET_Q_OBJECT_STATE_RDY);
+   err = modify_virtqueue_state(ndev, mvq, 
MLX5_VIRTIO_NET_Q_OBJECT_STATE_RDY);
if (err) {
mlx5_vdpa_warn(&ndev->mvdev, "failed to modify to ready 
vq idx %d(%d)\n",
   idx, err);
@@ -1382,7 +1411,7 @@ static void suspend_vq(struct mlx5_vdpa_net *ndev, struct 
mlx5_vdpa_virtqueue *m
if (mvq->fw_state != MLX5_VIRTIO_NET_Q_OBJECT_STATE_RDY)
return;
 
-   if (modify_virtqueue(ndev, mvq, MLX5_VIRTIO_NET_Q_OBJECT_STATE_SUSPEND))
+   if (modify_virtqueue_state(ndev, mvq, 
MLX5_VIRTIO_NET_Q_OBJECT_STATE_SUSPEND))
mlx5_vdpa_warn(&ndev->mvdev, "modify to suspend failed\n");
 
if (query_virtqueue(ndev, mvq, &attr)) {
@@ -1407,6 +1436,7 @@ static void teardown_vq(struct mlx5_vdpa_net *ndev, 
struct mlx5_vdpa_virtqueue *
return;
 
suspend_vq(ndev, mvq);
+   mvq->modified_fields = 0;
destroy_virtqueue(ndev, mvq);
dealloc_vector(ndev, mvq);
counter_set_dealloc(ndev, mvq);
@@ -2207,7 +2237,7 @@ static void mlx5_vdpa_set_vq_ready(struct vdpa_device 
*vdev, u16 idx, bool ready
if (!ready) {
suspend_vq(ndev, mvq);
} else {
-   err = modify_virtqueue(ndev, mvq, 
MLX5_VIRTIO_NET_Q_OBJECT_STATE_RDY);
+   err = 

[PATCH mlx5-vhost v3 1/6] vdpa/mlx5: Expose resumable vq capability

2023-12-15 Thread Dragos Tatulea
Necessary for checking if resumable vqs are supported by the hardware.
Actual support will be added in a downstream patch.

Reviewed-by: Gal Pressman 
Acked-by: Eugenio Pérez 
Signed-off-by: Dragos Tatulea 
---
 include/linux/mlx5/mlx5_ifc.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 6f3631425f38..9eaceaf6bcb0 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -1236,7 +1236,8 @@ struct mlx5_ifc_virtio_emulation_cap_bits {
 
u8 reserved_at_c0[0x13];
u8 desc_group_mkey_supported[0x1];
-   u8 reserved_at_d4[0xc];
+   u8 freeze_to_rdy_supported[0x1];
+   u8 reserved_at_d5[0xb];
 
u8 reserved_at_e0[0x20];
 
-- 
2.43.0




[PATCH vhost v3 0/6] vdpa/mlx5: Add support for resumable vqs

2023-12-15 Thread Dragos Tatulea
Add support for resumable vqs in the driver. This is a firmware feature
that can be used for the following benefits:
- Full device .suspend/.resume.
- .set_map doesn't need to destroy and create new vqs anymore just to
  update the map. When resumable vqs are supported it is enough to
  suspend the vqs, set the new maps, and then resume the vqs.

The first patch exposes the relevant bits in mlx5_ifc.h. That means it
needs to be applied to the mlx5-vhost tree [0] first. Once applied
there, the change has to be pulled from mlx5-vhost into the vhost tree
and only then the remaining patches can be applied. Same flow as the vq
descriptor mappings patchset [1].

The second part adds support for resumable vqs in the form of a device .resume
operation but also for the .set_map call (suspend/resume device instead
of re-creating vqs with new mappings).

The last part of the series introduces reference counting for mrs which
is necessary to avoid freeing mkeys too early or leaking them.
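
The refcounting pattern, stripped down to a standalone model (plain integers
instead of refcount_t, no locking, placeholder names, so not the driver
code): an mr is created holding one reference, users take and drop
references, and the mkey is only destroyed on the final put.

#include <stdio.h>
#include <stdlib.h>

struct mr {
	int refcount;
};

static struct mr *create_mr(void)
{
	struct mr *mr = calloc(1, sizeof(*mr));

	mr->refcount = 1;		/* reference held by the asid slot */
	return mr;
}

static void get_mr(struct mr *mr)
{
	if (mr)
		mr->refcount++;
}

static void put_mr(struct mr *mr)
{
	if (!mr)
		return;
	if (--mr->refcount == 0) {
		printf("last reference dropped, destroy the mkey\n");
		free(mr);
	}
}

int main(void)
{
	struct mr *mr = create_mr();	/* starts at refcount 1 */

	get_mr(mr);			/* e.g. a vq still using this mapping */
	put_mr(mr);			/* the asid slot now points at a new mr */
	put_mr(mr);			/* last user gone, mkey destroyed here */
	return 0;
}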

* Changes in v3:
- Dropped patches that allowed vq modification of state and addresses
  when state is DRIVER_OK. This is not allowed by the standard.
  Should be re-added under a vdpa feature flag.

* Changes in v2:
- Added mr refcounting patches.
- Deleted unnecessary patch: "vdpa/mlx5: Split function into locked and
  unlocked variants"
- Small print improvement in "Introduce per vq and device resume"
  patch.
- Patch 1/7 has been applied to mlx5-vhost branch.

[0] 
https://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux.git/log/?h=mlx5-vhost
[1] 
https://lore.kernel.org/virtualization/20231018171456.1624030-2-dtatu...@nvidia.com/


Dragos Tatulea (6):
  vdpa/mlx5: Expose resumable vq capability
  vdpa/mlx5: Allow modifying multiple vq fields in one modify command
  vdpa/mlx5: Introduce per vq and device resume
  vdpa/mlx5: Use vq suspend/resume during .set_map
  vdpa/mlx5: Introduce reference counting to mrs
  vdpa/mlx5: Add mkey leak detection

 drivers/vdpa/mlx5/core/mlx5_vdpa.h |  10 +-
 drivers/vdpa/mlx5/core/mr.c|  69 +++---
 drivers/vdpa/mlx5/net/mlx5_vnet.c  | 194 +
 include/linux/mlx5/mlx5_ifc.h  |   3 +-
 include/linux/mlx5/mlx5_ifc_vdpa.h |   1 +
 5 files changed, 239 insertions(+), 38 deletions(-)

-- 
2.43.0




[PATCH] ring-buffer: Have rb_time_cmpxchg() set the msb counter too

2023-12-15 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The rb_time_cmpxchg() on 32-bit architectures requires setting three
32-bit words to represent the 64-bit timestamp, with some salt for
synchronization. Those are: msb, top, and bottom.

The issue is, the rb_time_cmpxchg() did not properly salt the msb portion,
and the msb that was written was stale.
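
As a standalone illustration of that layout (the 30-bit slice and 2-bit
counter sizes below are assumptions for illustration and may not match the
kernel exactly), the point is that the msb word has to be salted with the
same counter as top and bottom, which is what the hunk below adds:

#include <stdint.h>
#include <stdio.h>

#define SHIFT    30
#define VAL_MASK ((1u << SHIFT) - 1)

/* 30 bits of payload, 2-bit counter (the "salt") in the top bits */
static uint32_t val_cnt(uint32_t val, uint32_t cnt)
{
	return (val & VAL_MASK) | ((cnt & 3) << SHIFT);
}

static void split(uint64_t v, uint32_t *top, uint32_t *bottom, uint32_t *msb)
{
	*bottom = v & VAL_MASK;
	*top = (v >> SHIFT) & VAL_MASK;
	*msb = v >> (2 * SHIFT);
}

int main(void)
{
	uint64_t ts = 0x0123456789abcdefULL;
	uint32_t top, bottom, msb, cnt = 3;

	split(ts, &top, &bottom, &msb);

	/* every word, msb included, gets the same counter; an unsalted msb
	 * is exactly the stale value the patch below fixes */
	msb = val_cnt(msb, cnt);
	top = val_cnt(top, cnt);
	bottom = val_cnt(bottom, cnt);

	printf("msb=%08x top=%08x bottom=%08x\n", msb, top, bottom);
	return 0;
}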

Fixes: f03f2abce4f39 ("ring-buffer: Have 32 bit time stamps use all 64 bits")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 78aacc78f03a..e9c10eabdb95 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -722,10 +722,12 @@ static bool rb_time_cmpxchg(rb_time_t *t, u64 expect, u64 
set)
 cnt2 = cnt + 1;
 
rb_time_split(val, &top, &bottom, &msb);
+msb = rb_time_val_cnt(msb, cnt);
 top = rb_time_val_cnt(top, cnt);
 bottom = rb_time_val_cnt(bottom, cnt);
 
rb_time_split(set, &top2, &bottom2, &msb2);
+msb2 = rb_time_val_cnt(msb2, cnt);
 top2 = rb_time_val_cnt(top2, cnt2);
 bottom2 = rb_time_val_cnt(bottom2, cnt2);
 
-- 
2.42.0




[PATCH v2] ring-buffer: Remove useless update to write_stamp in rb_try_to_discard()

2023-12-15 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

When filtering is enabled, a temporary buffer is created to place the
content of the trace event output so that the filter logic can decide
from the trace event output if the trace event should be filtered out or
not. If it is to be filtered out, the content in the temporary buffer is
simply discarded, otherwise it is written into the trace buffer.

But if an interrupt were to come in while a previous event was using that
temporary buffer, the event written by the interrupt would actually go
into the ring buffer itself to prevent corrupting the data on the
temporary buffer. If the event is to be filtered out, the event in the
ring buffer is discarded, or if it fails to discard because another event
were to have already come in, it is turned into padding.

The update to the write_stamp in the rb_try_to_discard() happens after a
fix was made to force the next event after the discard to use an absolute
timestamp by setting the before_stamp to zero so it does not match the
write_stamp (which causes an event to use the absolute timestamp).

But there's an effort in rb_try_to_discard() to put back the write_stamp
to what it was before the event was added. But this is useless and
wasteful because nothing is going to be using that write_stamp for
calculations as it still will not match the before_stamp.

Remove this useless update, and in doing so, we remove another
cmpxchg64()!

Also update the comments to reflect this change as well as remove some
extra white space in another comment.

Fixes: b2dd797543cf ("ring-buffer: Force absolute timestamp on discard of 
event")
Signed-off-by: Steven Rostedt (Google) 
---
Changes since v1: 
https://lore.kernel.org/linux-trace-kernel/20231215074151.447e1...@rorschach.local.home/

- Once again I wrote this code on the commit that removes the 32-bit
  version of the rb_time_* functions. I need to move that commit to
  another branch.

 kernel/trace/ring_buffer.c | 47 +-
 1 file changed, 11 insertions(+), 36 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 2668dde23343..ad4af0cba159 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -2983,25 +2983,6 @@ static unsigned rb_calculate_event_length(unsigned 
length)
return length;
 }
 
-static u64 rb_time_delta(struct ring_buffer_event *event)
-{
-   switch (event->type_len) {
-   case RINGBUF_TYPE_PADDING:
-   return 0;
-
-   case RINGBUF_TYPE_TIME_EXTEND:
-   return rb_event_time_stamp(event);
-
-   case RINGBUF_TYPE_TIME_STAMP:
-   return 0;
-
-   case RINGBUF_TYPE_DATA:
-   return event->time_delta;
-   default:
-   return 0;
-   }
-}
-
 static inline bool
 rb_try_to_discard(struct ring_buffer_per_cpu *cpu_buffer,
  struct ring_buffer_event *event)
@@ -3009,8 +2990,6 @@ rb_try_to_discard(struct ring_buffer_per_cpu *cpu_buffer,
unsigned long new_index, old_index;
struct buffer_page *bpage;
unsigned long addr;
-   u64 write_stamp;
-   u64 delta;
 
new_index = rb_event_index(event);
old_index = new_index + rb_event_ts_length(event);
@@ -3019,14 +2998,10 @@ rb_try_to_discard(struct ring_buffer_per_cpu 
*cpu_buffer,
 
bpage = READ_ONCE(cpu_buffer->tail_page);
 
-   delta = rb_time_delta(event);
-
-   if (!rb_time_read(&cpu_buffer->write_stamp, &write_stamp))
-   return false;
-
-   /* Make sure the write stamp is read before testing the location */
-   barrier();
-
+   /*
+* Make sure the tail_page is still the same and
+* the next write location is the end of this event
+*/
if (bpage->page == (void *)addr && rb_page_write(bpage) == old_index) {
unsigned long write_mask =
local_read(&bpage->write) & ~RB_WRITE_MASK;
@@ -3037,20 +3012,20 @@ rb_try_to_discard(struct ring_buffer_per_cpu 
*cpu_buffer,
 * to make sure that the next event adds an absolute
 * value and does not rely on the saved write stamp, which
 * is now going to be bogus.
+*
+* By setting the before_stamp to zero, the next event
+* is not going to use the write_stamp and will instead
+* create an absolute timestamp. This means there's no
+* reason to update the write_stamp!
 */
rb_time_set(&cpu_buffer->before_stamp, 0);
 
-   /* Something came in, can't discard */
-   if (!rb_time_cmpxchg(&cpu_buffer->write_stamp,
-  write_stamp, write_stamp - delta))
-   return false;
-
/*
 * If an event were to come in now, it would see that the
 * write_stamp and the before_stamp are different, and assume
 * 

Re: [PATCH] ring-buffer: Remove useless update to write_stamp in rb_try_to_discard()

2023-12-15 Thread Steven Rostedt
On Fri, 15 Dec 2023 07:41:51 -0500
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> When filtering is enabled, a temporary buffer is created to place the
> content of the trace event output so that the filter logic can decide
> from the trace event output if the trace event should be filtered out or
> not. If it is to be filtered out, the content in the temporary buffer is
> simply discarded, otherwise it is written into the trace buffer.
> 
> But if an interrupt were to come in while a previous event was using that
> temporary buffer, the event written by the interrupt would actually go
> into the ring buffer itself to prevent corrupting the data on the
> temporary buffer. If the event is to be filtered out, the event in the
> ring buffer is discarded, or if it fails to discard because another event
> were to have already come in, it is turned into padding.
> 
> The update to the write_stamp in the rb_try_to_discard() happens after a
> fix was made to force the next event after the discard to use an absolute
> timestamp by setting the before_stamp to zero so it does not match the
> write_stamp (which causes an event to use the absolute timestamp).
> 
> But there's an effort in rb_try_to_discard() to put back the write_stamp
> to what it was before the event was added. But this is useless and
> wasteful because nothing is going to be using that write_stamp for
> calculations as it still will not match the before_stamp.
> 
> Remove this useless update, and in doing so, we remove another
> cmpxchg64()!

While debugging the timestamp issues, I found that the cmpxchg64 in the
discard was actually useless after commit b2dd797543cf (in the fixes
below). This removes the third of the 4 cmpxchg64 in the ring buffer
code. The last one has:

u64 ts;
/* SLOW PATH - Interrupted between A and C */
rb_time_read(&cpu_buffer->write_stamp, &info->after);
ts = rb_time_stamp(cpu_buffer->buffer);
barrier();
 /*E*/  if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) &&
info->after < ts &&
rb_time_cmpxchg(&cpu_buffer->write_stamp,
info->after, ts)) {
/* Nothing came after this event between C and E */
info->delta = ts - info->after;
} else {

That could actually be changed to:

 /*E*/  if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) &&
info->after < ts && info->after == 
READ_ONCE(cpu_buffer->write_stamp)){
/* Nothing came after this event between C and E */
info->delta = ts - info->after;

where the only difference is that the before_stamp and write_stamp will
not match making the next event add an absolute_timestamp.

This would get rid of all the cmpxchg64(). I could have:

/*
 * Returns true if ts == old_ts, and if possible will update with new_ts,
 * but ts is not guaranteed to be updated even if this returns true
 */
static bool rb_time_cmp_and_update(rb_time_t *ts, u64 old_ts, u64 new_ts)
{
#if HAVE_CMPXCHG64
return local64_cmpxchg(ts, old_ts, new_ts) == old_ts;
#else
return old_ts == READ_ONCE(*ts);
#endif
}

 /*E*/  if (write == (local_read(&tail_page->write) & RB_WRITE_MASK) &&
info->after < ts &&
rb_time_cmp_and_update(&cpu_buffer->write_stamp, 
info->after, ts)) {
/* Nothing came after this event between C and E */
info->delta = ts - info->after;


And this way we don't even need any CMPXCHG64. And for those that do
have it, it gets the benefit of not having to insert an absolute
timestamp for the next event. That's just a nice-to-have and not
critical for the logic.

-- Steve


> 
> Also update the comments to reflect this change as well as remove some
> extra white space in another comment.
> 
> Fixes: b2dd797543cf ("ring-buffer: Force absolute timestamp on discard of 
> event")
> Signed-off-by: Steven Rostedt (Google) 
> ---



Re: ARM64 Livepatch based on SFrame

2023-12-15 Thread Mark Rutland
On Thu, Dec 14, 2023 at 02:49:29PM -0600, Madhavan T. Venkataraman wrote:
> Hi Mark,

Hi Madhavan,

> I attended your presentation in the LPC. You mentioned that you could use
> some help with some pre-requisites for the Livepatch feature.
> I would like to lend a hand.

Cool!

I've been meaning to send a mail round with a summary of the current state of
things, and what needs to be done going forward, but I haven't had the time
since LPC to put that together (as e.g. that requires learning some more about
SFrame).  I'll be disappearing for the holiday shortly, and I intend to pick
that up in the new year.

> What would you like me to implement?

I'm not currently sure exactly what we need/want to implement, and as above I
think that needs to wait until the new year.

However, one thing that you can do that would be very useful is to write up and
report the GCC DWARF issues that you mentioned in:

  
https://lore.kernel.org/linux-arm-kernel/20230202074036.507249-1-madve...@linux.microsoft.com/

... as (depending on exactly what those are) those could also affect SFrame
generation (and thus we'll need to get those fixed in GCC), and regardless it
would be useful information to know.

I understood that you planned to do that from:

  
https://lore.kernel.org/linux-arm-kernel/054ce0d6-70f0-b834-d4e5-1049c8df7...@linux.microsoft.com/

... but I couldn't spot any relevant mails or issues in the GCC bugzilla, so
either I'm failing to search hard enough, or did that get forgotten about?

> I would also like to implement Unwind Hints for the feature. If you want a
> specific format for the hints, let me know.

I will get back to you on that in the new year; I think the specifics we want
are going to depend on other details above we need to analyse first.

Thanks,
Mark.



[PATCH] ring-buffer: Remove useless update to write_stamp in rb_try_to_discard()

2023-12-15 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

When filtering is enabled, a temporary buffer is created to place the
content of the trace event output so that the filter logic can decide
from the trace event output if the trace event should be filtered out or
not. If it is to be filtered out, the content in the temporary buffer is
simply discarded, otherwise it is written into the trace buffer.

But if an interrupt were to come in while a previous event was using that
temporary buffer, the event written by the interrupt would actually go
into the ring buffer itself to prevent corrupting the data on the
temporary buffer. If the event is to be filtered out, the event in the
ring buffer is discarded, or if it fails to discard because another event
were to have already come in, it is turned into padding.

The update to the write_stamp in the rb_try_to_discard() happens after a
fix was made to force the next event after the discard to use an absolute
timestamp by setting the before_stamp to zero so it does not match the
write_stamp (which causes an event to use the absolute timestamp).

But there's an effort in rb_try_to_discard() to put back the write_stamp
to what it was before the event was added. But this is useless and
wasteful because nothing is going to be using that write_stamp for
calculations as it still will not match the before_stamp.

Remove this useless update, and in doing so, we remove another
cmpxchg64()!

Also update the comments to reflect this change as well as remove some
extra white space in another comment.

Fixes: b2dd797543cf ("ring-buffer: Force absolute timestamp on discard of 
event")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 46 +-
 1 file changed, 11 insertions(+), 35 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 378ff9e53fe6..56b97a16ea16 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -2792,25 +2792,6 @@ static unsigned rb_calculate_event_length(unsigned 
length)
return length;
 }
 
-static u64 rb_time_delta(struct ring_buffer_event *event)
-{
-   switch (event->type_len) {
-   case RINGBUF_TYPE_PADDING:
-   return 0;
-
-   case RINGBUF_TYPE_TIME_EXTEND:
-   return rb_event_time_stamp(event);
-
-   case RINGBUF_TYPE_TIME_STAMP:
-   return 0;
-
-   case RINGBUF_TYPE_DATA:
-   return event->time_delta;
-   default:
-   return 0;
-   }
-}
-
 static inline bool
 rb_try_to_discard(struct ring_buffer_per_cpu *cpu_buffer,
  struct ring_buffer_event *event)
@@ -2818,8 +2799,6 @@ rb_try_to_discard(struct ring_buffer_per_cpu *cpu_buffer,
unsigned long new_index, old_index;
struct buffer_page *bpage;
unsigned long addr;
-   u64 write_stamp;
-   u64 delta;
 
new_index = rb_event_index(event);
old_index = new_index + rb_event_ts_length(event);
@@ -2828,13 +2807,10 @@ rb_try_to_discard(struct ring_buffer_per_cpu 
*cpu_buffer,
 
bpage = READ_ONCE(cpu_buffer->tail_page);
 
-   delta = rb_time_delta(event);
-
-   rb_time_read(&cpu_buffer->write_stamp, &write_stamp);
-
-   /* Make sure the write stamp is read before testing the location */
-   barrier();
-
+   /*
+* Make sure the tail_page is still the same and
+* the next write location is the end of this event
+*/
if (bpage->page == (void *)addr && rb_page_write(bpage) == old_index) {
unsigned long write_mask =
local_read(&bpage->write) & ~RB_WRITE_MASK;
@@ -2845,20 +2821,20 @@ rb_try_to_discard(struct ring_buffer_per_cpu 
*cpu_buffer,
 * to make sure that the next event adds an absolute
 * value and does not rely on the saved write stamp, which
 * is now going to be bogus.
+*
+* By setting the before_stamp to zero, the next event
+* is not going to use the write_stamp and will instead
+* create an absolute timestamp. This means there's no
+* reason to update the write_stamp!
 */
rb_time_set(&cpu_buffer->before_stamp, 0);
 
-   /* Something came in, can't discard */
-   if (!rb_time_cmpxchg(&cpu_buffer->write_stamp,
-  write_stamp, write_stamp - delta))
-   return false;
-
/*
 * If an event were to come in now, it would see that the
 * write_stamp and the before_stamp are different, and assume
 * that this event just added itself before updating
 * the write stamp. The interrupting event will fix the
-* write stamp for us, and use the before stamp as its delta.
+* write stamp for us, and use an absolute timestamp.
 */
 
   

Re: [PATCH vhost v2 4/8] vdpa/mlx5: Mark vq addrs for modification in hw vq

2023-12-15 Thread Dragos Tatulea
On Thu, 2023-12-14 at 19:30 +0100, Eugenio Perez Martin wrote:
> On Thu, Dec 14, 2023 at 4:51 PM Dragos Tatulea  wrote:
> > 
> > On Thu, 2023-12-14 at 08:45 -0500, Michael S. Tsirkin wrote:
> > > On Thu, Dec 14, 2023 at 01:39:55PM +, Dragos Tatulea wrote:
> > > > On Tue, 2023-12-12 at 15:44 -0800, Si-Wei Liu wrote:
> > > > > 
> > > > > On 12/12/2023 11:21 AM, Eugenio Perez Martin wrote:
> > > > > > On Tue, Dec 5, 2023 at 11:46 AM Dragos Tatulea 
> > > > > >  wrote:
> > > > > > > Addresses get set by .set_vq_address. hw vq addresses will be 
> > > > > > > updated on
> > > > > > > next modify_virtqueue.
> > > > > > > 
> > > > > > > Signed-off-by: Dragos Tatulea 
> > > > > > > Reviewed-by: Gal Pressman 
> > > > > > > Acked-by: Eugenio Pérez 
> > > > > > I'm kind of ok with this patch and the next one about state, but I
> > > > > > didn't ack them in the previous series.
> > > > > > 
> > > > > > My main concern is that it is not valid to change the vq address 
> > > > > > after
> > > > > > DRIVER_OK in VirtIO, which vDPA follows. Only memory maps are ok to
> > > > > > change at this moment. I'm not sure about vq state in vDPA, but 
> > > > > > vhost
> > > > > > forbids changing it with an active backend.
> > > > > > 
> > > > > > Suspend is not defined in VirtIO at this moment though, so maybe it 
> > > > > > is
> > > > > > ok to decide that all of these parameters may change during suspend.
> > > > > > Maybe the best thing is to protect this with a vDPA feature flag.
> > > > > I think protect with vDPA feature flag could work, while on the other
> > > > > hand vDPA means vendor specific optimization is possible around 
> > > > > suspend
> > > > > and resume (in case it helps performance), which doesn't have to be
> > > > > backed by virtio spec. Same applies to vhost user backend features,
> > > > > variations there were not backed by spec either. Of course, we should
> > > > > try best to make the default behavior backward compatible with
> > > > > virtio-based backend, but that circles back to no suspend definition 
> > > > > in
> > > > > the current virtio spec, for which I hope we don't cease development 
> > > > > on
> > > > > vDPA indefinitely. After all, the virtio based vdap backend can well
> > > > > define its own feature flag to describe (minor difference in) the
> > > > > suspend behavior based on the later spec once it is formed in future.
> > > > > 
> > > > So what is the way forward here? From what I understand the options are:
> > > > 
> > > > 1) Add a vdpa feature flag for changing device properties while 
> > > > suspended.
> > > > 
> > > > 2) Drop these 2 patches from the series for now. Not sure if this makes 
> > > > sense as
> > > > this. But then Si-Wei's qemu device suspend/resume poc [0] that 
> > > > exercises this
> > > > code won't work anymore. This means the series would be less well 
> > > > tested.
> > > > 
> > > > Are there other possible options? What do you think?
> > > > 
> > > > [0] https://github.com/siwliu-kernel/qemu/tree/svq-resume-wip
> > > 
> > > I am fine with either of these.
> > > 
> > How about allowing the change only under the following conditions:
> >   vhost_vdpa_can_suspend && vhost_vdpa_can_resume &&
> > VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK is set
> > 
> > ?
> 
> I think the best option by far is 1, as there is no hint in the
> combination of these 3 indicating that you can change device
> properties in the suspended state.
> 
Sure. Will respin a v3 without these two patches.

Another series can implement option 2 and add these 2 patches on top.

> > 
> > Thanks,
> > Dragos
> > 
> > > > Thanks,
> > > > Dragos
> > > > 
> > > > > Regards,
> > > > > -Siwei
> > > > > 
> > > > > 
> > > > > 
> > > > > > 
> > > > > > Jason, what do you think?
> > > > > > 
> > > > > > Thanks!
> > > > > > 
> > > > > > > ---
> > > > > > >   drivers/vdpa/mlx5/net/mlx5_vnet.c  | 9 +
> > > > > > >   include/linux/mlx5/mlx5_ifc_vdpa.h | 1 +
> > > > > > >   2 files changed, 10 insertions(+)
> > > > > > > 
> > > > > > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> > > > > > > b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > > > > > > index f8f088cced50..80e066de0866 100644
> > > > > > > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > > > > > > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > > > > > > @@ -1209,6 +1209,7 @@ static int modify_virtqueue(struct 
> > > > > > > mlx5_vdpa_net *ndev,
> > > > > > >  bool state_change = false;
> > > > > > >  void *obj_context;
> > > > > > >  void *cmd_hdr;
> > > > > > > +   void *vq_ctx;
> > > > > > >  void *in;
> > > > > > >  int err;
> > > > > > > 
> > > > > > > @@ -1230,6 +1231,7 @@ static int modify_virtqueue(struct 
> > > > > > > mlx5_vdpa_net *ndev,
> > > > > > >  MLX5_SET(general_obj_in_cmd_hdr, cmd_hdr, uid, 
> > > > > > > ndev->mvdev.res.uid);
> > > > > > > 
> > > > > > >  obj_context = MLX5_ADDR_OF(modify_virtio_net_q_in, in, 
> > > > > > > obj_context);
> > > > > > > +   

Re: [PATCH net-next v10 0/3] send credit update during setting SO_RCVLOWAT

2023-12-15 Thread patchwork-bot+netdevbpf
Hello:

This series was applied to netdev/net-next.git (main)
by David S. Miller :

On Thu, 14 Dec 2023 15:52:27 +0300 you wrote:
> Hello,
> 
>DESCRIPTION
> 
> This patchset fixes old problem with hungup of both rx/tx sides and adds
> test for it. This happens due to non-default SO_RCVLOWAT value and
> deferred credit update in virtio/vsock. Link to previous old patchset:
> https://lore.kernel.org/netdev/39b2e9fd-601b-189d-39a9-914e55745...@sberdevices.ru/
> 
> [...]

Here is the summary with links:
  - [net-next,v10,1/3] virtio/vsock: fix logic which reduces credit update 
messages
https://git.kernel.org/netdev/net-next/c/93b808876682
  - [net-next,v10,2/3] virtio/vsock: send credit update during setting 
SO_RCVLOWAT
https://git.kernel.org/netdev/net-next/c/0fe179896811
  - [net-next,v10,3/3] vsock/test: two tests to check credit update logic
https://git.kernel.org/netdev/net-next/c/542e893fbadc

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html