date:20231219

Re: [PATCH v3] virtio_pmem: support feature SHMEM_REGION

2023-12-19 Thread Michael S. Tsirkin

On Tue, Dec 19, 2023 at 11:32:27PM -0800, Changyuan Lyu wrote:
> On Tue, Dec 19, 2023 at 11:01 PM Michael S. Tsirkin  wrote:
> > 
> > This is not a great description. Please describe what the patch does.
> 
> Thanks for the feedback! Please see the v3 patch below.
> 
> ---8<---
> 
> This patch adds the support for feature VIRTIO_PMEM_F_SHMEM_REGION
> (virtio spec v1.2 section 5.19.5.2 [1]). Feature bit
> VIRTIO_PMEM_F_SHMEM_REGION is added to the driver feature
> table.
> 
> If the driver feature bit VIRTIO_PMEM_F_SHMEM_REGION is found,
> during probe, virtio pmem ignores the `start` and `size` fields in
> device config and looks for a shared memory region of id 0. The
> physical address range of the pmem is then determined by the physical
> address range of shared memory region 0.
> 
> [1] 
> https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html#x1-6480002
> 
> Signed-off-by: Changyuan Lyu 
> 
> ---
> v3:
>   * updated the patch description.
> V2:
>   * renamed VIRTIO_PMEM_SHMCAP_ID to VIRTIO_PMEM_SHMEM_REGION_ID
>   * fixed the error handling when region 0 does not exist
> ---
>  drivers/nvdimm/virtio_pmem.c | 30 ++
>  include/uapi/linux/virtio_pmem.h |  8 
>  2 files changed, 34 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
> index a92eb172f0e7..8e447c7558cb 100644
> --- a/drivers/nvdimm/virtio_pmem.c
> +++ b/drivers/nvdimm/virtio_pmem.c
> @@ -35,6 +35,8 @@ static int virtio_pmem_probe(struct virtio_device *vdev)
>   struct nd_region *nd_region;
>   struct virtio_pmem *vpmem;
>   struct resource res;
> + struct virtio_shm_region shm_reg;
> + bool have_shm;
>   int err = 0;
>  
>   if (!vdev->config->get) {
> @@ -57,10 +59,24 @@ static int virtio_pmem_probe(struct virtio_device *vdev)
>   goto out_err;
>   }
>  
> - virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
> - start, &vpmem->start);
> - virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
> - size, &vpmem->size);
> + if (virtio_has_feature(vdev, VIRTIO_PMEM_F_SHMEM_REGION)) {
> + have_shm = virtio_get_shm_region(vdev, &shm_reg,
> + (u8)VIRTIO_PMEM_SHMEM_REGION_ID);
> + if (!have_shm) {
> + dev_err(&vdev->dev, "failed to get shared memory region 
> %d\n",
> + VIRTIO_PMEM_SHMEM_REGION_ID);
> + err = -ENXIO;
> + goto out_vq;

Maybe additionally, add a validate callback and clear
VIRTIO_PMEM_F_SHMEM_REGION if VIRTIO_PMEM_SHMEM_REGION_ID is not there.




> + }
> + vpmem->start = shm_reg.addr;
> + vpmem->size = shm_reg.len;
> + } else {
> + virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
> + start, &vpmem->start);
> + virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
> + size, &vpmem->size);
> + }
> +
>  
>   res.start = vpmem->start;
>   res.end   = vpmem->start + vpmem->size - 1;
> @@ -122,7 +138,13 @@ static void virtio_pmem_remove(struct virtio_device 
> *vdev)
>   virtio_reset_device(vdev);
>  }
>  
> +static unsigned int features[] = {
> + VIRTIO_PMEM_F_SHMEM_REGION,
> +};
> +
>  static struct virtio_driver virtio_pmem_driver = {
> + .feature_table  = features,
> + .feature_table_size = ARRAY_SIZE(features),
>   .driver.name= KBUILD_MODNAME,
>   .driver.owner   = THIS_MODULE,
>   .id_table   = id_table,
> diff --git a/include/uapi/linux/virtio_pmem.h 
> b/include/uapi/linux/virtio_pmem.h
> index d676b3620383..c5e49b6e58b1 100644
> --- a/include/uapi/linux/virtio_pmem.h
> +++ b/include/uapi/linux/virtio_pmem.h
> @@ -14,6 +14,14 @@
>  #include 
>  #include 
>  
> +/* Feature bits */
> +#define VIRTIO_PMEM_F_SHMEM_REGION 0 /* guest physical address range will be
> +  * indicated as shared memory region 0
> +  */


Either make this comment shorter to fit in one line, or put the
multi-line comment before the define.

> +
> +/* shmid of the shared memory region corresponding to the pmem */
> +#define VIRTIO_PMEM_SHMEM_REGION_ID 0
> +
>  struct virtio_pmem_config {
>   __le64 start;
>   __le64 size;
> -- 
> 2.43.0.472.g3155946c3a-goog

[PATCH v3] virtio_pmem: support feature SHMEM_REGION

2023-12-19 Thread Changyuan Lyu

On Tue, Dec 19, 2023 at 11:01 PM Michael S. Tsirkin  wrote:
> 
> This is not a great description. Please describe what the patch does.

Thanks for the feedback! Please see the v3 patch below.

---8<---

This patch adds the support for feature VIRTIO_PMEM_F_SHMEM_REGION
(virtio spec v1.2 section 5.19.5.2 [1]). Feature bit
VIRTIO_PMEM_F_SHMEM_REGION is added to the driver feature
table.

If the driver feature bit VIRTIO_PMEM_F_SHMEM_REGION is found,
during probe, virtio pmem ignores the `start` and `size` fields in
device config and looks for a shared memory region of id 0. The
physical address range of the pmem is then determined by the physical
address range of shared memory region 0.

[1] 
https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html#x1-6480002

Signed-off-by: Changyuan Lyu 

---
v3:
  * updated the patch description.
V2:
  * renamed VIRTIO_PMEM_SHMCAP_ID to VIRTIO_PMEM_SHMEM_REGION_ID
  * fixed the error handling when region 0 does not exist
---
 drivers/nvdimm/virtio_pmem.c | 30 ++
 include/uapi/linux/virtio_pmem.h |  8 
 2 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
index a92eb172f0e7..8e447c7558cb 100644
--- a/drivers/nvdimm/virtio_pmem.c
+++ b/drivers/nvdimm/virtio_pmem.c
@@ -35,6 +35,8 @@ static int virtio_pmem_probe(struct virtio_device *vdev)
struct nd_region *nd_region;
struct virtio_pmem *vpmem;
struct resource res;
+   struct virtio_shm_region shm_reg;
+   bool have_shm;
int err = 0;
 
if (!vdev->config->get) {
@@ -57,10 +59,24 @@ static int virtio_pmem_probe(struct virtio_device *vdev)
goto out_err;
}
 
-   virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
-   start, &vpmem->start);
-   virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
-   size, &vpmem->size);
+   if (virtio_has_feature(vdev, VIRTIO_PMEM_F_SHMEM_REGION)) {
+   have_shm = virtio_get_shm_region(vdev, &shm_reg,
+   (u8)VIRTIO_PMEM_SHMEM_REGION_ID);
+   if (!have_shm) {
+   dev_err(&vdev->dev, "failed to get shared memory region 
%d\n",
+   VIRTIO_PMEM_SHMEM_REGION_ID);
+   err = -ENXIO;
+   goto out_vq;
+   }
+   vpmem->start = shm_reg.addr;
+   vpmem->size = shm_reg.len;
+   } else {
+   virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
+   start, &vpmem->start);
+   virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
+   size, &vpmem->size);
+   }
+
 
res.start = vpmem->start;
res.end   = vpmem->start + vpmem->size - 1;
@@ -122,7 +138,13 @@ static void virtio_pmem_remove(struct virtio_device *vdev)
virtio_reset_device(vdev);
 }
 
+static unsigned int features[] = {
+   VIRTIO_PMEM_F_SHMEM_REGION,
+};
+
 static struct virtio_driver virtio_pmem_driver = {
+   .feature_table  = features,
+   .feature_table_size = ARRAY_SIZE(features),
.driver.name= KBUILD_MODNAME,
.driver.owner   = THIS_MODULE,
.id_table   = id_table,
diff --git a/include/uapi/linux/virtio_pmem.h b/include/uapi/linux/virtio_pmem.h
index d676b3620383..c5e49b6e58b1 100644
--- a/include/uapi/linux/virtio_pmem.h
+++ b/include/uapi/linux/virtio_pmem.h
@@ -14,6 +14,14 @@
 #include 
 #include 
 
+/* Feature bits */
+#define VIRTIO_PMEM_F_SHMEM_REGION 0   /* guest physical address range will be
+* indicated as shared memory region 0
+*/
+
+/* shmid of the shared memory region corresponding to the pmem */
+#define VIRTIO_PMEM_SHMEM_REGION_ID 0
+
 struct virtio_pmem_config {
__le64 start;
__le64 size;
-- 
2.43.0.472.g3155946c3a-goog

Re: [PATCH v2] virtio_pmem: support feature SHMEM_REGION

2023-12-19 Thread Michael S. Tsirkin

On Tue, Dec 19, 2023 at 10:13:00PM -0800, Changyuan Lyu wrote:
> On Tue, Dec 19, 2023 at 7:57 PM Jason Wang  wrote:
> >
> > On Tue, Dec 19, 2023 at 3:19 PM Changyuan Lyu  wrote:
> > >
> > > +/* shmid of the shared memory region corresponding to the pmem */
> > > +#define VIRTIO_PMEM_SHMCAP_ID 0
> >
> > NIT: not a native speaker, but any reason for "CAP" here? Would it be
> > better to use SHMMEM_REGION_ID?
> 
> I was following the name VIRTIO_FS_SHMCAP_ID_CACHE in
> include/uapi/linux/virtio_fs.h, where I guess "CAP" was referring to
> the shared memory capability when the device is on PCI bus. I agree
> SHMMEM_REGION_ID is a better name.
> 
> On Tue, Dec 19, 2023 at 3:19 PM Changyuan Lyu  wrote:
> >
> > +   if (!have_shm) {
> > +   dev_err(&vdev->dev, "failed to get shared memory region 
> > %d\n",
> > +   VIRTIO_PMEM_SHMCAP_ID);
> > +   return -EINVAL;
> > +   }
> 
> I realized that it needs to jump to tag out_vq to clean up vqs
> instead of simply returnning an error.
> 
> Thanks for reviewing the patch!
> 
> ---8<---
> 
> As per virtio spec 1.2 section 5.19.5.2, if the feature
> VIRTIO_PMEM_F_SHMEM_REGION has been negotiated, the driver MUST query
> shared memory ID 0 for the physical address ranges.

This is not a great description. Please describe what the patch does.

> Signed-off-by: Changyuan Lyu 
> 
> ---
> V2:
>   * renamed VIRTIO_PMEM_SHMCAP_ID to VIRTIO_PMEM_SHMEM_REGION_ID
>   * fixed the error handling when region 0 does not exist
> ---
>  drivers/nvdimm/virtio_pmem.c | 30 ++
>  include/uapi/linux/virtio_pmem.h |  8 
>  2 files changed, 34 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
> index a92eb172f0e7..8e447c7558cb 100644
> --- a/drivers/nvdimm/virtio_pmem.c
> +++ b/drivers/nvdimm/virtio_pmem.c
> @@ -35,6 +35,8 @@ static int virtio_pmem_probe(struct virtio_device *vdev)
>   struct nd_region *nd_region;
>   struct virtio_pmem *vpmem;
>   struct resource res;
> + struct virtio_shm_region shm_reg;
> + bool have_shm;
>   int err = 0;
>  
>   if (!vdev->config->get) {
> @@ -57,10 +59,24 @@ static int virtio_pmem_probe(struct virtio_device *vdev)
>   goto out_err;
>   }
>  
> - virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
> - start, &vpmem->start);
> - virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
> - size, &vpmem->size);
> + if (virtio_has_feature(vdev, VIRTIO_PMEM_F_SHMEM_REGION)) {
> + have_shm = virtio_get_shm_region(vdev, &shm_reg,
> + (u8)VIRTIO_PMEM_SHMEM_REGION_ID);
> + if (!have_shm) {
> + dev_err(&vdev->dev, "failed to get shared memory region 
> %d\n",
> + VIRTIO_PMEM_SHMEM_REGION_ID);
> + err = -ENXIO;
> + goto out_vq;
> + }
> + vpmem->start = shm_reg.addr;
> + vpmem->size = shm_reg.len;
> + } else {
> + virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
> + start, &vpmem->start);
> + virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
> + size, &vpmem->size);
> + }
> +
>  
>   res.start = vpmem->start;
>   res.end   = vpmem->start + vpmem->size - 1;
> @@ -122,7 +138,13 @@ static void virtio_pmem_remove(struct virtio_device 
> *vdev)
>   virtio_reset_device(vdev);
>  }
>  
> +static unsigned int features[] = {
> + VIRTIO_PMEM_F_SHMEM_REGION,
> +};
> +
>  static struct virtio_driver virtio_pmem_driver = {
> + .feature_table  = features,
> + .feature_table_size = ARRAY_SIZE(features),
>   .driver.name= KBUILD_MODNAME,
>   .driver.owner   = THIS_MODULE,
>   .id_table   = id_table,
> diff --git a/include/uapi/linux/virtio_pmem.h 
> b/include/uapi/linux/virtio_pmem.h
> index d676b3620383..c5e49b6e58b1 100644
> --- a/include/uapi/linux/virtio_pmem.h
> +++ b/include/uapi/linux/virtio_pmem.h
> @@ -14,6 +14,14 @@
>  #include 
>  #include 
>  
> +/* Feature bits */
> +#define VIRTIO_PMEM_F_SHMEM_REGION 0 /* guest physical address range will be
> +  * indicated as shared memory region 0
> +  */
> +
> +/* shmid of the shared memory region corresponding to the pmem */
> +#define VIRTIO_PMEM_SHMEM_REGION_ID 0
> +
>  struct virtio_pmem_config {
>   __le64 start;
>   __le64 size;
> -- 
> 2.43.0.472.g3155946c3a-goog

[PATCH v2] virtio_pmem: support feature SHMEM_REGION

2023-12-19 Thread Changyuan Lyu

On Tue, Dec 19, 2023 at 7:57 PM Jason Wang  wrote:
>
> On Tue, Dec 19, 2023 at 3:19 PM Changyuan Lyu  wrote:
> >
> > +/* shmid of the shared memory region corresponding to the pmem */
> > +#define VIRTIO_PMEM_SHMCAP_ID 0
>
> NIT: not a native speaker, but any reason for "CAP" here? Would it be
> better to use SHMMEM_REGION_ID?

I was following the name VIRTIO_FS_SHMCAP_ID_CACHE in
include/uapi/linux/virtio_fs.h, where I guess "CAP" was referring to
the shared memory capability when the device is on PCI bus. I agree
SHMMEM_REGION_ID is a better name.

On Tue, Dec 19, 2023 at 3:19 PM Changyuan Lyu  wrote:
>
> + if (!have_shm) {
> + dev_err(&vdev->dev, "failed to get shared memory region 
> %d\n",
> + VIRTIO_PMEM_SHMCAP_ID);
> + return -EINVAL;
> + }

I realized that it needs to jump to tag out_vq to clean up vqs
instead of simply returnning an error.

Thanks for reviewing the patch!

---8<---

As per virtio spec 1.2 section 5.19.5.2, if the feature
VIRTIO_PMEM_F_SHMEM_REGION has been negotiated, the driver MUST query
shared memory ID 0 for the physical address ranges.

Signed-off-by: Changyuan Lyu 

---
V2:
  * renamed VIRTIO_PMEM_SHMCAP_ID to VIRTIO_PMEM_SHMEM_REGION_ID
  * fixed the error handling when region 0 does not exist
---
 drivers/nvdimm/virtio_pmem.c | 30 ++
 include/uapi/linux/virtio_pmem.h |  8 
 2 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
index a92eb172f0e7..8e447c7558cb 100644
--- a/drivers/nvdimm/virtio_pmem.c
+++ b/drivers/nvdimm/virtio_pmem.c
@@ -35,6 +35,8 @@ static int virtio_pmem_probe(struct virtio_device *vdev)
struct nd_region *nd_region;
struct virtio_pmem *vpmem;
struct resource res;
+   struct virtio_shm_region shm_reg;
+   bool have_shm;
int err = 0;
 
if (!vdev->config->get) {
@@ -57,10 +59,24 @@ static int virtio_pmem_probe(struct virtio_device *vdev)
goto out_err;
}
 
-   virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
-   start, &vpmem->start);
-   virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
-   size, &vpmem->size);
+   if (virtio_has_feature(vdev, VIRTIO_PMEM_F_SHMEM_REGION)) {
+   have_shm = virtio_get_shm_region(vdev, &shm_reg,
+   (u8)VIRTIO_PMEM_SHMEM_REGION_ID);
+   if (!have_shm) {
+   dev_err(&vdev->dev, "failed to get shared memory region 
%d\n",
+   VIRTIO_PMEM_SHMEM_REGION_ID);
+   err = -ENXIO;
+   goto out_vq;
+   }
+   vpmem->start = shm_reg.addr;
+   vpmem->size = shm_reg.len;
+   } else {
+   virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
+   start, &vpmem->start);
+   virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
+   size, &vpmem->size);
+   }
+
 
res.start = vpmem->start;
res.end   = vpmem->start + vpmem->size - 1;
@@ -122,7 +138,13 @@ static void virtio_pmem_remove(struct virtio_device *vdev)
virtio_reset_device(vdev);
 }
 
+static unsigned int features[] = {
+   VIRTIO_PMEM_F_SHMEM_REGION,
+};
+
 static struct virtio_driver virtio_pmem_driver = {
+   .feature_table  = features,
+   .feature_table_size = ARRAY_SIZE(features),
.driver.name= KBUILD_MODNAME,
.driver.owner   = THIS_MODULE,
.id_table   = id_table,
diff --git a/include/uapi/linux/virtio_pmem.h b/include/uapi/linux/virtio_pmem.h
index d676b3620383..c5e49b6e58b1 100644
--- a/include/uapi/linux/virtio_pmem.h
+++ b/include/uapi/linux/virtio_pmem.h
@@ -14,6 +14,14 @@
 #include 
 #include 
 
+/* Feature bits */
+#define VIRTIO_PMEM_F_SHMEM_REGION 0   /* guest physical address range will be
+* indicated as shared memory region 0
+*/
+
+/* shmid of the shared memory region corresponding to the pmem */
+#define VIRTIO_PMEM_SHMEM_REGION_ID 0
+
 struct virtio_pmem_config {
__le64 start;
__le64 size;
-- 
2.43.0.472.g3155946c3a-goog

[PATCH] tracing/synthetic: fix kernel-doc warnings

2023-12-19 Thread Randy Dunlap

scripts/kernel-doc warns about using @args: for variadic arguments to
functions. Documentation/doc-guide/kernel-doc.rst says that this should
be written as @...: instead, so update the source code to match that,
preventing the warnings.

trace_events_synth.c:1165: warning: Excess function parameter 'args' 
description in '__synth_event_gen_cmd_start'
trace_events_synth.c:1714: warning: Excess function parameter 'args' 
description in 'synth_event_trace'

Signed-off-by: Randy Dunlap 
Cc: Steven Rostedt 
Cc: Masami Hiramatsu 
Cc: Mathieu Desnoyers 
Cc: linux-trace-ker...@vger.kernel.org
---
 kernel/trace/trace_events_synth.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff -- a/kernel/trace/trace_events_synth.c b/kernel/trace/trace_events_synth.c
--- a/kernel/trace/trace_events_synth.c
+++ b/kernel/trace/trace_events_synth.c
@@ -1137,7 +1137,7 @@ EXPORT_SYMBOL_GPL(synth_event_add_fields
  * @cmd: A pointer to the dynevent_cmd struct representing the new event
  * @name: The name of the synthetic event
  * @mod: The module creating the event, NULL if not created from a module
- * @args: Variable number of arg (pairs), one pair for each field
+ * @...: Variable number of arg (pairs), one pair for each field
  *
  * NOTE: Users normally won't want to call this function directly, but
  * rather use the synth_event_gen_cmd_start() wrapper, which
@@ -1695,7 +1695,7 @@ __synth_event_trace_end(struct synth_eve
  * synth_event_trace - Trace a synthetic event
  * @file: The trace_event_file representing the synthetic event
  * @n_vals: The number of values in vals
- * @args: Variable number of args containing the event values
+ * @...: Variable number of args containing the event values
  *
  * Trace a synthetic event using the values passed in the variable
  * argument list.

[PATCH V4] remoteproc: qcom: q6v5: Get crash reason from specific SMEM partition

2023-12-19 Thread Vignesh Viswanathan

q6v5 fatal and watchdog IRQ handlers always retrieves the crash reason
information from SMEM global partition (QCOM_SMEM_HOST_ANY).

For some targets like IPQ9574 and IPQ5332, crash reason information is
present in target specific partition due to which the crash reason is
not printed in the current implementation.

Add support to pass crash_reason_partition along with crash_reason_item
number in qcom_q6v5_init call and use the same to get the crash
information from SMEM in fatal and watchdog IRQ handlers.

While at it, rename all instances of "crash_reason_smem" with
"crash_reason_item" as this reflects the proper meaning.

Signed-off-by: Vignesh Viswanathan 
---
Changes in V4: Rename all instances of crash_reason_smem to crash_reason_item
Changes in V3: Updated commit message.
Changes in V2: Addressed comments in V1.

This patch depends on [1] which adds support for IPQ9574 and IPQ5332
remoteproc q5v5_mpd driver.

[1]: 
https://lore.kernel.org/all/20231110091939.3025413-1-quic_mmani...@quicinc.com/

 drivers/remoteproc/qcom_q6v5.c  | 10 +++--
 drivers/remoteproc/qcom_q6v5.h  |  6 ++-
 drivers/remoteproc/qcom_q6v5_adsp.c | 17 +
 drivers/remoteproc/qcom_q6v5_mpd.c  | 13 ---
 drivers/remoteproc/qcom_q6v5_mss.c  |  5 ++-
 drivers/remoteproc/qcom_q6v5_pas.c  | 59 +++--
 drivers/remoteproc/qcom_q6v5_wcss.c | 12 +++---
 7 files changed, 66 insertions(+), 56 deletions(-)

diff --git a/drivers/remoteproc/qcom_q6v5.c b/drivers/remoteproc/qcom_q6v5.c
index 0e32f13c196d..e4a28bf25130 100644
--- a/drivers/remoteproc/qcom_q6v5.c
+++ b/drivers/remoteproc/qcom_q6v5.c
@@ -100,7 +100,7 @@ static irqreturn_t q6v5_wdog_interrupt(int irq, void *data)
return IRQ_HANDLED;
}
 
-   msg = qcom_smem_get(QCOM_SMEM_HOST_ANY, q6v5->crash_reason, &len);
+   msg = qcom_smem_get(q6v5->crash_reason_partition, 
q6v5->crash_reason_item, &len);
if (!IS_ERR(msg) && len > 0 && msg[0])
dev_err(q6v5->dev, "watchdog received: %s\n", msg);
else
@@ -121,7 +121,7 @@ irqreturn_t q6v5_fatal_interrupt(int irq, void *data)
if (!q6v5->running)
return IRQ_HANDLED;
 
-   msg = qcom_smem_get(QCOM_SMEM_HOST_ANY, q6v5->crash_reason, &len);
+   msg = qcom_smem_get(q6v5->crash_reason_partition, 
q6v5->crash_reason_item, &len);
if (!IS_ERR(msg) && len > 0 && msg[0])
dev_err(q6v5->dev, "fatal error received: %s\n", msg);
else
@@ -279,14 +279,16 @@ EXPORT_SYMBOL_GPL(qcom_q6v5_panic);
  * Return: 0 on success, negative errno on failure
  */
 int qcom_q6v5_init(struct qcom_q6v5 *q6v5, struct platform_device *pdev,
-  struct rproc *rproc, int crash_reason, const char 
*load_state,
+  struct rproc *rproc, int crash_reason_partition,
+  int crash_reason_item, const char *load_state,
   void (*handover)(struct qcom_q6v5 *q6v5))
 {
int ret;
 
q6v5->rproc = rproc;
q6v5->dev = &pdev->dev;
-   q6v5->crash_reason = crash_reason;
+   q6v5->crash_reason_partition = crash_reason_partition;
+   q6v5->crash_reason_item = crash_reason_item;
q6v5->handover = handover;
 
init_completion(&q6v5->start_done);
diff --git a/drivers/remoteproc/qcom_q6v5.h b/drivers/remoteproc/qcom_q6v5.h
index 4e1bb1a68284..cd02372e9856 100644
--- a/drivers/remoteproc/qcom_q6v5.h
+++ b/drivers/remoteproc/qcom_q6v5.h
@@ -40,7 +40,8 @@ struct qcom_q6v5 {
struct completion stop_done;
struct completion spawn_done;
 
-   int crash_reason;
+   int crash_reason_partition;
+   int crash_reason_item;
 
bool running;
 
@@ -49,7 +50,8 @@ struct qcom_q6v5 {
 };
 
 int qcom_q6v5_init(struct qcom_q6v5 *q6v5, struct platform_device *pdev,
-  struct rproc *rproc, int crash_reason, const char 
*load_state,
+  struct rproc *rproc, int crash_reason_partition,
+  int crash_reason_item, const char *load_state,
   void (*handover)(struct qcom_q6v5 *q6v5));
 void qcom_q6v5_deinit(struct qcom_q6v5 *q6v5);
 
diff --git a/drivers/remoteproc/qcom_q6v5_adsp.c 
b/drivers/remoteproc/qcom_q6v5_adsp.c
index 6c67514cc493..055764aa201c 100644
--- a/drivers/remoteproc/qcom_q6v5_adsp.c
+++ b/drivers/remoteproc/qcom_q6v5_adsp.c
@@ -62,7 +62,7 @@
 #define LPASS_EFUSE_Q6SS_EVB_SEL 0x0
 
 struct adsp_pil_data {
-   int crash_reason_smem;
+   int crash_reason_item;
const char *firmware_name;
 
const char *ssr_name;
@@ -98,7 +98,7 @@ struct qcom_adsp {
struct regmap *halt_map;
unsigned int halt_lpass;
 
-   int crash_reason_smem;
+   int crash_reason_item;
const char *info_name;
 
struct completion start_done;
@@ -731,8 +731,9 @@ static int adsp_probe(struct platform_device *pdev)
if (ret)
goto disable_pm;
 
-   ret = qcom_q6v5_init(&adsp->q6v5, pdev, rproc, desc->crash_reas

Re: [PATCH] modules: wait do_free_init correctly

2023-12-19 Thread Changbin Du

On Tue, Dec 19, 2023 at 01:52:03PM -0800, Luis Chamberlain wrote:
> On Tue, Dec 19, 2023 at 12:51:51PM -0800, Andrew Morton wrote:
> > On Tue, 19 Dec 2023 22:12:31 +0800 Changbin Du  
> > wrote:
> > 
> > > The commit 1a7b7d922081 ("modules: Use vmalloc special flag") moves
> > > do_free_init() into a global workqueue instead of call_rcu(). So now
> > > we should wait it via flush_work().
> > 
> > What are the runtime effects of this change?
> 
> Indeed that's needed given how old this culprit commit is:
> 
> git describe --contains 1a7b7d922081
> v5.2-rc1~192^2~5
> 
> Who did this work and for what reason? What triggered this itch?
>
Seems the waiting was introduced by commit ae646f0b9ca ("init: fix false 
positives
in W+X checking").

As what I have observed, mark_readonly() is only invoked by the first user mode
thread function kernel_init(), which is before userspace /init. So is it real
possible we have loaded modules at this point?

Cc Jeffrey Hugo 
> Is it perhaps for an out of tree driver that did something funky
> on its module exit?
> 
> As per Documentation/RCU/rcubarrier.rst rcu_barrier will ensure the
> callbacks complete, so interms of determinism both mechanisms will
> have waited for the free. It seems we're now just limiting the scope.
> 
> This could also mean initialization grew used to having RCU calls on
> init complete at this point in time, even for modules, and so localizing
> this wait may now also introduce other unexpected behaviour.
> 
>   Luis

-- 
Cheers,
Changbin Du

Re: [PATCH RFC 0/4] virtio-net: add tx-hash, rx-tstamp, tx-tstamp and tx-time

2023-12-19 Thread Willem de Bruijn

Jason Wang wrote:
> On Tue, Dec 19, 2023 at 12:36 AM Willem de Bruijn
>  wrote:
> >
> > Steffen Trumtrar wrote:
> > > This series tries to pick up the work on the virtio-net timestamping
> > > feature from Willem de Bruijn.
> > >
> > > Original series
> > > Message-Id: 20210208185558.995292-1-willemdebruijn.ker...@gmail.com
> > > Subject: [PATCH RFC v2 0/4] virtio-net: add tx-hash, rx-tstamp,
> > > tx-tstamp and tx-time
> > > From: Willem de Bruijn 
> > >
> > > RFC for four new features to the virtio network device:
> > >
> > > 1. pass tx flow state to host, for routing + telemetry
> > > 2. pass rx tstamp to guest, for better RTT estimation
> > > 3. pass tx tstamp to guest, idem
> > > 3. pass tx delivery time to host, for accurate pacing
> > >
> > > All would introduce an extension to the virtio spec.
> > >
> > > The original series consisted of a hack around the DMA API, which should
> > > be fixed in this series.
> > >
> > > The changes in this series are to the driver side. For the changes to 
> > > qemu see:
> > > https://github.com/strumtrar/qemu/tree/v8.1.1/virtio-net-ptp
> > >
> > > Currently only virtio-net is supported. The original series used
> > > vhost-net as backend. However, the path through tun via sendmsg doesn't
> > > allow us to write data back to the driver side without any hacks.
> > > Therefore use the way via plain virtio-net without vhost albeit better
> > > performance.
> > >
> > > Signed-off-by: Steffen Trumtrar 
> >
> > Thanks for picking this back up, Steffen. Nice to see that the code still
> > applies mostly cleanly.
> >
> > For context: I dropped the work only because I had no real device
> > implementation. The referenced patch series to qemu changes that.
> >
> > I suppose the main issue is the virtio API changes that this introduces,
> > which will have to be accepted to the spec.
> >
> > One small comment to patch 4: there I just assumed the virtual device
> > time is CLOCK_TAI. There is a concurrent feature under review for HW
> > pacing offload with AF_XDP sockets. The clock issue comes up a bit. In
> > general, for hardware we cannot assume a clock.
> 
> Any reason for this? E.g some modern NIC have PTP support.

I meant that we cannot assume a specific clock, if aiming to offload
existing pacing (or "launch time") methods.

The issue discussed in the AF_XDP thread is whether to use CLOCK_TAI
or CLOCK_MONOTONIC. Both of which are already in use in software
pacing offload, in the ETF and FQ qdiscs, respectively.

But for virtio it may be acceptable to restrict to one clock, such as
CLOCK_REALTIME or CLOCK_TAI.

CLOCK_MONOTONIC being boottime is almost certain to have an offset.
Even if the clocks' rates are synchronized with phc2sys.

> > For virtio, perhaps
> > assuming the same monotonic hardware clock in guest and host can be
> > assumed.
> 
> Note that virtio can be implemented in hardware now. So we can assume
> things like the kvm ptp clock.
> 
> > But this clock alignment needs some thought.
> >
> 
> Thanks
>

Re: [PATCH vhost v4 02/15] vdpa: Add VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND flag

2023-12-19 Thread Jason Wang

On Wed, Dec 20, 2023 at 11:46 AM Jason Wang  wrote:
>
> On Wed, Dec 20, 2023 at 2:09 AM Dragos Tatulea  wrote:
> >
> > The virtio spec doesn't allow changing virtqueue addresses after
> > DRIVER_OK. Some devices do support this operation when the device is
> > suspended. The VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND flag
> > advertises this support as a backend features.
>
> There's an ongoing effort in virtio spec to introduce the suspend state.
>
> So I wonder if it's better to just allow such behaviour?

Actually I mean, allow drivers to modify the parameters during suspend
without a new feature.

Thanks

>
> Thanks
>
>

Re: [PATCH] vdpa: Fix an error handling path in eni_vdpa_probe()

2023-12-19 Thread Jason Wang

On Fri, Dec 8, 2023 at 5:14 AM Christophe JAILLET
 wrote:
>
> Le 20/10/2022 à 21:21, Christophe JAILLET a écrit :
> > After a successful vp_legacy_probe() call, vp_legacy_remove() should be
> > called in the error handling path, as already done in the remove function.
> >
> > Add the missing call.
> >
> > Fixes: e85087beedca ("eni_vdpa: add vDPA driver for Alibaba ENI")
> > Signed-off-by: Christophe JAILLET 
> > ---
> >   drivers/vdpa/alibaba/eni_vdpa.c | 6 --
> >   1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/vdpa/alibaba/eni_vdpa.c 
> > b/drivers/vdpa/alibaba/eni_vdpa.c
> > index 5a09a09cca70..cce3d1837104 100644
> > --- a/drivers/vdpa/alibaba/eni_vdpa.c
> > +++ b/drivers/vdpa/alibaba/eni_vdpa.c
> > @@ -497,7 +497,7 @@ static int eni_vdpa_probe(struct pci_dev *pdev, const 
> > struct pci_device_id *id)
> >   if (!eni_vdpa->vring) {
> >   ret = -ENOMEM;
> >   ENI_ERR(pdev, "failed to allocate virtqueues\n");
> > - goto err;
> > + goto err_remove_vp_legacy;
> >   }
> >
> >   for (i = 0; i < eni_vdpa->queues; i++) {
> > @@ -509,11 +509,13 @@ static int eni_vdpa_probe(struct pci_dev *pdev, const 
> > struct pci_device_id *id)
> >   ret = vdpa_register_device(&eni_vdpa->vdpa, eni_vdpa->queues);
> >   if (ret) {
> >   ENI_ERR(pdev, "failed to register to vdpa bus\n");
> > - goto err;
> > + goto err_remove_vp_legacy;
> >   }
> >
> >   return 0;
> >
> > +err_remove_vp_legacy:
> > + vp_legacy_remove(&eni_vdpa->ldev);
> >   err:
> >   put_device(&eni_vdpa->vdpa.dev);
> >   return ret;
>
> Polite reminder on a (very) old patch.

Acked-by: Jason Wang 

Thanks

>
> CJ
>

Re: [PATCH] virtio_pmem: support feature SHMEM_REGION

2023-12-19 Thread Jason Wang

On Tue, Dec 19, 2023 at 3:19 PM Changyuan Lyu  wrote:
>
> As per virtio spec 1.2 section 5.19.5.2, if the feature
> VIRTIO_PMEM_F_SHMEM_REGION has been negotiated, the driver MUST query
> shared memory ID 0 for the physical address ranges.
>
> Signed-off-by: Changyuan Lyu 
> ---
>  drivers/nvdimm/virtio_pmem.c | 29 +
>  include/uapi/linux/virtio_pmem.h |  8 
>  2 files changed, 33 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
> index a92eb172f0e7..5b28d543728b 100644
> --- a/drivers/nvdimm/virtio_pmem.c
> +++ b/drivers/nvdimm/virtio_pmem.c
> @@ -35,6 +35,8 @@ static int virtio_pmem_probe(struct virtio_device *vdev)
> struct nd_region *nd_region;
> struct virtio_pmem *vpmem;
> struct resource res;
> +   struct virtio_shm_region shm_reg;
> +   bool have_shm;
> int err = 0;
>
> if (!vdev->config->get) {
> @@ -57,10 +59,23 @@ static int virtio_pmem_probe(struct virtio_device *vdev)
> goto out_err;
> }
>
> -   virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
> -   start, &vpmem->start);
> -   virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
> -   size, &vpmem->size);
> +   if (virtio_has_feature(vdev, VIRTIO_PMEM_F_SHMEM_REGION)) {
> +   have_shm = virtio_get_shm_region(vdev, &shm_reg,
> +   (u8)VIRTIO_PMEM_SHMCAP_ID);
> +   if (!have_shm) {
> +   dev_err(&vdev->dev, "failed to get shared memory 
> region %d\n",
> +   VIRTIO_PMEM_SHMCAP_ID);
> +   return -EINVAL;
> +   }
> +   vpmem->start = shm_reg.addr;
> +   vpmem->size = shm_reg.len;
> +   } else {
> +   virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
> +   start, &vpmem->start);
> +   virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
> +   size, &vpmem->size);
> +   }
> +
>
> res.start = vpmem->start;
> res.end   = vpmem->start + vpmem->size - 1;
> @@ -122,7 +137,13 @@ static void virtio_pmem_remove(struct virtio_device 
> *vdev)
> virtio_reset_device(vdev);
>  }
>
> +static unsigned int features[] = {
> +   VIRTIO_PMEM_F_SHMEM_REGION,
> +};
> +
>  static struct virtio_driver virtio_pmem_driver = {
> +   .feature_table  = features,
> +   .feature_table_size = ARRAY_SIZE(features),
> .driver.name= KBUILD_MODNAME,
> .driver.owner   = THIS_MODULE,
> .id_table   = id_table,
> diff --git a/include/uapi/linux/virtio_pmem.h 
> b/include/uapi/linux/virtio_pmem.h
> index d676b3620383..025174f6eacf 100644
> --- a/include/uapi/linux/virtio_pmem.h
> +++ b/include/uapi/linux/virtio_pmem.h
> @@ -14,6 +14,14 @@
>  #include 
>  #include 
>
> +/* Feature bits */
> +#define VIRTIO_PMEM_F_SHMEM_REGION 0   /* guest physical address range will 
> be
> +* indicated as shared memory region 0
> +*/
> +
> +/* shmid of the shared memory region corresponding to the pmem */
> +#define VIRTIO_PMEM_SHMCAP_ID 0

NIT: not a native speaker, but any reason for "CAP" here? Would it be
better to use SHMMEM_REGION_ID?

Thanks

> +
>  struct virtio_pmem_config {
> __le64 start;
> __le64 size;
> --
> 2.43.0.472.g3155946c3a-goog
>

Re: [PATCH] vdpa: Remove usage of the deprecated ida_simple_xx() API

2023-12-19 Thread Jason Wang

On Mon, Dec 11, 2023 at 1:52 AM Christophe JAILLET
 wrote:
>
> ida_alloc() and ida_free() should be preferred to the deprecated
> ida_simple_get() and ida_simple_remove().
>
> This is less verbose.
>
> Signed-off-by: Christophe JAILLET 

Acked-by: Jason Wang 

Thanks

> ---
>  drivers/vdpa/vdpa.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
> index a7612e0783b3..d0695680b282 100644
> --- a/drivers/vdpa/vdpa.c
> +++ b/drivers/vdpa/vdpa.c
> @@ -131,7 +131,7 @@ static void vdpa_release_dev(struct device *d)
> if (ops->free)
> ops->free(vdev);
>
> -   ida_simple_remove(&vdpa_index_ida, vdev->index);
> +   ida_free(&vdpa_index_ida, vdev->index);
> kfree(vdev->driver_override);
> kfree(vdev);
>  }
> @@ -205,7 +205,7 @@ struct vdpa_device *__vdpa_alloc_device(struct device 
> *parent,
> return vdev;
>
>  err_name:
> -   ida_simple_remove(&vdpa_index_ida, vdev->index);
> +   ida_free(&vdpa_index_ida, vdev->index);
>  err_ida:
> kfree(vdev);
>  err:
> --
> 2.34.1
>

Re: [PATCH v5 2/4] vduse: Temporarily disable control queue features

2023-12-19 Thread Jason Wang

On Mon, Dec 18, 2023 at 5:21 PM Maxime Coquelin
 wrote:
>
>
>
> On 12/18/23 03:50, Jason Wang wrote:
> > On Wed, Dec 13, 2023 at 7:23 PM Maxime Coquelin
> >  wrote:
> >>
> >> Hi Jason,
> >>
> >> On 12/13/23 05:52, Jason Wang wrote:
> >>> On Tue, Dec 12, 2023 at 9:17 PM Maxime Coquelin
> >>>  wrote:
> 
>  Virtio-net driver control queue implementation is not safe
>  when used with VDUSE. If the VDUSE application does not
>  reply to control queue messages, it currently ends up
>  hanging the kernel thread sending this command.
> 
>  Some work is on-going to make the control queue
>  implementation robust with VDUSE. Until it is completed,
>  let's disable control virtqueue and features that depend on
>  it.
> 
>  Signed-off-by: Maxime Coquelin 
> >>>
> >>> I wonder if it's better to fail instead of a mask as a start.
> >>
> >> I think it is better to use a mask and not fail, so that we can in the
> >> future use a recent VDUSE application with an older kernel.
> >
> > It may confuse the userspace unless userspace can do post check after
> > CREATE_DEV.
> >
> > And for blk we fail when WCE is set in feature_is_valid():
> >
> > static bool features_is_valid(u64 features)
> > {
> >  if (!(features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)))
> >  return false;
> >
> >  /* Now we only support read-only configuration space */
> >  if (features & (1ULL << VIRTIO_BLK_F_CONFIG_WCE))
> >  return false;
> >
> >  return true;
> > }
>
> Ok, consistency with other devices types is indeed better.
>
> But should I fail if any of the feature advertised by the application is
> not listed by the VDUSE driver, or just fail if control queue is being
> advertised by the application?

Maybe it's better to fail for any other of the features that depend on
the control vq.

Thanks

>
> Thanks,
> Maxime
>
> > Thanks
> >
> >>
> >> Why would it be better to fail than negotiating?
> >>
> >> Thanks,
> >> Maxime
> >>
> >
>

Re: [PATCH RFC 0/4] virtio-net: add tx-hash, rx-tstamp, tx-tstamp and tx-time

2023-12-19 Thread Jason Wang

On Tue, Dec 19, 2023 at 12:36 AM Willem de Bruijn
 wrote:
>
> Steffen Trumtrar wrote:
> > This series tries to pick up the work on the virtio-net timestamping
> > feature from Willem de Bruijn.
> >
> > Original series
> > Message-Id: 20210208185558.995292-1-willemdebruijn.ker...@gmail.com
> > Subject: [PATCH RFC v2 0/4] virtio-net: add tx-hash, rx-tstamp,
> > tx-tstamp and tx-time
> > From: Willem de Bruijn 
> >
> > RFC for four new features to the virtio network device:
> >
> > 1. pass tx flow state to host, for routing + telemetry
> > 2. pass rx tstamp to guest, for better RTT estimation
> > 3. pass tx tstamp to guest, idem
> > 3. pass tx delivery time to host, for accurate pacing
> >
> > All would introduce an extension to the virtio spec.
> >
> > The original series consisted of a hack around the DMA API, which should
> > be fixed in this series.
> >
> > The changes in this series are to the driver side. For the changes to qemu 
> > see:
> > https://github.com/strumtrar/qemu/tree/v8.1.1/virtio-net-ptp
> >
> > Currently only virtio-net is supported. The original series used
> > vhost-net as backend. However, the path through tun via sendmsg doesn't
> > allow us to write data back to the driver side without any hacks.
> > Therefore use the way via plain virtio-net without vhost albeit better
> > performance.
> >
> > Signed-off-by: Steffen Trumtrar 
>
> Thanks for picking this back up, Steffen. Nice to see that the code still
> applies mostly cleanly.
>
> For context: I dropped the work only because I had no real device
> implementation. The referenced patch series to qemu changes that.
>
> I suppose the main issue is the virtio API changes that this introduces,
> which will have to be accepted to the spec.
>
> One small comment to patch 4: there I just assumed the virtual device
> time is CLOCK_TAI. There is a concurrent feature under review for HW
> pacing offload with AF_XDP sockets. The clock issue comes up a bit. In
> general, for hardware we cannot assume a clock.

Any reason for this? E.g some modern NIC have PTP support.

> For virtio, perhaps
> assuming the same monotonic hardware clock in guest and host can be
> assumed.

Note that virtio can be implemented in hardware now. So we can assume
things like the kvm ptp clock.

> But this clock alignment needs some thought.
>

Thanks

Re: [PATCH vhost v4 14/15] vdpa/mlx5: Introduce reference counting to mrs

2023-12-19 Thread Jason Wang

On Wed, Dec 20, 2023 at 2:10 AM Dragos Tatulea  wrote:
>
> Deleting the old mr during mr update (.set_map) and then modifying the
> vqs with the new mr is not a good flow for firmware. The firmware
> expects that mkeys are deleted after there are no more vqs referencing
> them.
>
> Introduce reference counting for mrs to fix this. It is the only way to
> make sure that mkeys are not in use by vqs.
>
> An mr reference is taken when the mr is associated to the mr asid table
> and when the mr is linked to the vq on create/modify. The reference is
> released when the mkey is unlinked from the vq (trough modify/destroy)
> and from the mr asid table.
>
> To make things consistent, get rid of mlx5_vdpa_destroy_mr and use
> get/put semantics everywhere.
>
> Reviewed-by: Gal Pressman 
> Acked-by: Eugenio Pérez 
> Signed-off-by: Dragos Tatulea 
> ---

Acked-by: Jason Wang 

Thanks

Re: [PATCH vhost v4 13/15] vdpa/mlx5: Use vq suspend/resume during .set_map

2023-12-19 Thread Jason Wang

On Wed, Dec 20, 2023 at 2:10 AM Dragos Tatulea  wrote:
>
> Instead of tearing down and setting up vq resources, use vq
> suspend/resume during .set_map to speed things up a bit.
>
> The vq mr is updated with the new mapping while the vqs are suspended.
>
> If the device doesn't support resumable vqs, do the old teardown and
> setup dance.
>
> Reviewed-by: Gal Pressman 
> Acked-by: Eugenio Pérez 
> Signed-off-by: Dragos Tatulea 
> ---

Acked-by: Jason Wang 

Thanks

Re: [PATCH vhost v4 12/15] vdpa/mlx5: Mark vq state for modification in hw vq

2023-12-19 Thread Jason Wang

On Wed, Dec 20, 2023 at 2:10 AM Dragos Tatulea  wrote:
>
> .set_vq_state will set the indices and mark the fields to be modified in
> the hw vq.
>
> Advertise that the device supports changing the vq state when the device
> is in DRIVER_OK state and suspended.
>
> Reviewed-by: Gal Pressman 
> Signed-off-by: Dragos Tatulea 
> ---

Acked-by: Jason Wang 

Thanks

Re: [PATCH vhost v4 10/15] vdpa/mlx5: Introduce per vq and device resume

2023-12-19 Thread Jason Wang

On Wed, Dec 20, 2023 at 2:10 AM Dragos Tatulea  wrote:
>
> Implement vdpa vq and device resume if capability detected. Add support
> for suspend -> ready state change.
>
> Reviewed-by: Gal Pressman 
> Acked-by: Eugenio Pérez 
> Signed-off-by: Dragos Tatulea 

Acked-by: Jason Wang 

Thanks

Re: [PATCH vhost v4 09/15] vdpa/mlx5: Allow modifying multiple vq fields in one modify command

2023-12-19 Thread Jason Wang

On Wed, Dec 20, 2023 at 2:10 AM Dragos Tatulea  wrote:
>
> Add a bitmask variable that tracks hw vq field changes that
> are supposed to be modified on next hw vq change command.
>
> This will be useful to set multiple vq fields when resuming the vq.
>
> Reviewed-by: Gal Pressman 
> Acked-by: Eugenio Pérez 
> Signed-off-by: Dragos Tatulea 

Acked-by: Jason Wang 

Thanks

Re: [PATCH vhost v4 06/15] vdpa: Track device suspended state

2023-12-19 Thread Jason Wang

On Wed, Dec 20, 2023 at 2:09 AM Dragos Tatulea  wrote:
>
> Set vdpa device suspended state on successful suspend. Clear it on
> successful resume and reset.
>
> The state will be locked by the vhost_vdpa mutex. The mutex is taken
> during suspend, resume and reset in vhost_vdpa_unlocked_ioctl. The
> exception is vhost_vdpa_open which does a device reset but that should
> be safe because it can only happen before the other ops.
>
> Signed-off-by: Dragos Tatulea 
> Suggested-by: Eugenio Pérez 
> ---
>  drivers/vhost/vdpa.c | 17 +++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index b4e8ddf86485..00b4fa8e89f2 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -59,6 +59,7 @@ struct vhost_vdpa {
> int in_batch;
> struct vdpa_iova_range range;
> u32 batch_asid;
> +   bool suspended;

Any reason why we don't do it in the core vDPA device but here?

Thanks

Re: [PATCH vhost v4 02/15] vdpa: Add VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND flag

2023-12-19 Thread Jason Wang

On Wed, Dec 20, 2023 at 2:09 AM Dragos Tatulea  wrote:
>
> The virtio spec doesn't allow changing virtqueue addresses after
> DRIVER_OK. Some devices do support this operation when the device is
> suspended. The VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND flag
> advertises this support as a backend features.

There's an ongoing effort in virtio spec to introduce the suspend state.

So I wonder if it's better to just allow such behaviour?

Thanks


>
> Signed-off-by: Dragos Tatulea 
> Suggested-by: Eugenio Pérez 
> ---
>  include/uapi/linux/vhost_types.h | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/include/uapi/linux/vhost_types.h 
> b/include/uapi/linux/vhost_types.h
> index d7656908f730..aacd067afc89 100644
> --- a/include/uapi/linux/vhost_types.h
> +++ b/include/uapi/linux/vhost_types.h
> @@ -192,5 +192,9 @@ struct vhost_vdpa_iova_range {
>  #define VHOST_BACKEND_F_DESC_ASID0x7
>  /* IOTLB don't flush memory mapping across device reset */
>  #define VHOST_BACKEND_F_IOTLB_PERSIST  0x8
> +/* Device supports changing virtqueue addresses when device is suspended
> + * and is in state DRIVER_OK.
> + */
> +#define VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND  0x9
>
>  #endif
> --
> 2.43.0
>

Re: [PATCH mlx5-vhost v4 01/15] vdpa/mlx5: Expose resumable vq capability

2023-12-19 Thread Jason Wang

On Wed, Dec 20, 2023 at 2:09 AM Dragos Tatulea  wrote:
>
> Necessary for checking if resumable vqs are supported by the hardware.
> Actual support will be added in a downstream patch.
>
> Reviewed-by: Gal Pressman 
> Acked-by: Eugenio Pérez 
> Signed-off-by: Dragos Tatulea 

Acked-by: Jason Wang 

Thanks


> ---
>  include/linux/mlx5/mlx5_ifc.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
> index 6f3631425f38..9eaceaf6bcb0 100644
> --- a/include/linux/mlx5/mlx5_ifc.h
> +++ b/include/linux/mlx5/mlx5_ifc.h
> @@ -1236,7 +1236,8 @@ struct mlx5_ifc_virtio_emulation_cap_bits {
>
> u8 reserved_at_c0[0x13];
> u8 desc_group_mkey_supported[0x1];
> -   u8 reserved_at_d4[0xc];
> +   u8 freeze_to_rdy_supported[0x1];
> +   u8 reserved_at_d5[0xb];
>
> u8 reserved_at_e0[0x20];
>
> --
> 2.43.0
>

[PATCH -next v4 2/2] mm: vmscan: add new event to trace shrink lru

2023-12-19 Thread Bixuan Cui

From: cuibixuan 

Add a new event to calculate the shrink_inactive_list()/shrink_active_list()
execution time.

Example of output:
 kswapd0-103 [007] .  1098.353020: 
mm_vmscan_lru_shrink_active_start: nid=0
 kswapd0-103 [007] .  1098.353040: 
mm_vmscan_lru_shrink_active_end: nid=0 nr_taken=32 nr_active=0 
nr_deactivated=32 nr_referenced=0 priority=6 
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC
 kswapd0-103 [007] .  1098.353040: 
mm_vmscan_lru_shrink_inactive_start: nid=0
 kswapd0-103 [007] .  1098.353094: 
mm_vmscan_lru_shrink_inactive_end: nid=0 nr_scanned=32 nr_reclaimed=0 
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate_anon=0 
nr_activate_file=0 nr_ref_keep=32 nr_unmap_fail=0 priority=6 
flags=RECLAIM_WB_ANON|RECLAIM_WB_ASYNC
 kswapd0-103 [007] .  1098.353094: 
mm_vmscan_lru_shrink_inactive_start: nid=0
 kswapd0-103 [007] .  1098.353162: 
mm_vmscan_lru_shrink_inactive_end: nid=0 nr_scanned=32 nr_reclaimed=21 
nr_dirty=0 nr_writeback=0 nr_congested=0 nr_immediate=0 nr_activate_anon=0 
nr_activate_file=0 nr_ref_keep=11 nr_unmap_fail=0 priority=6 
flags=RECLAIM_WB_FILE|RECLAIM_WB_ASYNC

Signed-off-by: Bixuan Cui 
Reviewed-by: Andrew Morton 
---
v4: Add Reviewed-by and Changlog to every patch.
v2: Modify trace_mm_vmscan_lru_shrink_inactive() in evict_folios() at the same 
time to fix build error.

 include/trace/events/vmscan.h | 38 +--
 mm/vmscan.c   | 11 +++---
 2 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index b99cd28c9815..02868bdc5999 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -395,7 +395,24 @@ TRACE_EVENT(mm_vmscan_write_folio,
show_reclaim_flags(__entry->reclaim_flags))
 );
 
-TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
+TRACE_EVENT(mm_vmscan_lru_shrink_inactive_start,
+
+   TP_PROTO(int nid),
+
+   TP_ARGS(nid),
+
+   TP_STRUCT__entry(
+   __field(int, nid)
+   ),
+
+   TP_fast_assign(
+   __entry->nid = nid;
+   ),
+
+   TP_printk("nid=%d", __entry->nid)
+);
+
+TRACE_EVENT(mm_vmscan_lru_shrink_inactive_end,
 
TP_PROTO(int nid,
unsigned long nr_scanned, unsigned long nr_reclaimed,
@@ -446,7 +463,24 @@ TRACE_EVENT(mm_vmscan_lru_shrink_inactive,
show_reclaim_flags(__entry->reclaim_flags))
 );
 
-TRACE_EVENT(mm_vmscan_lru_shrink_active,
+TRACE_EVENT(mm_vmscan_lru_shrink_active_start,
+
+   TP_PROTO(int nid),
+
+   TP_ARGS(nid),
+
+   TP_STRUCT__entry(
+   __field(int, nid)
+   ),
+
+   TP_fast_assign(
+   __entry->nid = nid;
+   ),
+
+   TP_printk("nid=%d", __entry->nid)
+);
+
+TRACE_EVENT(mm_vmscan_lru_shrink_active_end,
 
TP_PROTO(int nid, unsigned long nr_taken,
unsigned long nr_active, unsigned long nr_deactivated,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4e3b835c6b4a..a44d9624d60f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1906,6 +1906,8 @@ static unsigned long shrink_inactive_list(unsigned long 
nr_to_scan,
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
bool stalled = false;
 
+   trace_mm_vmscan_lru_shrink_inactive_start(pgdat->node_id);
+
while (unlikely(too_many_isolated(pgdat, file, sc))) {
if (stalled)
return 0;
@@ -1990,7 +1992,7 @@ static unsigned long shrink_inactive_list(unsigned long 
nr_to_scan,
if (file)
sc->nr.file_taken += nr_taken;
 
-   trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
+   trace_mm_vmscan_lru_shrink_inactive_end(pgdat->node_id,
nr_scanned, nr_reclaimed, &stat, sc->priority, file);
return nr_reclaimed;
 }
@@ -2028,6 +2030,8 @@ static void shrink_active_list(unsigned long nr_to_scan,
int file = is_file_lru(lru);
struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 
+   trace_mm_vmscan_lru_shrink_active_start(pgdat->node_id);
+
lru_add_drain();
 
spin_lock_irq(&lruvec->lru_lock);
@@ -2107,7 +2111,7 @@ static void shrink_active_list(unsigned long nr_to_scan,
lru_note_cost(lruvec, file, 0, nr_rotated);
mem_cgroup_uncharge_list(&l_active);
free_unref_page_list(&l_active);
-   trace_mm_vmscan_lru_shrink_active(pgdat->node_id, nr_taken, nr_activate,
+   trace_mm_vmscan_lru_shrink_active_end(pgdat->node_id, nr_taken, 
nr_activate,
nr_deactivate, nr_rotated, sc->priority, file);
 }
 
@@ -4524,9 +4528,10 @@ static int evict_folios(struct lruvec *lruvec, struct 
scan_control *sc, int swap
if (list_empty(&list))
return scanned;
 retry:
+   trace_mm_vmscan_lru_shrink_inactive_start(pgdat->node_id);
reclaimed = shrink_folio_list(&list, pgdat, sc

[PATCH -next v4 1/2] mm: shrinker: add new event to trace shrink count

2023-12-19 Thread Bixuan Cui

From: cuibixuan 

do_shrink_slab() calculates the freeable memory through 
shrinker->count_objects(),
and then reclaims the memory through shrinker->scan_objects(). When reclaiming
memory, shrinker->count_objects() takes a certain amount of time:

Fun   spend(us)
ext4_es_count 4302
ext4_es_scan  12
super_cache_count 4195
super_cache_scan  2103

Therefore, adding the trace event to count_objects() can more accurately
obtain the time taken for slab memory recycling.

Example of output:
 kswapd0-103 [003] .  1098.317942: mm_shrink_count_start: 
kfree_rcu_shrink_count.cfi_jt+0x0/0x8 c540ff51: nid: 0
 kswapd0-103 [003] .  1098.317951: mm_shrink_count_end: 
kfree_rcu_shrink_count.cfi_jt+0x0/0x8 c540ff51: nid: 0 freeable:36

Signed-off-by: Bixuan Cui 
Reviewed-by: Steven Rostedt 
---
v4: Add Reviewed-by and Changlog to every patch.
v3: Swap the positions of 'nid' and 'freeable' to prevent the hole in the trace 
event.

 include/trace/events/vmscan.h | 49 +++
 mm/shrinker.c |  4 +++
 2 files changed, 53 insertions(+)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 1a488c30afa5..b99cd28c9815 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -196,6 +196,55 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, 
mm_vmscan_memcg_softlimit_re
 );
 #endif /* CONFIG_MEMCG */
 
+TRACE_EVENT(mm_shrink_count_start,
+   TP_PROTO(struct shrinker *shr, struct shrink_control *sc),
+
+   TP_ARGS(shr, sc),
+
+   TP_STRUCT__entry(
+   __field(struct shrinker *, shr)
+   __field(void *, shrink)
+   __field(int, nid)
+   ),
+
+   TP_fast_assign(
+   __entry->shr = shr;
+   __entry->shrink = shr->count_objects;
+   __entry->nid = sc->nid;
+   ),
+
+   TP_printk("%pS %p: nid: %d",
+   __entry->shrink,
+   __entry->shr,
+   __entry->nid)
+);
+
+TRACE_EVENT(mm_shrink_count_end,
+   TP_PROTO(struct shrinker *shr, struct shrink_control *sc, long 
freeable),
+
+   TP_ARGS(shr, sc, freeable),
+
+   TP_STRUCT__entry(
+   __field(struct shrinker *, shr)
+   __field(void *, shrink)
+   __field(long, freeable)
+   __field(int, nid)
+   ),
+
+   TP_fast_assign(
+   __entry->shr = shr;
+   __entry->shrink = shr->count_objects;
+   __entry->freeable = freeable;
+   __entry->nid = sc->nid;
+   ),
+
+   TP_printk("%pS %p: nid: %d freeable:%ld",
+   __entry->shrink,
+   __entry->shr,
+   __entry->nid,
+   __entry->freeable)
+);
+
 TRACE_EVENT(mm_shrink_slab_start,
TP_PROTO(struct shrinker *shr, struct shrink_control *sc,
long nr_objects_to_shrink, unsigned long cache_items,
diff --git a/mm/shrinker.c b/mm/shrinker.c
index dd91eab43ed3..d0c7bf61db61 100644
--- a/mm/shrinker.c
+++ b/mm/shrinker.c
@@ -379,7 +379,11 @@ static unsigned long do_shrink_slab(struct shrink_control 
*shrinkctl,
  : SHRINK_BATCH;
long scanned = 0, next_deferred;
 
+   trace_mm_shrink_count_start(shrinker, shrinkctl);
+
freeable = shrinker->count_objects(shrinker, shrinkctl);
+
+   trace_mm_shrink_count_end(shrinker, shrinkctl, freeable);
if (freeable == 0 || freeable == SHRINK_EMPTY)
return freeable;
 
-- 
2.17.1

[PATCH -next v4 0/2] Make memory reclamation measurable

2023-12-19 Thread Bixuan Cui

When the system memory is low, kswapd reclaims the memory. The key steps
of memory reclamation include
1.shrink_lruvec
  * shrink_active_list, moves folios from the active LRU to the inactive LRU
  * shrink_inactive_list, shrink lru from inactive LRU list
2.shrink_slab
  * shrinker->count_objects(), calculates the freeable memory
  * shrinker->scan_objects(), reclaims the slab memory

The existing tracers in the vmscan are as follows:

--do_try_to_free_pages
--shrink_zones
--trace_mm_vmscan_node_reclaim_begin (tracer)
--shrink_node
--shrink_node_memcgs
  --trace_mm_vmscan_memcg_shrink_begin (tracer)
  --shrink_lruvec
--shrink_list
  --shrink_active_list
  --trace_mm_vmscan_lru_shrink_active (tracer)
  --shrink_inactive_list
  --trace_mm_vmscan_lru_shrink_inactive (tracer)
--shrink_active_list
  --shrink_slab
--do_shrink_slab
--shrinker->count_objects()
--trace_mm_shrink_slab_start (tracer)
--shrinker->scan_objects()
--trace_mm_shrink_slab_end (tracer)
  --trace_mm_vmscan_memcg_shrink_end (tracer)
--trace_mm_vmscan_node_reclaim_end (tracer)

If we get the duration and quantity of shrink lru and slab,
then we can measure the memory recycling, as follows

Measuring memory reclamation with bpf:
  LRU FILE:
CPU COMMShrinkActive(us) ShrinkInactive(us)  Reclaim(page)
7   kswapd0 26  51  32
7   kswapd0 52  47  13
  SLAB:
CPU COMMOBJ_NAMECount_Dur(us) 
Freeable(page) Scan_Dur(us) Reclaim(page)
 1  kswapd0 super_cache_scan.cfi_jt 2   341 
   3225 128
 7  kswapd0 super_cache_scan.cfi_jt 0   
2247   8524 1024
 7  kswapd0 super_cache_scan.cfi_jt 23670   
   00

For this, add the new tracer to shrink_active_list/shrink_inactive_list
and shrinker->count_objects().

Changes:
v4: Add Reviewed-by and Changlog to every patch.
v3: Swap the positions of 'nid' and 'freeable' to prevent the hole in the trace 
event.
v2: Modify trace_mm_vmscan_lru_shrink_inactive() in evict_folios() at the same 
time to fix build error.

cuibixuan (2):
  mm: shrinker: add new event to trace shrink count
  mm: vmscan: add new event to trace shrink lru

 include/trace/events/vmscan.h | 87 ++-
 mm/shrinker.c |  4 ++
 mm/vmscan.c   | 11 +++--
 3 files changed, 97 insertions(+), 5 deletions(-)

-- 
2.17.1

Re: [PATCH v5 28/34] fprobe: Rewrite fprobe on function-graph tracer

2023-12-19 Thread Google

On Tue, 19 Dec 2023 15:39:03 +0100
Jiri Olsa  wrote:

> On Mon, Dec 18, 2023 at 10:17:10PM +0900, Masami Hiramatsu (Google) wrote:
> 
> SNIP
> 
> > -static void fprobe_exit_handler(struct rethook_node *rh, void *data,
> > -   unsigned long ret_ip, struct pt_regs *regs)
> > +static int fprobe_entry(unsigned long func, unsigned long ret_ip,
> > +   struct ftrace_regs *fregs, struct fgraph_ops *gops)
> >  {
> > -   struct fprobe *fp = (struct fprobe *)data;
> > -   struct fprobe_rethook_node *fpr;
> > -   struct ftrace_regs *fregs = (struct ftrace_regs *)regs;
> > -   int bit;
> > +   struct fprobe_hlist_node *node, *first;
> > +   unsigned long *fgraph_data = NULL;
> > +   unsigned long header;
> > +   int reserved_words;
> > +   struct fprobe *fp;
> > +   int used, ret;
> >  
> > -   if (!fp || fprobe_disabled(fp))
> > -   return;
> > +   if (WARN_ON_ONCE(!fregs))
> > +   return 0;
> >  
> > -   fpr = container_of(rh, struct fprobe_rethook_node, node);
> > +   first = node = find_first_fprobe_node(func);
> > +   if (unlikely(!first))
> > +   return 0;
> > +
> > +   reserved_words = 0;
> > +   hlist_for_each_entry_from_rcu(node, hlist) {
> > +   if (node->addr != func)
> > +   break;
> > +   fp = READ_ONCE(node->fp);
> > +   if (!fp || !fp->exit_handler)
> > +   continue;
> > +   /*
> > +* Since fprobe can be enabled until the next loop, we ignore 
> > the
> > +* fprobe's disabled flag in this loop.
> > +*/
> > +   reserved_words +=
> > +   DIV_ROUND_UP(fp->entry_data_size, sizeof(long)) + 1;
> > +   }
> > +   node = first;
> > +   if (reserved_words) {
> > +   fgraph_data = fgraph_reserve_data(gops->idx, reserved_words * 
> > sizeof(long));
> > +   if (unlikely(!fgraph_data)) {
> > +   hlist_for_each_entry_from_rcu(node, hlist) {
> > +   if (node->addr != func)
> > +   break;
> > +   fp = READ_ONCE(node->fp);
> > +   if (fp && !fprobe_disabled(fp))
> > +   fp->nmissed++;
> > +   }
> > +   return 0;
> > +   }
> > +   }
> 
> this looks expensive compared to what we do now.. IIUC due to the graph
> ops limitations (16 ctive ops), you have just single graph ops for fprobe
> and each fprobe registration stores ips into hash which you need to search
> in here to get registered callbacks..?

I think this is not so expensive. Most cases, it only hits 1 fprobe on the
hash. And if the fprobe is only used to hook the entry, reserved_words == 0.

> I wonder would it make sense to allow arbitrary number of active graph_ops
> with the price some callback might fail because there's no stack space so
> each fprobe instance would have its own graph_ops.. and we would get rid
> of the code above (and below) ?

Yeah, actually my first implementation is that. But I realized that doesn't
work, this requires intermediate object which has refcounter because the
"marker" on the shadow stack will be left after unregistering it. We need to
identify which is still available and which is not. And for that purpose,
we may need to introduce similar structure in the fgraph too.

The current multi-fgraph does;

 - if CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS=n (called from dedicated mcount
   asm code), it has to loop on all fgraph_ops and check the hash, which is
   inefficient but it can easily push the return trace entry on the shadow
   stack.

 - if CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS=y (called from ftrace asm code),
   it does not need to loop (that will be done by ftrace) but each handler
   does NOT know who pushed the return trace entry on the shadow stack.
   Thus it has to decode the shadow stack and check it needs to push return
   trace entry or not. And this is hard if the traced function is self-
   recursive call or tail call. To check the recursive call, I introduced
   a bitmap entry on the shadow stack. This bitmap size limits the max
   number of fgraph.

So, unlimit the number of fgraph, we may need to stack the number of fgraph
on the stack and each fgraph callback has to unwind the shadow stack to check
whether their own number is there instead of checking the bit in the bitmap.
That will be more trusted way but maybe slow.

Another option is introducing a pair of pre- and post-callbacks which is called
before and after calling the list/direct call of ftrace_ops. And pre-callback
will push the ret_stack on shadow stack and post-callback will commit or cancel 
it.
(but this one is hard to design... maybe becomes ugly interface.)

Thank you,

-- 
Masami Hiramatsu (Google)

Re: [PATCH] ring-buffer: Fix slowpath of interrupted event

2023-12-19 Thread Google

On Tue, 19 Dec 2023 10:10:27 -0500
Steven Rostedt  wrote:

> On Tue, 19 Dec 2023 23:37:10 +0900
> Masami Hiramatsu (Google)  wrote:
> 
> > Yeah the above works, but my question is, do we really need this
> > really slow path? I mean;
> > 
> > >   if (w == write - event length) {
> > >   /* Nothing interrupted between A and C */
> > >  /*E*/write_stamp = ts;
> > >   delta = ts - after  
> > 
> > } else {
> > /*
> >   Something interrupted between A and C, which should record
> >   a new entry before this reserved entry with newer timestamp.
> >   we reuse it.
> >  */
> > ts = after = write_stamp;
> > delta = 0;
> > }
> > 
> > Isn't this enough?
> 
> I really like to avoid: delta = 0 when possible. It's basically what I do
> when I have no other options. Why?
> 
> Because let's just say you are looking at the time of interrupt events. If
> you just trace the entry of the interrupt, and that interrupt interrupted
> an event being written to, we have this:
> 
> Time starts at ts 1000, and we are able to calculate the delta of the
> interrupted event. And the trace will have:
> 
>  1000 - interrupt event
>  2000 - normal context event
>  2100 - next normal context event
> 
> Where we see the delta between the interrupt event and the normal context
> event was 1000. But if we just had it be delta = 0, it would be:
> 
>  1000 - interrupt event
>  1000 - normal context event
>  2100 - next normal context event
> 
> It will look like the time between the interrupt event and the normal
> context event was instant, and the time between the normal context event
> and the next normal context event was 1100 when in reality it was just 100.

Ah, OK. interrupt event can be the beginning of the interrupt handling
and normal context event seems not interrupted from the traced log.
OK, then we have to adjust the ts of normal context event. (interrupted
after reserving the event is OK because user can observe the interrupt
event on the log between normal context event and next one.)

Reveiewed-by: Masami Hiramatsu (Google) 

Thank you!

> 
> The above scenario is rather common. Perhaps it happens 1% of the time. The
> case where we currently have delta = 0 only happens when the same event
> gets interrupted twice. That is, two separate interrupts came in, one
> before it allocated its space on the buffer, and one after it allocated.
> That's a much more race occurrence (0.01% possibly, or less). In fact my
> traces seldom even show it. Most of the time, even when doing function
> tracing, I have a hard time seeing this rare race.
> 
> So, if we can have delta=0 only 0.01% or less of the time, I rather do that
> then have it be delta=0 1% of the time.
> 
> Thanks for the review!
> 
> -- Steve


-- 
Masami Hiramatsu (Google)

Re: [PATCH] [v2] nvdimm-btt: fix several memleaks

2023-12-19 Thread Ira Weiny

dinghao.liu@ wrote:
> > Ira Weiny wrote:
> > > Dinghao Liu wrote:
> > 
> > [snip]
> > 
> > -static int btt_freelist_init(struct arena_info *arena)
> > +static int btt_freelist_init(struct device *dev, struct arena_info *arena)
> > 
> > Both struct arena_info and struct btt contain references to struct nd_btt
> > which is the device you are passing down this call stack.
> > 
> > Just use the device in the arena/btt rather than passing a device
> > structure.  That makes the code easier to read and ensures that the device
> > associated with this arena or btt is used.
> 
> Thanks for this suggestion! I will fix this in the v3 patch.
> 
> > [snip]
> > > >  
> > > > -static int btt_maplocks_init(struct arena_info *arena)
> > > > +static int btt_maplocks_init(struct device *dev, struct arena_info 
> > > > *arena)
> > > >  {
> > > > u32 i;
> > > >  
> > > > -   arena->map_locks = kcalloc(arena->nfree, sizeof(struct 
> > > > aligned_lock),
> > > > +   arena->map_locks = devm_kcalloc(dev, arena->nfree, 
> > > > sizeof(struct aligned_lock),
> > > > GFP_KERNEL);
> > > > if (!arena->map_locks)
> > > > return -ENOMEM;
> > > > @@ -805,9 +805,6 @@ static void free_arenas(struct btt *btt)
> > > >  
> > > > list_for_each_entry_safe(arena, next, &btt->arena_list, list) {
> > > > list_del(&arena->list);
> > > > -   kfree(arena->rtt);
> > > > -   kfree(arena->map_locks);
> > > > -   kfree(arena->freelist);
> > > 
> > > This does not quite work.
> > > 
> > > free_arenas() is used in the error paths of create_arenas() and
> > > discover_arenas().  In those cases devm_kfree() is probably a better way
> > > to clean up this.
> 
> Here I'm a little confused about when devm_kfree() should be used.

Over all it should be used whenever memory is allocated for the lifetime
of the device.

> Code in btt_init() implies that resources allocated by devm_* could be
> auto freed in both error and success paths of probe/attach (e.g., btt 
> allocated by devm_kzalloc is never freed by devm_kfree).
> Using devm_kfree() in free_arenas() is ok for me, but I want to know
> whether not using devm_kfree() constitutes a bug.

Unfortunately I'm not familiar enough with this code to know for sure.

After my quick checks before I thought it was.  But each time I look at it
I get confused.  This is why I was thinking maybe not using devm_*() and
using no_free_ptr() may be a better solution to make sure things don't
leak without changing the success path (which is likely working fine
because no bugs have been found.)

> 
> > > 
> > > However...
> > > 
> > > > debugfs_remove_recursive(arena->debugfs_dir);
> > > > kfree(arena);
> > > 
> > > Why can't arena be allocated with devm_*?
> > > 
> > > We need to change this up a bit more to handle the error path vs regular
> > > device shutdown free (automatic devm frees).
> > 
> 
> At first, I think the use of arena is correct. Therefore, allocating arena
> with devm_* should be a code style optimization. However, I rechecked
> discover_arenas and found arena might also be leaked (e.g., if
> alloc_arena() fails in the second loop, the previously allocated
> resources in the loop is leaked). The correct code could be found in
> create_arenas(), where free_arenas is called on failure of
> alloc_arena().
 
Yea I've not reached that level of detail in my analysis.

>
> To fix this issue, I think it's better to change the 'goto out_super'
> tag to 'goto out'. We could also use devm_* for arena to simplify the
> error path in discover_arenas(). 

I think it is your call at this point as I don't have time to dig in more
than I have.  Sorry.

> 
> > We might want to look at using no_free_ptr() in this code.  See this
> > patch[1] for an example of how to inhibit the cleanup and pass the
> > pointer on when the function succeeds.
> > 
> > [1]
> > https://lore.kernel.org/all/170261791914.1714654.6447680285357545638.st...@dwillia2-xfh.jf.intel.com/
> > 
> > Ira
> 
> Thanks for this example. But it seems that no_free_ptr() is used to
> handle the scope based resource management. Changes in this patch does
> not introduce this feature. Do I understand this correctly?

You do understand but I was thinking that perhaps using no_free_ptr()
rather than devm_*() might be an easier way to fix this bug without trying
to decode the lifetime of everything.

Ira

> 
> Regards,
> Dinghao

Re: [PATCH v5 24/34] fprobe: Use ftrace_regs in fprobe entry handler

2023-12-19 Thread Google

On Tue, 19 Dec 2023 14:23:48 +0100
Jiri Olsa  wrote:

> On Mon, Dec 18, 2023 at 10:16:23PM +0900, Masami Hiramatsu (Google) wrote:
> 
> SNIP
> 
> > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > index 84e8a0f6e4e0..d3f8745d8ead 100644
> > --- a/kernel/trace/bpf_trace.c
> > +++ b/kernel/trace/bpf_trace.c
> > @@ -2503,7 +2503,7 @@ static int __init bpf_event_init(void)
> >  fs_initcall(bpf_event_init);
> >  #endif /* CONFIG_MODULES */
> >  
> > -#ifdef CONFIG_FPROBE
> > +#if defined(CONFIG_FPROBE) && defined(CONFIG_DYNAMIC_FTRACE_WITH_REGS)
> >  struct bpf_kprobe_multi_link {
> > struct bpf_link link;
> > struct fprobe fp;
> > @@ -2733,10 +2733,14 @@ kprobe_multi_link_prog_run(struct 
> > bpf_kprobe_multi_link *link,
> >  
> >  static int
> >  kprobe_multi_link_handler(struct fprobe *fp, unsigned long fentry_ip,
> > - unsigned long ret_ip, struct pt_regs *regs,
> > + unsigned long ret_ip, struct ftrace_regs *fregs,
> >   void *data)
> >  {
> > struct bpf_kprobe_multi_link *link;
> > +   struct pt_regs *regs = ftrace_get_regs(fregs);
> > +
> > +   if (!regs)
> > +   return 0;
> >  
> > link = container_of(fp, struct bpf_kprobe_multi_link, fp);
> > kprobe_multi_link_prog_run(link, get_entry_ip(fentry_ip), regs);
> > @@ -3008,7 +3012,7 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr 
> > *attr, struct bpf_prog *pr
> > kvfree(cookies);
> > return err;
> >  }
> > -#else /* !CONFIG_FPROBE */
> > +#else /* !CONFIG_FPROBE || !CONFIG_DYNAMIC_FTRACE_WITH_REGS */
> >  int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct 
> > bpf_prog *prog)
> >  {
> > return -EOPNOTSUPP;
> > diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
> > index 6cd2a4e3afb8..f12569494d8a 100644
> > --- a/kernel/trace/fprobe.c
> > +++ b/kernel/trace/fprobe.c
> > @@ -46,7 +46,7 @@ static inline void __fprobe_handler(unsigned long ip, 
> > unsigned long parent_ip,
> > }
> >  
> > if (fp->entry_handler)
> > -   ret = fp->entry_handler(fp, ip, parent_ip, 
> > ftrace_get_regs(fregs), entry_data);
> > +   ret = fp->entry_handler(fp, ip, parent_ip, fregs, entry_data);
> >  
> > /* If entry_handler returns !0, nmissed is not counted. */
> > if (rh) {
> > @@ -182,7 +182,7 @@ static void fprobe_init(struct fprobe *fp)
> > fp->ops.func = fprobe_kprobe_handler;
> > else
> > fp->ops.func = fprobe_handler;
> > -   fp->ops.flags |= FTRACE_OPS_FL_SAVE_REGS;
> > +   fp->ops.flags |= FTRACE_OPS_FL_SAVE_ARGS;
> 
> so with this change you move to ftrace_caller trampoline,
> but we need ftrace_regs_caller right?

Yes, that's right.

> 
> otherwise the (!regs) check in kprobe_multi_link_handler
> will be allways true IIUC

Ah, OK. So until we move to fgraph [28/34], keep this flag SAVE_REGS
then kprobe_multi test will pass.

OK, let me keep it so.

Thank you!

> 
> jirka
> 
> >  }
> >  
> >  static int fprobe_init_rethook(struct fprobe *fp, int num)
> > diff --git a/kernel/trace/trace_fprobe.c b/kernel/trace/trace_fprobe.c
> > index 7d2ddbcfa377..ef6b36fd05ae 100644
> > --- a/kernel/trace/trace_fprobe.c
> > +++ b/kernel/trace/trace_fprobe.c
> > @@ -320,12 +320,16 @@ NOKPROBE_SYMBOL(fexit_perf_func);
> >  #endif /* CONFIG_PERF_EVENTS */
> >  
> >  static int fentry_dispatcher(struct fprobe *fp, unsigned long entry_ip,
> > -unsigned long ret_ip, struct pt_regs *regs,
> > +unsigned long ret_ip, struct ftrace_regs *fregs,
> >  void *entry_data)
> >  {
> > struct trace_fprobe *tf = container_of(fp, struct trace_fprobe, fp);
> > +   struct pt_regs *regs = ftrace_get_regs(fregs);
> > int ret = 0;
> >  
> > +   if (!regs)
> > +   return 0;
> > +
> > if (trace_probe_test_flag(&tf->tp, TP_FLAG_TRACE))
> > fentry_trace_func(tf, entry_ip, regs);
> >  #ifdef CONFIG_PERF_EVENTS
> > diff --git a/lib/test_fprobe.c b/lib/test_fprobe.c
> > index 24de0e5ff859..ff607babba18 100644
> > --- a/lib/test_fprobe.c
> > +++ b/lib/test_fprobe.c
> > @@ -40,7 +40,7 @@ static noinline u32 fprobe_selftest_nest_target(u32 
> > value, u32 (*nest)(u32))
> >  
> >  static notrace int fp_entry_handler(struct fprobe *fp, unsigned long ip,
> > unsigned long ret_ip,
> > -   struct pt_regs *regs, void *data)
> > +   struct ftrace_regs *fregs, void *data)
> >  {
> > KUNIT_EXPECT_FALSE(current_test, preemptible());
> > /* This can be called on the fprobe_selftest_target and the 
> > fprobe_selftest_target2 */
> > @@ -81,7 +81,7 @@ static notrace void fp_exit_handler(struct fprobe *fp, 
> > unsigned long ip,
> >  
> >  static notrace int nest_entry_handler(struct fprobe *fp, unsigned long ip,
> >   unsigned long ret_ip,
> > - struct pt_reg

[PATCH v5 16/15] ring-buffer: Use subbuf_order for buffer page masking

2023-12-19 Thread Steven Rostedt

From: "Steven Rostedt (Google)" 

The comparisons to PAGE_SIZE were all converted to use the
buffer->subbuf_order, but the use of PAGE_MASK was missed.

Convert all the PAGE_MASK usages over to:

  (PAGE_SIZE << cpu_buffer->buffer->subbuf_order) - 1

Fixes: TBD ("ring-buffer: Page size per ring buffer")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 19 +++
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 7ee6779bf292..173d2595ce2d 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -2269,11 +2269,13 @@ rb_commit_index(struct ring_buffer_per_cpu *cpu_buffer)
 }
 
 static __always_inline unsigned
-rb_event_index(struct ring_buffer_event *event)
+rb_event_index(struct ring_buffer_per_cpu *cpu_buffer, struct 
ring_buffer_event *event)
 {
unsigned long addr = (unsigned long)event;
 
-   return (addr & ~PAGE_MASK) - BUF_PAGE_HDR_SIZE;
+   addr &= (PAGE_SIZE << cpu_buffer->buffer->subbuf_order) - 1;
+
+   return addr - BUF_PAGE_HDR_SIZE;
 }
 
 static void rb_inc_iter(struct ring_buffer_iter *iter)
@@ -2646,7 +2648,8 @@ rb_move_tail(struct ring_buffer_per_cpu *cpu_buffer,
 
 /* Slow path */
 static struct ring_buffer_event *
-rb_add_time_stamp(struct ring_buffer_event *event, u64 delta, bool abs)
+rb_add_time_stamp(struct ring_buffer_per_cpu *cpu_buffer,
+ struct ring_buffer_event *event, u64 delta, bool abs)
 {
if (abs)
event->type_len = RINGBUF_TYPE_TIME_STAMP;
@@ -2654,7 +2657,7 @@ rb_add_time_stamp(struct ring_buffer_event *event, u64 
delta, bool abs)
event->type_len = RINGBUF_TYPE_TIME_EXTEND;
 
/* Not the first event on the page, or not delta? */
-   if (abs || rb_event_index(event)) {
+   if (abs || rb_event_index(cpu_buffer, event)) {
event->time_delta = delta & TS_MASK;
event->array[0] = delta >> TS_SHIFT;
} else {
@@ -2728,7 +2731,7 @@ static void rb_add_timestamp(struct ring_buffer_per_cpu 
*cpu_buffer,
if (!abs)
info->delta = 0;
}
-   *event = rb_add_time_stamp(*event, info->delta, abs);
+   *event = rb_add_time_stamp(cpu_buffer, *event, info->delta, abs);
*length -= RB_LEN_TIME_EXTEND;
*delta = 0;
 }
@@ -2812,10 +2815,10 @@ rb_try_to_discard(struct ring_buffer_per_cpu 
*cpu_buffer,
struct buffer_page *bpage;
unsigned long addr;
 
-   new_index = rb_event_index(event);
+   new_index = rb_event_index(cpu_buffer, event);
old_index = new_index + rb_event_ts_length(event);
addr = (unsigned long)event;
-   addr &= PAGE_MASK;
+   addr &= ~((PAGE_SIZE << cpu_buffer->buffer->subbuf_order) - 1);
 
bpage = READ_ONCE(cpu_buffer->tail_page);
 
@@ -3726,7 +3729,7 @@ rb_decrement_entry(struct ring_buffer_per_cpu *cpu_buffer,
struct buffer_page *bpage = cpu_buffer->commit_page;
struct buffer_page *start;
 
-   addr &= PAGE_MASK;
+   addr &= ~((PAGE_SIZE << cpu_buffer->buffer->subbuf_order) - 1);
 
/* Do the likely case first */
if (likely(bpage->page == (void *)addr)) {
-- 
2.42.0

Re: [PATCH] modules: wait do_free_init correctly

2023-12-19 Thread Luis Chamberlain

On Tue, Dec 19, 2023 at 12:51:51PM -0800, Andrew Morton wrote:
> On Tue, 19 Dec 2023 22:12:31 +0800 Changbin Du  wrote:
> 
> > The commit 1a7b7d922081 ("modules: Use vmalloc special flag") moves
> > do_free_init() into a global workqueue instead of call_rcu(). So now
> > we should wait it via flush_work().
> 
> What are the runtime effects of this change?

Indeed that's needed given how old this culprit commit is:

git describe --contains 1a7b7d922081
v5.2-rc1~192^2~5

Who did this work and for what reason? What triggered this itch?

Is it perhaps for an out of tree driver that did something funky
on its module exit?

As per Documentation/RCU/rcubarrier.rst rcu_barrier will ensure the
callbacks complete, so interms of determinism both mechanisms will
have waited for the free. It seems we're now just limiting the scope.

This could also mean initialization grew used to having RCU calls on
init complete at this point in time, even for modules, and so localizing
this wait may now also introduce other unexpected behaviour.

  Luis

Re: [PATCH 0/4] Section alignment issues?

2023-12-19 Thread Luis Chamberlain

On Wed, Nov 22, 2023 at 11:18:10PM +0100, del...@kernel.org wrote:
> From: Helge Deller 
> My questions:
> - Am I wrong with my analysis?

This would typically of course depend on the arch, but whether it helps
is what I would like to see with real numbers rather then speculation.
Howeer, I don't expect some of these are hot paths except maybe the
table lookups. So could you look at some perf stat differences
without / with alignment ? Other than bootup live patching would be
a good test case. We have selftest for modules, the script in selftests
tools/testing/selftests/kmod/kmod.sh is pretty aggressive, but the live
patching tests might be better suited.

> - What does people see on other architectures?
> - Does it make sense to add a compile- and runtime-check, like the patch 
> below, to the kernel?

The chatty aspects really depend on the above results.

Aren't there some archs where an unaligned access would actually crash?
Why hasn't that happened?

  Luis

Re: [PATCH] modules: wait do_free_init correctly

2023-12-19 Thread Andrew Morton

On Tue, 19 Dec 2023 22:12:31 +0800 Changbin Du  wrote:

> The commit 1a7b7d922081 ("modules: Use vmalloc special flag") moves
> do_free_init() into a global workqueue instead of call_rcu(). So now
> we should wait it via flush_work().

What are the runtime effects of this change?

Re: [PATCH v8 0/2] Introducing trace buffer mapping by user-space

2023-12-19 Thread Steven Rostedt

On Tue, 19 Dec 2023 18:45:54 +
Vincent Donnefort  wrote:

> The tracing ring-buffers can be stored on disk or sent to network
> without any copy via splice. However the later doesn't allow real time
> processing of the traces. A solution is to give userspace direct access
> to the ring-buffer pages via a mapping. An application can now become a
> consumer of the ring-buffer, in a similar fashion to what trace_pipe
> offers.
> 
> Attached to this cover letter an example of consuming read for a
> ring-buffer, using libtracefs.
> 

I'm still testing this, but I needed to add this patch to fix two bugs. One
is that you are calling rb_wakeup_waiters() for both the buffer and the
cpu_buffer, and it needs to know which one to use the container_of() macro.

The other is a "goto unlock" that unlocks two locks where only one was taken.

-- Steve

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 35f3736f660b..987ad7bd1e8b 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -389,6 +389,7 @@ struct rb_irq_work {
boolwaiters_pending;
boolfull_waiters_pending;
boolwakeup_full;
+   boolis_cpu_buffer;
 };
 
 /*
@@ -771,10 +772,20 @@ static void rb_update_meta_page(struct 
ring_buffer_per_cpu *cpu_buffer)
 static void rb_wake_up_waiters(struct irq_work *work)
 {
struct rb_irq_work *rbwork = container_of(work, struct rb_irq_work, 
work);
-   struct ring_buffer_per_cpu *cpu_buffer =
-   container_of(rbwork, struct ring_buffer_per_cpu, irq_work);
+   struct ring_buffer_per_cpu *cpu_buffer;
+   struct trace_buffer *buffer;
+   int cpu;
 
-   rb_update_meta_page(cpu_buffer);
+   if (rbwork->is_cpu_buffer) {
+   cpu_buffer = container_of(rbwork, struct ring_buffer_per_cpu, 
irq_work);
+   rb_update_meta_page(cpu_buffer);
+   } else {
+   buffer = container_of(rbwork, struct trace_buffer, irq_work);
+   for_each_buffer_cpu(buffer, cpu) {
+   cpu_buffer = buffer->buffers[cpu];
+   rb_update_meta_page(cpu_buffer);
+   }
+   }
 
wake_up_all(&rbwork->waiters);
if (rbwork->full_waiters_pending || rbwork->wakeup_full) {
@@ -1569,6 +1580,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long 
nr_pages, int cpu)
init_waitqueue_head(&cpu_buffer->irq_work.waiters);
init_waitqueue_head(&cpu_buffer->irq_work.full_waiters);
mutex_init(&cpu_buffer->mapping_lock);
+   cpu_buffer->irq_work.is_cpu_buffer = true;
 
bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
GFP_KERNEL, cpu_to_node(cpu));
@@ -6209,7 +6221,8 @@ int ring_buffer_map(struct trace_buffer *buffer, int cpu)
 
if (cpu_buffer->mapped) {
WRITE_ONCE(cpu_buffer->mapped, cpu_buffer->mapped + 1);
-   goto unlock;
+   mutex_unlock(&cpu_buffer->mapping_lock);
+   return 0;
}
 
/* prevent another thread from changing buffer sizes */

[PATCH v5 15/15] tracing: Update subbuffer with kilobytes not page order

2023-12-19 Thread Steven Rostedt

From: "Steven Rostedt (Google)" 

Using page order for deciding what the size of the ring buffer sub buffers
are is exposing a bit too much of the implementation. Although the sub
buffers are only allocated in orders of pages, allow the user to specify
the minimum size of each sub-buffer via kilobytes like they can with the
buffer size itself.

If the user specifies 3 via:

  echo 3 > buffer_subbuf_size_kb

Then the sub-buffer size will round up to 4kb (on a 4kb page size system).

If they specify:

  echo 6 > buffer_subbuf_size_kb

The sub-buffer size will become 8kb.

and so on.

Signed-off-by: Steven Rostedt (Google) 
---
 Documentation/trace/ftrace.rst| 46 ---
 kernel/trace/trace.c  | 38 +--
 ...fer_order.tc => ringbuffer_subbuf_size.tc} | 18 
 3 files changed, 54 insertions(+), 48 deletions(-)
 rename tools/testing/selftests/ftrace/test.d/00basic/{ringbuffer_order.tc => 
ringbuffer_subbuf_size.tc} (85%)

diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
index 231d26ceedb0..933e7efb9f1b 100644
--- a/Documentation/trace/ftrace.rst
+++ b/Documentation/trace/ftrace.rst
@@ -203,32 +203,26 @@ of ftrace. Here is a list of some of the key files:
 
This displays the total combined size of all the trace buffers.
 
-  buffer_subbuf_order:
-
-   This sets or displays the sub buffer page size order. The ring buffer
-   is broken up into several same size "sub buffers". An event can not be
-   bigger than the size of the sub buffer. Normally, the sub buffer is
-   the size of the architecture's page (4K on x86). The sub buffer also
-   contains meta data at the start which also limits the size of an event.
-   That means when the sub buffer is a page size, no event can be larger
-   than the page size minus the sub buffer meta data.
-
-   The buffer_subbuf_order allows the user to change the size of the sub
-   buffer. As the sub buffer is a set of pages by the power of 2, thus
-   the sub buffer total size is defined by the order:
-
-   order   size
-   
-   0   PAGE_SIZE
-   1   PAGE_SIZE * 2
-   2   PAGE_SIZE * 4
-   3   PAGE_SIZE * 8
-
-   Changing the order will change the sub buffer size allowing for events
-   to be larger than the page size.
-
-   Note: When changing the order, tracing is stopped and any data in the
-   ring buffer and the snapshot buffer will be discarded.
+  buffer_subbuf_size_kb:
+
+   This sets or displays the sub buffer size. The ring buffer is broken up
+   into several same size "sub buffers". An event can not be bigger than
+   the size of the sub buffer. Normally, the sub buffer is the size of the
+   architecture's page (4K on x86). The sub buffer also contains meta data
+   at the start which also limits the size of an event.  That means when
+   the sub buffer is a page size, no event can be larger than the page
+   size minus the sub buffer meta data.
+
+   Note, the buffer_subbuf_size_kb is a way for the user to specify the
+   minimum size of the subbuffer. The kernel may make it bigger due to the
+   implementation details, or simply fail the operation if the kernel can
+   not handle the request.
+
+   Changing the sub buffer size allows for events to be larger than the
+   page size.
+
+   Note: When changing the sub-buffer size, tracing is stopped and any
+   data in the ring buffer and the snapshot buffer will be discarded.
 
   free_buffer:
 
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 82303bd2bba1..46dbe22121e9 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -9384,42 +9384,54 @@ static const struct file_operations buffer_percent_fops 
= {
 };
 
 static ssize_t
-buffer_order_read(struct file *filp, char __user *ubuf, size_t cnt, loff_t 
*ppos)
+buffer_subbuf_size_read(struct file *filp, char __user *ubuf, size_t cnt, 
loff_t *ppos)
 {
struct trace_array *tr = filp->private_data;
+   size_t size;
char buf[64];
+   int order;
int r;
 
-   r = sprintf(buf, "%d\n", 
ring_buffer_subbuf_order_get(tr->array_buffer.buffer));
+   order = ring_buffer_subbuf_order_get(tr->array_buffer.buffer);
+   size = (PAGE_SIZE << order) / 1024;
+
+   r = sprintf(buf, "%zd\n", size);
 
return simple_read_from_buffer(ubuf, cnt, ppos, buf, r);
 }
 
 static ssize_t
-buffer_order_write(struct file *filp, const char __user *ubuf,
-  size_t cnt, loff_t *ppos)
+buffer_subbuf_size_write(struct file *filp, const char __user *ubuf,
+size_t cnt, loff_t *ppos)
 {
struct trace_array *tr = filp->private_data;
unsigned long val;
int old_order;
+   int order;
+   int pages;
int ret;
 
ret = kstrtoul_from_user(ubu

[PATCH v5 14/15] ringbuffer/selftest: Add basic selftest to test changing subbuf order

2023-12-19 Thread Steven Rostedt

From: "Steven Rostedt (Google)" 

Add a self test that will write into the trace buffer with differ trace
sub buffer order sizes.

Signed-off-by: Steven Rostedt (Google) 
---
 .../ftrace/test.d/00basic/ringbuffer_order.tc | 95 +++
 1 file changed, 95 insertions(+)
 create mode 100644 
tools/testing/selftests/ftrace/test.d/00basic/ringbuffer_order.tc

diff --git a/tools/testing/selftests/ftrace/test.d/00basic/ringbuffer_order.tc 
b/tools/testing/selftests/ftrace/test.d/00basic/ringbuffer_order.tc
new file mode 100644
index ..ecbcc810e6c1
--- /dev/null
+++ b/tools/testing/selftests/ftrace/test.d/00basic/ringbuffer_order.tc
@@ -0,0 +1,95 @@
+#!/bin/sh
+# SPDX-License-Identifier: GPL-2.0
+# description: Change the ringbuffer sub-buffer order
+# requires: buffer_subbuf_order
+# flags: instance
+
+get_buffer_data_size() {
+   sed -ne 's/^.*data.*size:\([0-9][0-9]*\).*/\1/p' events/header_page
+}
+
+get_buffer_data_offset() {
+   sed -ne 's/^.*data.*offset:\([0-9][0-9]*\).*/\1/p' events/header_page
+}
+
+get_event_header_size() {
+   type_len=`sed -ne 's/^.*type_len.*:[^0-9]*\([0-9][0-9]*\).*/\1/p' 
events/header_event`
+   time_len=`sed -ne 's/^.*time_delta.*:[^0-9]*\([0-9][0-9]*\).*/\1/p' 
events/header_event`
+   array_len=`sed -ne 's/^.*array.*:[^0-9]*\([0-9][0-9]*\).*/\1/p' 
events/header_event`
+   total_bits=$((type_len+time_len+array_len))
+   total_bits=$((total_bits+7))
+   echo $((total_bits/8))
+}
+
+get_print_event_buf_offset() {
+   sed -ne 's/^.*buf.*offset:\([0-9][0-9]*\).*/\1/p' 
events/ftrace/print/format
+}
+
+event_header_size=`get_event_header_size`
+print_header_size=`get_print_event_buf_offset`
+
+data_offset=`get_buffer_data_offset`
+
+marker_meta=$((event_header_size+print_header_size))
+
+make_str() {
+cnt=$1
+   printf -- 'X%.0s' $(seq $cnt)
+}
+
+write_buffer() {
+   size=$1
+
+   str=`make_str $size`
+
+   # clear the buffer
+   echo > trace
+
+   # write the string into the marker
+   echo $str > trace_marker
+
+   echo $str
+}
+
+test_buffer() {
+   orde=$1
+   page_size=$((4096< buffer_subbuf_order
+   test_buffer $a
+done
+
+echo $ORIG > buffer_subbuf_order
+
-- 
2.42.0

[PATCH v5 13/15] ring-buffer: Add documentation on the buffer_subbuf_order file

2023-12-19 Thread Steven Rostedt

From: "Steven Rostedt (Google)" 

Add to the documentation how to use the buffer_subbuf_order file to change
the size and how it affects what events can be added to the ring buffer.

Signed-off-by: Steven Rostedt (Google) 
---
 Documentation/trace/ftrace.rst | 27 +++
 1 file changed, 27 insertions(+)

diff --git a/Documentation/trace/ftrace.rst b/Documentation/trace/ftrace.rst
index 23572f6697c0..231d26ceedb0 100644
--- a/Documentation/trace/ftrace.rst
+++ b/Documentation/trace/ftrace.rst
@@ -203,6 +203,33 @@ of ftrace. Here is a list of some of the key files:
 
This displays the total combined size of all the trace buffers.
 
+  buffer_subbuf_order:
+
+   This sets or displays the sub buffer page size order. The ring buffer
+   is broken up into several same size "sub buffers". An event can not be
+   bigger than the size of the sub buffer. Normally, the sub buffer is
+   the size of the architecture's page (4K on x86). The sub buffer also
+   contains meta data at the start which also limits the size of an event.
+   That means when the sub buffer is a page size, no event can be larger
+   than the page size minus the sub buffer meta data.
+
+   The buffer_subbuf_order allows the user to change the size of the sub
+   buffer. As the sub buffer is a set of pages by the power of 2, thus
+   the sub buffer total size is defined by the order:
+
+   order   size
+   
+   0   PAGE_SIZE
+   1   PAGE_SIZE * 2
+   2   PAGE_SIZE * 4
+   3   PAGE_SIZE * 8
+
+   Changing the order will change the sub buffer size allowing for events
+   to be larger than the page size.
+
+   Note: When changing the order, tracing is stopped and any data in the
+   ring buffer and the snapshot buffer will be discarded.
+
   free_buffer:
 
If a process is performing tracing, and the ring buffer should be
-- 
2.42.0

[PATCH v5 12/15] ring-buffer: Just update the subbuffers when changing their allocation order

2023-12-19 Thread Steven Rostedt

From: "Steven Rostedt (Google)" 

The ring_buffer_subbuf_order_set() was creating ring_buffer_per_cpu
cpu_buffers with the new subbuffers with the updated order, and if they
all successfully were created, then they the ring_buffer's per_cpu buffers
would be freed and replaced by them.

The problem is that the freed per_cpu buffers contains state that would be
lost. Running the following commands:

1. # echo 3 > /sys/kernel/tracing/buffer_subbuf_order
2. # echo 0 > /sys/kernel/tracing/tracing_cpumask
3. # echo 1 > /sys/kernel/tracing/snapshot
4. # echo ff > /sys/kernel/tracing/tracing_cpumask
5. # echo test > /sys/kernel/tracing/trace_marker

Would result in:

 -bash: echo: write error: Bad file descriptor

That's because the state of the per_cpu buffers of the snapshot buffer is
lost when the order is changed (the order of a freed snapshot buffer goes
to 0 to save memory, and when the snapshot buffer is allocated again, it
goes back to what the main buffer is).

In operation 2, the snapshot buffers were set to "disable" (as all the
ring buffers CPUs were disabled).

In operation 3, the snapshot is allocated and a call to
ring_buffer_subbuf_order_set() replaced the per_cpu buffers losing the
"record_disable" count.

When it was enabled again, the atomic_dec(&cpu_buffer->record_disable) was
decrementing a zero, setting it to -1. Writing 1 into the snapshot would
swap the snapshot buffer with the main buffer, so now the main buffer is
"disabled", and nothing can write to the ring buffer anymore.

Instead of creating new per_cpu buffers and losing the state of the old
buffers, basically do what the resize does and just allocate new subbuf
pages into the new_pages link list of the per_cpu buffer and if they all
succeed, then replace the old sub buffers with the new ones. This keeps
the per_cpu buffer descriptor in tact and by doing so, keeps its state.

Fixes: TBD ("ring-buffer: Set new size of the ring buffer sub page")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 88 ++
 1 file changed, 71 insertions(+), 17 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 23ead7602da0..7ee6779bf292 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5856,11 +5856,11 @@ EXPORT_SYMBOL_GPL(ring_buffer_subbuf_order_get);
  */
 int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order)
 {
-   struct ring_buffer_per_cpu **cpu_buffers;
+   struct ring_buffer_per_cpu *cpu_buffer;
+   struct buffer_page *bpage, *tmp;
int old_order, old_size;
int nr_pages;
int psize;
-   int bsize;
int err;
int cpu;
 
@@ -5874,11 +5874,6 @@ int ring_buffer_subbuf_order_set(struct trace_buffer 
*buffer, int order)
if (psize <= BUF_PAGE_HDR_SIZE)
return -EINVAL;
 
-   bsize = sizeof(void *) * buffer->cpus;
-   cpu_buffers = kzalloc(bsize, GFP_KERNEL);
-   if (!cpu_buffers)
-   return -ENOMEM;
-
old_order = buffer->subbuf_order;
old_size = buffer->subbuf_size;
 
@@ -5894,33 +5889,88 @@ int ring_buffer_subbuf_order_set(struct trace_buffer 
*buffer, int order)
 
/* Make sure all new buffers are allocated, before deleting the old 
ones */
for_each_buffer_cpu(buffer, cpu) {
+
if (!cpumask_test_cpu(cpu, buffer->cpumask))
continue;
 
+   cpu_buffer = buffer->buffers[cpu];
+
/* Update the number of pages to match the new size */
nr_pages = old_size * buffer->buffers[cpu]->nr_pages;
nr_pages = DIV_ROUND_UP(nr_pages, buffer->subbuf_size);
 
-   cpu_buffers[cpu] = rb_allocate_cpu_buffer(buffer, nr_pages, 
cpu);
-   if (!cpu_buffers[cpu]) {
+   /* we need a minimum of two pages */
+   if (nr_pages < 2)
+   nr_pages = 2;
+
+   cpu_buffer->nr_pages_to_update = nr_pages;
+
+   /* Include the reader page */
+   nr_pages++;
+
+   /* Allocate the new size buffer */
+   INIT_LIST_HEAD(&cpu_buffer->new_pages);
+   if (__rb_allocate_pages(cpu_buffer, nr_pages,
+   &cpu_buffer->new_pages)) {
+   /* not enough memory for new pages */
err = -ENOMEM;
goto error;
}
}
 
for_each_buffer_cpu(buffer, cpu) {
+
if (!cpumask_test_cpu(cpu, buffer->cpumask))
continue;
 
-   rb_free_cpu_buffer(buffer->buffers[cpu]);
-   buffer->buffers[cpu] = cpu_buffers[cpu];
+   cpu_buffer = buffer->buffers[cpu];
+
+   /* Clear the head bit to make the link list normal to read */
+   rb_head_page_deactivate(cpu_buffer);
+
+   /* Now walk the list and f

[PATCH v5 09/15] tracing: Update snapshot order along with main buffer order

2023-12-19 Thread Steven Rostedt

From: "Steven Rostedt (Google)" 

When updating the order of the sub buffers for the main buffer, make sure
that if the snapshot buffer exists, that it gets its order updated as
well.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace.c | 45 ++--
 1 file changed, 43 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 4dcdc30aa110..2439e00aa4ce 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1263,10 +1263,17 @@ static void set_buffer_entries(struct array_buffer 
*buf, unsigned long val);
 
 int tracing_alloc_snapshot_instance(struct trace_array *tr)
 {
+   int order;
int ret;
 
if (!tr->allocated_snapshot) {
 
+   /* Make the snapshot buffer have the same order as main buffer 
*/
+   order = ring_buffer_subbuf_order_get(tr->array_buffer.buffer);
+   ret = ring_buffer_subbuf_order_set(tr->max_buffer.buffer, 
order);
+   if (ret < 0)
+   return ret;
+
/* allocate spare buffer */
ret = resize_buffer_duplicate_size(&tr->max_buffer,
   &tr->array_buffer, RING_BUFFER_ALL_CPUS);
@@ -1286,6 +1293,7 @@ static void free_snapshot(struct trace_array *tr)
 * The max_tr ring buffer has some state (e.g. ring->clock) and
 * we want preserve it.
 */
+   ring_buffer_subbuf_order_set(tr->max_buffer.buffer, 0);
ring_buffer_resize(tr->max_buffer.buffer, 1, RING_BUFFER_ALL_CPUS);
set_buffer_entries(&tr->max_buffer, 1);
tracing_reset_online_cpus(&tr->max_buffer);
@@ -9393,6 +9401,7 @@ buffer_order_write(struct file *filp, const char __user 
*ubuf,
 {
struct trace_array *tr = filp->private_data;
unsigned long val;
+   int old_order;
int ret;
 
ret = kstrtoul_from_user(ubuf, cnt, 10, &val);
@@ -9403,12 +9412,44 @@ buffer_order_write(struct file *filp, const char __user 
*ubuf,
if (val < 0 || val > 7)
return -EINVAL;
 
+   old_order = ring_buffer_subbuf_order_get(tr->array_buffer.buffer);
+   if (old_order == val)
+   return 0;
+
ret = ring_buffer_subbuf_order_set(tr->array_buffer.buffer, val);
if (ret)
-   return ret;
+   return 0;
 
-   (*ppos)++;
+#ifdef CONFIG_TRACER_MAX_TRACE
+
+   if (!tr->allocated_snapshot)
+   goto out_max;
 
+   ret = ring_buffer_subbuf_order_set(tr->max_buffer.buffer, val);
+   if (ret) {
+   /* Put back the old order */
+   cnt = ring_buffer_subbuf_order_set(tr->array_buffer.buffer, 
old_order);
+   if (WARN_ON_ONCE(cnt)) {
+   /*
+* AARGH! We are left with different orders!
+* The max buffer is our "snapshot" buffer.
+* When a tracer needs a snapshot (one of the
+* latency tracers), it swaps the max buffer
+* with the saved snap shot. We succeeded to
+* update the order of the main buffer, but failed to
+* update the order of the max buffer. But when we tried
+* to reset the main buffer to the original size, we
+* failed there too. This is very unlikely to
+* happen, but if it does, warn and kill all
+* tracing.
+*/
+   tracing_disabled = 1;
+   }
+   return ret;
+   }
+ out_max:
+#endif
+   (*ppos)++;
return cnt;
 }
 
-- 
2.42.0

[PATCH v5 11/15] ring-buffer: Keep the same size when updating the order

2023-12-19 Thread Steven Rostedt

From: "Steven Rostedt (Google)" 

The function ring_buffer_subbuf_order_set() just updated the sub-buffers
to the new size, but this also changes the size of the buffer in doing so.
As the size is determined by nr_pages * subbuf_size. If the subbuf_size is
increased without decreasing the nr_pages, this causes the total size of
the buffer to increase.

This broke the latency tracers as the snapshot needs to be the same size
as the main buffer. The size of the snapshot buffer is only expanded when
needed, and because the order is still the same, the size becomes out of
sync with the main buffer, as the main buffer increased in size without
the tracing system knowing.

Calculate the nr_pages to allocate with the new subbuf_size to be
buffer_size / new_subbuf_size.

Fixes: TBD ("ring-buffer: Set new size of the ring buffer sub page")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index fdcd171b09b5..23ead7602da0 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5897,7 +5897,10 @@ int ring_buffer_subbuf_order_set(struct trace_buffer 
*buffer, int order)
if (!cpumask_test_cpu(cpu, buffer->cpumask))
continue;
 
-   nr_pages = buffer->buffers[cpu]->nr_pages;
+   /* Update the number of pages to match the new size */
+   nr_pages = old_size * buffer->buffers[cpu]->nr_pages;
+   nr_pages = DIV_ROUND_UP(nr_pages, buffer->subbuf_size);
+
cpu_buffers[cpu] = rb_allocate_cpu_buffer(buffer, nr_pages, 
cpu);
if (!cpu_buffers[cpu]) {
err = -ENOMEM;
-- 
2.42.0

[PATCH v5 10/15] tracing: Stop the tracing while changing the ring buffer subbuf size

2023-12-19 Thread Steven Rostedt

From: "Steven Rostedt (Google)" 

Because the main buffer and the snapshot buffer need to be the same for
some tracers, otherwise it will fail and disable all tracing, the tracers
need to be stopped while updating the sub buffer sizes so that the tracers
see the main and snapshot buffers with the same sub buffer size.

Fixes: TBD ("ring-buffer: Add interface for configuring trace sub buffer size")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 2439e00aa4ce..82303bd2bba1 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -9412,13 +9412,16 @@ buffer_order_write(struct file *filp, const char __user 
*ubuf,
if (val < 0 || val > 7)
return -EINVAL;
 
+   /* Do not allow tracing while changing the order of the ring buffer */
+   tracing_stop_tr(tr);
+
old_order = ring_buffer_subbuf_order_get(tr->array_buffer.buffer);
if (old_order == val)
-   return 0;
+   goto out;
 
ret = ring_buffer_subbuf_order_set(tr->array_buffer.buffer, val);
if (ret)
-   return 0;
+   goto out;
 
 #ifdef CONFIG_TRACER_MAX_TRACE
 
@@ -9445,11 +9448,15 @@ buffer_order_write(struct file *filp, const char __user 
*ubuf,
 */
tracing_disabled = 1;
}
-   return ret;
+   goto out;
}
  out_max:
 #endif
(*ppos)++;
+ out:
+   if (ret)
+   cnt = ret;
+   tracing_start_tr(tr);
return cnt;
 }
 
-- 
2.42.0

[PATCH v5 08/15] ring-buffer: Make sure the spare sub buffer used for reads has same size

2023-12-19 Thread Steven Rostedt

From: "Steven Rostedt (Google)" 

Now that the ring buffer specifies the size of its sub buffers, they all
need to be the same size. When doing a read, a swap is done with a spare
page. Make sure they are the same size before doing the swap, otherwise
the read will fail.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 711095aa731d..4dcdc30aa110 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -7582,6 +7582,7 @@ struct ftrace_buffer_info {
struct trace_iterator   iter;
void*spare;
unsigned intspare_cpu;
+   unsigned intspare_size;
unsigned intread;
 };
 
@@ -8301,6 +8302,15 @@ tracing_buffers_read(struct file *filp, char __user 
*ubuf,
 
page_size = ring_buffer_subbuf_size_get(iter->array_buffer->buffer);
 
+   /* Make sure the spare matches the current sub buffer size */
+   if (info->spare) {
+   if (page_size != info->spare_size) {
+   ring_buffer_free_read_page(iter->array_buffer->buffer,
+  info->spare_cpu, 
info->spare);
+   info->spare = NULL;
+   }
+   }
+
if (!info->spare) {
info->spare = 
ring_buffer_alloc_read_page(iter->array_buffer->buffer,
  iter->cpu_file);
@@ -8309,6 +8319,7 @@ tracing_buffers_read(struct file *filp, char __user *ubuf,
info->spare = NULL;
} else {
info->spare_cpu = iter->cpu_file;
+   info->spare_size = page_size;
}
}
if (!info->spare)
-- 
2.42.0

[PATCH v5 07/15] ring-buffer: Do no swap cpu buffers if order is different

2023-12-19 Thread Steven Rostedt

From: "Steven Rostedt (Google)" 

As all the subbuffer order (subbuffer sizes) must be the same throughout
the ring buffer, check the order of the buffers that are doing a CPU
buffer swap in ring_buffer_swap_cpu() to make sure they are the same.

If the are not the same, then fail to do the swap, otherwise the ring
buffer will think the CPU buffer has a specific subbuffer size when it
does not.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 3c11e8e811ed..fdcd171b09b5 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5417,6 +5417,9 @@ int ring_buffer_swap_cpu(struct trace_buffer *buffer_a,
if (cpu_buffer_a->nr_pages != cpu_buffer_b->nr_pages)
goto out;
 
+   if (buffer_a->subbuf_order != buffer_b->subbuf_order)
+   goto out;
+
ret = -EAGAIN;
 
if (atomic_read(&buffer_a->record_disabled))
-- 
2.42.0

[PATCH v5 06/15] ring-buffer: Clear pages on error in ring_buffer_subbuf_order_set() failure

2023-12-19 Thread Steven Rostedt

From: "Steven Rostedt (Google)" 

On failure to allocate ring buffer pages, the pointer to the CPU buffer
pages is freed, but the pages that were allocated previously were not.
Make sure they are freed too.

Fixes: TBD ("tracing: Set new size of the ring buffer sub page")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index c2afcf98ea91..3c11e8e811ed 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5927,6 +5927,7 @@ int ring_buffer_subbuf_order_set(struct trace_buffer 
*buffer, int order)
for_each_buffer_cpu(buffer, cpu) {
if (!cpu_buffers[cpu])
continue;
+   rb_free_cpu_buffer(cpu_buffers[cpu]);
kfree(cpu_buffers[cpu]);
}
kfree(cpu_buffers);
-- 
2.42.0

[PATCH v5 04/15] ring-buffer: Set new size of the ring buffer sub page

2023-12-19 Thread Steven Rostedt

From: "Tzvetomir Stoyanov (VMware)" 

There are two approaches when changing the size of the ring buffer
sub page:
 1. Destroying all pages and allocating new pages with the new size.
 2. Allocating new pages, copying the content of the old pages before
destroying them.
The first approach is easier, it is selected in the proposed
implementation. Changing the ring buffer sub page size is supposed to
not happen frequently. Usually, that size should be set only once,
when the buffer is not in use yet and is supposed to be empty.

Link: 
https://lore.kernel.org/linux-trace-devel/20211213094825.61876-5-tz.stoya...@gmail.com

Signed-off-by: Tzvetomir Stoyanov (VMware) 
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 80 ++
 1 file changed, 73 insertions(+), 7 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 20fc0121735d..b928bd03dd0e 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -332,6 +332,7 @@ struct buffer_page {
unsigned read;  /* index for next read */
local_t  entries;   /* entries on this page */
unsigned longreal_end;  /* real end of data */
+   unsigned order; /* order of the page */
struct buffer_data_page *page;  /* Actual data page */
 };
 
@@ -362,7 +363,7 @@ static __always_inline unsigned int rb_page_commit(struct 
buffer_page *bpage)
 
 static void free_buffer_page(struct buffer_page *bpage)
 {
-   free_page((unsigned long)bpage->page);
+   free_pages((unsigned long)bpage->page, bpage->order);
kfree(bpage);
 }
 
@@ -1460,10 +1461,12 @@ static int __rb_allocate_pages(struct 
ring_buffer_per_cpu *cpu_buffer,
 
list_add(&bpage->list, pages);
 
-   page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu), mflags, 
0);
+   page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu), mflags,
+   cpu_buffer->buffer->subbuf_order);
if (!page)
goto free_pages;
bpage->page = page_address(page);
+   bpage->order = cpu_buffer->buffer->subbuf_order;
rb_init_page(bpage->page);
 
if (user_thread && fatal_signal_pending(current))
@@ -1542,7 +1545,8 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long 
nr_pages, int cpu)
rb_check_bpage(cpu_buffer, bpage);
 
cpu_buffer->reader_page = bpage;
-   page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, 0);
+
+   page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, 
cpu_buffer->buffer->subbuf_order);
if (!page)
goto fail_free_reader;
bpage->page = page_address(page);
@@ -1626,6 +1630,7 @@ struct trace_buffer *__ring_buffer_alloc(unsigned long 
size, unsigned flags,
goto fail_free_buffer;
 
/* Default buffer page size - one system page */
+   buffer->subbuf_order = 0;
buffer->subbuf_size = PAGE_SIZE - BUF_PAGE_HDR_SIZE;
 
/* Max payload is buffer page size - header (8bytes) */
@@ -5503,8 +5508,8 @@ void *ring_buffer_alloc_read_page(struct trace_buffer 
*buffer, int cpu)
if (bpage)
goto out;
 
-   page = alloc_pages_node(cpu_to_node(cpu),
-   GFP_KERNEL | __GFP_NORETRY, 0);
+   page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_NORETRY,
+   cpu_buffer->buffer->subbuf_order);
if (!page)
return ERR_PTR(-ENOMEM);
 
@@ -5553,7 +5558,7 @@ void ring_buffer_free_read_page(struct trace_buffer 
*buffer, int cpu, void *data
local_irq_restore(flags);
 
  out:
-   free_page((unsigned long)bpage);
+   free_pages((unsigned long)bpage, buffer->subbuf_order);
 }
 EXPORT_SYMBOL_GPL(ring_buffer_free_read_page);
 
@@ -5813,7 +5818,13 @@ EXPORT_SYMBOL_GPL(ring_buffer_subbuf_order_get);
  */
 int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order)
 {
+   struct ring_buffer_per_cpu **cpu_buffers;
+   int old_order, old_size;
+   int nr_pages;
int psize;
+   int bsize;
+   int err;
+   int cpu;
 
if (!buffer || order < 0)
return -EINVAL;
@@ -5825,12 +5836,67 @@ int ring_buffer_subbuf_order_set(struct trace_buffer 
*buffer, int order)
if (psize <= BUF_PAGE_HDR_SIZE)
return -EINVAL;
 
+   bsize = sizeof(void *) * buffer->cpus;
+   cpu_buffers = kzalloc(bsize, GFP_KERNEL);
+   if (!cpu_buffers)
+   return -ENOMEM;
+
+   old_order = buffer->subbuf_order;
+   old_size = buffer->subbuf_size;
+
+   /* prevent another thread from changing buffer sizes */
+   mutex_lock(&buffer->mutex);
+   atomic_inc(&buffer->record_disabled);
+
+   /* Make sure all commits have finished */
+   synchronize_rcu();
+
buffer->subb

[PATCH v5 05/15] ring-buffer: Read and write to ring buffers with custom sub buffer size

2023-12-19 Thread Steven Rostedt

From: "Tzvetomir Stoyanov (VMware)" 

As the size of the ring sub buffer page can be changed dynamically,
the logic that reads and writes to the buffer should be fixed to take
that into account. Some internal ring buffer APIs are changed:
 ring_buffer_alloc_read_page()
 ring_buffer_free_read_page()
 ring_buffer_read_page()
A new API is introduced:
 ring_buffer_read_page_data()

Link: 
https://lore.kernel.org/linux-trace-devel/20211213094825.61876-6-tz.stoya...@gmail.com

Signed-off-by: Tzvetomir Stoyanov (VMware) 
[ Fixed kerneldoc on data_page parameter in ring_buffer_free_read_page() ]
Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/ring_buffer.h  | 11 ++--
 kernel/trace/ring_buffer.c   | 75 
 kernel/trace/ring_buffer_benchmark.c | 10 ++--
 kernel/trace/trace.c | 34 +++--
 4 files changed, 89 insertions(+), 41 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index 12573306b889..fa802db216f9 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -192,10 +192,15 @@ bool ring_buffer_time_stamp_abs(struct trace_buffer 
*buffer);
 size_t ring_buffer_nr_pages(struct trace_buffer *buffer, int cpu);
 size_t ring_buffer_nr_dirty_pages(struct trace_buffer *buffer, int cpu);
 
-void *ring_buffer_alloc_read_page(struct trace_buffer *buffer, int cpu);
-void ring_buffer_free_read_page(struct trace_buffer *buffer, int cpu, void 
*data);
-int ring_buffer_read_page(struct trace_buffer *buffer, void **data_page,
+struct buffer_data_read_page;
+struct buffer_data_read_page *
+ring_buffer_alloc_read_page(struct trace_buffer *buffer, int cpu);
+void ring_buffer_free_read_page(struct trace_buffer *buffer, int cpu,
+   struct buffer_data_read_page *page);
+int ring_buffer_read_page(struct trace_buffer *buffer,
+ struct buffer_data_read_page *data_page,
  size_t len, int cpu, int full);
+void *ring_buffer_read_page_data(struct buffer_data_read_page *page);
 
 struct trace_seq;
 
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index b928bd03dd0e..c2afcf98ea91 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -318,6 +318,11 @@ struct buffer_data_page {
unsigned chardata[] RB_ALIGN_DATA;  /* data of buffer page */
 };
 
+struct buffer_data_read_page {
+   unsignedorder;  /* order of the page */
+   struct buffer_data_page *data;  /* actual data, stored in this page */
+};
+
 /*
  * Note, the buffer_page list must be first. The buffer pages
  * are allocated in cache lines, which means that each buffer
@@ -5483,40 +5488,48 @@ EXPORT_SYMBOL_GPL(ring_buffer_swap_cpu);
  * Returns:
  *  The page allocated, or ERR_PTR
  */
-void *ring_buffer_alloc_read_page(struct trace_buffer *buffer, int cpu)
+struct buffer_data_read_page *
+ring_buffer_alloc_read_page(struct trace_buffer *buffer, int cpu)
 {
struct ring_buffer_per_cpu *cpu_buffer;
-   struct buffer_data_page *bpage = NULL;
+   struct buffer_data_read_page *bpage = NULL;
unsigned long flags;
struct page *page;
 
if (!cpumask_test_cpu(cpu, buffer->cpumask))
return ERR_PTR(-ENODEV);
 
+   bpage = kzalloc(sizeof(*bpage), GFP_KERNEL);
+   if (!bpage)
+   return ERR_PTR(-ENOMEM);
+
+   bpage->order = buffer->subbuf_order;
cpu_buffer = buffer->buffers[cpu];
local_irq_save(flags);
arch_spin_lock(&cpu_buffer->lock);
 
if (cpu_buffer->free_page) {
-   bpage = cpu_buffer->free_page;
+   bpage->data = cpu_buffer->free_page;
cpu_buffer->free_page = NULL;
}
 
arch_spin_unlock(&cpu_buffer->lock);
local_irq_restore(flags);
 
-   if (bpage)
+   if (bpage->data)
goto out;
 
page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_NORETRY,
cpu_buffer->buffer->subbuf_order);
-   if (!page)
+   if (!page) {
+   kfree(bpage);
return ERR_PTR(-ENOMEM);
+   }
 
-   bpage = page_address(page);
+   bpage->data = page_address(page);
 
  out:
-   rb_init_page(bpage);
+   rb_init_page(bpage->data);
 
return bpage;
 }
@@ -5526,14 +5539,15 @@ EXPORT_SYMBOL_GPL(ring_buffer_alloc_read_page);
  * ring_buffer_free_read_page - free an allocated read page
  * @buffer: the buffer the page was allocate for
  * @cpu: the cpu buffer the page came from
- * @data: the page to free
+ * @data_page: the page to free
  *
  * Free a page allocated from ring_buffer_alloc_read_page.
  */
-void ring_buffer_free_read_page(struct trace_buffer *buffer, int cpu, void 
*data)
+void ring_buffer_free_read_page(struct trace_buffer *buffer, int cpu,
+   struct buffer_data_read_page *data_page)
 {
struct ring_buffe

[PATCH v5 03/15] ring-buffer: Add interface for configuring trace sub buffer size

2023-12-19 Thread Steven Rostedt

From: "Tzvetomir Stoyanov (VMware)" 

The trace ring buffer sub page size can be configured, per trace
instance. A new ftrace file "buffer_subbuf_order" is added to get and
set the size of the ring buffer sub page for current trace instance.
The size must be an order of system page size, that's why the new
interface works with system page order, instead of absolute page size:
0 means the ring buffer sub page is equal to 1 system page and so
forth:
0 - 1 system page
1 - 2 system pages
2 - 4 system pages
...
The ring buffer sub page size is limited between 1 and 128 system
pages. The default value is 1 system page.
New ring buffer APIs are introduced:
 ring_buffer_subbuf_order_set()
 ring_buffer_subbuf_order_get()
 ring_buffer_subbuf_size_get()

Link: 
https://lore.kernel.org/linux-trace-devel/20211213094825.61876-4-tz.stoya...@gmail.com

Signed-off-by: Tzvetomir Stoyanov (VMware) 
Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/ring_buffer.h |  4 ++
 kernel/trace/ring_buffer.c  | 73 +
 kernel/trace/trace.c| 48 
 3 files changed, 125 insertions(+)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index ce46218ce46d..12573306b889 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -202,6 +202,10 @@ struct trace_seq;
 int ring_buffer_print_entry_header(struct trace_seq *s);
 int ring_buffer_print_page_header(struct trace_buffer *buffer, struct 
trace_seq *s);
 
+int ring_buffer_subbuf_order_get(struct trace_buffer *buffer);
+int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order);
+int ring_buffer_subbuf_size_get(struct trace_buffer *buffer);
+
 enum ring_buffer_flags {
RB_FL_OVERWRITE = 1 << 0,
 };
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index d9f656502400..20fc0121735d 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -507,6 +507,7 @@ struct trace_buffer {
booltime_stamp_abs;
 
unsigned intsubbuf_size;
+   unsigned intsubbuf_order;
unsigned intmax_data_size;
 };
 
@@ -5761,6 +5762,78 @@ int ring_buffer_read_page(struct trace_buffer *buffer,
 }
 EXPORT_SYMBOL_GPL(ring_buffer_read_page);
 
+/**
+ * ring_buffer_subbuf_size_get - get size of the sub buffer.
+ * @buffer: the buffer to get the sub buffer size from
+ *
+ * Returns size of the sub buffer, in bytes.
+ */
+int ring_buffer_subbuf_size_get(struct trace_buffer *buffer)
+{
+   return buffer->subbuf_size + BUF_PAGE_HDR_SIZE;
+}
+EXPORT_SYMBOL_GPL(ring_buffer_subbuf_size_get);
+
+/**
+ * ring_buffer_subbuf_order_get - get order of system sub pages in one buffer 
page.
+ * @buffer: The ring_buffer to get the system sub page order from
+ *
+ * By default, one ring buffer sub page equals to one system page. This 
parameter
+ * is configurable, per ring buffer. The size of the ring buffer sub page can 
be
+ * extended, but must be an order of system page size.
+ *
+ * Returns the order of buffer sub page size, in system pages:
+ * 0 means the sub buffer size is 1 system page and so forth.
+ * In case of an error < 0 is returned.
+ */
+int ring_buffer_subbuf_order_get(struct trace_buffer *buffer)
+{
+   if (!buffer)
+   return -EINVAL;
+
+   return buffer->subbuf_order;
+}
+EXPORT_SYMBOL_GPL(ring_buffer_subbuf_order_get);
+
+/**
+ * ring_buffer_subbuf_order_set - set the size of ring buffer sub page.
+ * @buffer: The ring_buffer to set the new page size.
+ * @order: Order of the system pages in one sub buffer page
+ *
+ * By default, one ring buffer pages equals to one system page. This API can be
+ * used to set new size of the ring buffer page. The size must be order of
+ * system page size, that's why the input parameter @order is the order of
+ * system pages that are allocated for one ring buffer page:
+ *  0 - 1 system page
+ *  1 - 2 system pages
+ *  3 - 4 system pages
+ *  ...
+ *
+ * Returns 0 on success or < 0 in case of an error.
+ */
+int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order)
+{
+   int psize;
+
+   if (!buffer || order < 0)
+   return -EINVAL;
+
+   if (buffer->subbuf_order == order)
+   return 0;
+
+   psize = (1 << order) * PAGE_SIZE;
+   if (psize <= BUF_PAGE_HDR_SIZE)
+   return -EINVAL;
+
+   buffer->subbuf_order = order;
+   buffer->subbuf_size = psize - BUF_PAGE_HDR_SIZE;
+
+   /* Todo: reset the buffer with the new page size */
+
+   return 0;
+}
+EXPORT_SYMBOL_GPL(ring_buffer_subbuf_order_set);
+
 /*
  * We only allocate new buffers, never free them if the CPU goes down.
  * If we were to free the buffer, then the user would lose any trace that was 
in
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 76dd0a4c8cb5..a010aba4c4a4 100644
--- a/kernel/trace/trace.c
+++ b/kernel/tra

[PATCH v5 02/15] ring-buffer: Page size per ring buffer

2023-12-19 Thread Steven Rostedt

From: "Tzvetomir Stoyanov (VMware)" 

Currently the size of one sub buffer page is global for all buffers and
it is hard coded to one system page. In order to introduce configurable
ring buffer sub page size, the internal logic should be refactored to
work with sub page size per ring buffer.

Link: 
https://lore.kernel.org/linux-trace-devel/20211213094825.61876-3-tz.stoya...@gmail.com

Signed-off-by: Tzvetomir Stoyanov (VMware) 
Signed-off-by: Steven Rostedt (Google) 
---
 include/linux/ring_buffer.h |  2 +-
 kernel/trace/ring_buffer.c  | 68 +
 kernel/trace/trace.c|  2 +-
 kernel/trace/trace.h|  1 +
 kernel/trace/trace_events.c | 59 
 5 files changed, 86 insertions(+), 46 deletions(-)

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index b1b03b2c0f08..ce46218ce46d 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -200,7 +200,7 @@ int ring_buffer_read_page(struct trace_buffer *buffer, void 
**data_page,
 struct trace_seq;
 
 int ring_buffer_print_entry_header(struct trace_seq *s);
-int ring_buffer_print_page_header(struct trace_seq *s);
+int ring_buffer_print_page_header(struct trace_buffer *buffer, struct 
trace_seq *s);
 
 enum ring_buffer_flags {
RB_FL_OVERWRITE = 1 << 0,
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 2400c8e68fd3..d9f656502400 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -374,11 +374,6 @@ static inline bool test_time_stamp(u64 delta)
return !!(delta & TS_DELTA_TEST);
 }
 
-#define BUF_PAGE_SIZE (PAGE_SIZE - BUF_PAGE_HDR_SIZE)
-
-/* Max payload is BUF_PAGE_SIZE - header (8bytes) */
-#define BUF_MAX_DATA_SIZE (BUF_PAGE_SIZE - (sizeof(u32) * 2))
-
 struct rb_irq_work {
struct irq_work work;
wait_queue_head_t   waiters;
@@ -510,6 +505,9 @@ struct trace_buffer {
 
struct rb_irq_work  irq_work;
booltime_stamp_abs;
+
+   unsigned intsubbuf_size;
+   unsigned intmax_data_size;
 };
 
 struct ring_buffer_iter {
@@ -523,10 +521,11 @@ struct ring_buffer_iter {
u64 read_stamp;
u64 page_stamp;
struct ring_buffer_event*event;
+   size_t  event_size;
int missed_events;
 };
 
-int ring_buffer_print_page_header(struct trace_seq *s)
+int ring_buffer_print_page_header(struct trace_buffer *buffer, struct 
trace_seq *s)
 {
struct buffer_data_page field;
 
@@ -550,7 +549,7 @@ int ring_buffer_print_page_header(struct trace_seq *s)
trace_seq_printf(s, "\tfield: char data;\t"
 "offset:%u;\tsize:%u;\tsigned:%u;\n",
 (unsigned int)offsetof(typeof(field), data),
-(unsigned int)BUF_PAGE_SIZE,
+(unsigned int)buffer->subbuf_size,
 (unsigned int)is_signed_type(char));
 
return !trace_seq_has_overflowed(s);
@@ -1625,7 +1624,13 @@ struct trace_buffer *__ring_buffer_alloc(unsigned long 
size, unsigned flags,
if (!zalloc_cpumask_var(&buffer->cpumask, GFP_KERNEL))
goto fail_free_buffer;
 
-   nr_pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE);
+   /* Default buffer page size - one system page */
+   buffer->subbuf_size = PAGE_SIZE - BUF_PAGE_HDR_SIZE;
+
+   /* Max payload is buffer page size - header (8bytes) */
+   buffer->max_data_size = buffer->subbuf_size - (sizeof(u32) * 2);
+
+   nr_pages = DIV_ROUND_UP(size, buffer->subbuf_size);
buffer->flags = flags;
buffer->clock = trace_clock_local;
buffer->reader_lock_key = key;
@@ -1944,7 +1949,7 @@ static void update_pages_handler(struct work_struct *work)
  * @size: the new size.
  * @cpu_id: the cpu buffer to resize
  *
- * Minimum size is 2 * BUF_PAGE_SIZE.
+ * Minimum size is 2 * buffer->subbuf_size.
  *
  * Returns 0 on success and < 0 on failure.
  */
@@ -1966,7 +1971,7 @@ int ring_buffer_resize(struct trace_buffer *buffer, 
unsigned long size,
!cpumask_test_cpu(cpu_id, buffer->cpumask))
return 0;
 
-   nr_pages = DIV_ROUND_UP(size, BUF_PAGE_SIZE);
+   nr_pages = DIV_ROUND_UP(size, buffer->subbuf_size);
 
/* we need a minimum of two pages */
if (nr_pages < 2)
@@ -2213,7 +2218,7 @@ rb_iter_head_event(struct ring_buffer_iter *iter)
 */
barrier();
 
-   if ((iter->head + length) > commit || length > BUF_PAGE_SIZE)
+   if ((iter->head + length) > commit || length > iter->event_size)
/* Writer corrupted the read? */
goto reset;
 
@@ -2446,6 +2451,7 @@ static inline void
 rb_reset_tail(struct ring_buffer_per_cpu *cpu_buffer,
  unsigned long tai

[PATCH v5 01/15] ring-buffer: Refactor ring buffer implementation

2023-12-19 Thread Steven Rostedt

From: "Tzvetomir Stoyanov (VMware)" 

In order to introduce sub-buffer size per ring buffer, some internal
refactoring is needed. As ring_buffer_print_page_header() will depend on
the trace_buffer structure, it is moved after the structure definition.

Link: 
https://lore.kernel.org/linux-trace-devel/20211213094825.61876-2-tz.stoya...@gmail.com

Signed-off-by: Tzvetomir Stoyanov (VMware) 
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 60 +++---
 1 file changed, 30 insertions(+), 30 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index f7dc74e45ebf..2400c8e68fd3 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -379,36 +379,6 @@ static inline bool test_time_stamp(u64 delta)
 /* Max payload is BUF_PAGE_SIZE - header (8bytes) */
 #define BUF_MAX_DATA_SIZE (BUF_PAGE_SIZE - (sizeof(u32) * 2))
 
-int ring_buffer_print_page_header(struct trace_seq *s)
-{
-   struct buffer_data_page field;
-
-   trace_seq_printf(s, "\tfield: u64 timestamp;\t"
-"offset:0;\tsize:%u;\tsigned:%u;\n",
-(unsigned int)sizeof(field.time_stamp),
-(unsigned int)is_signed_type(u64));
-
-   trace_seq_printf(s, "\tfield: local_t commit;\t"
-"offset:%u;\tsize:%u;\tsigned:%u;\n",
-(unsigned int)offsetof(typeof(field), commit),
-(unsigned int)sizeof(field.commit),
-(unsigned int)is_signed_type(long));
-
-   trace_seq_printf(s, "\tfield: int overwrite;\t"
-"offset:%u;\tsize:%u;\tsigned:%u;\n",
-(unsigned int)offsetof(typeof(field), commit),
-1,
-(unsigned int)is_signed_type(long));
-
-   trace_seq_printf(s, "\tfield: char data;\t"
-"offset:%u;\tsize:%u;\tsigned:%u;\n",
-(unsigned int)offsetof(typeof(field), data),
-(unsigned int)BUF_PAGE_SIZE,
-(unsigned int)is_signed_type(char));
-
-   return !trace_seq_has_overflowed(s);
-}
-
 struct rb_irq_work {
struct irq_work work;
wait_queue_head_t   waiters;
@@ -556,6 +526,36 @@ struct ring_buffer_iter {
int missed_events;
 };
 
+int ring_buffer_print_page_header(struct trace_seq *s)
+{
+   struct buffer_data_page field;
+
+   trace_seq_printf(s, "\tfield: u64 timestamp;\t"
+"offset:0;\tsize:%u;\tsigned:%u;\n",
+(unsigned int)sizeof(field.time_stamp),
+(unsigned int)is_signed_type(u64));
+
+   trace_seq_printf(s, "\tfield: local_t commit;\t"
+"offset:%u;\tsize:%u;\tsigned:%u;\n",
+(unsigned int)offsetof(typeof(field), commit),
+(unsigned int)sizeof(field.commit),
+(unsigned int)is_signed_type(long));
+
+   trace_seq_printf(s, "\tfield: int overwrite;\t"
+"offset:%u;\tsize:%u;\tsigned:%u;\n",
+(unsigned int)offsetof(typeof(field), commit),
+1,
+(unsigned int)is_signed_type(long));
+
+   trace_seq_printf(s, "\tfield: char data;\t"
+"offset:%u;\tsize:%u;\tsigned:%u;\n",
+(unsigned int)offsetof(typeof(field), data),
+(unsigned int)BUF_PAGE_SIZE,
+(unsigned int)is_signed_type(char));
+
+   return !trace_seq_has_overflowed(s);
+}
+
 static inline void rb_time_read(rb_time_t *t, u64 *ret)
 {
*ret = local64_read(&t->time);
-- 
2.42.0

[PATCH v5 00/15] ring-buffer/tracing: Allow ring buffer to have bigger sub buffers

2023-12-19 Thread Steven Rostedt



Note, this has been on my todo list since the ring buffer was created back
in 2008.

Tzvetomir last worked on this in 2021 and I need to finally get it in.

His last series was:

  
https://lore.kernel.org/linux-trace-devel/20211213094825.61876-1-tz.stoya...@gmail.com/

With the description of:

   Currently the size of one sub buffer page is global for all buffers and
   it is hard coded to one system page. The patch set introduces configurable
   ring buffer sub page size, per ring buffer. A new user space interface is
   introduced, which allows to change the sub page size of the ftrace buffer,
   per ftrace instance.

I'm pulling in his patches mostly untouched, except that I had to tweak
a few things to forward port them.

The issues I found I added as the last 7 patches to the series, and then
I added documentation and a selftest, and then changed the UI file
from buffer_subbuf_order to buffer_subbuf_size_kb.

Basically, events to the tracing subsystem are limited to just under a
PAGE_SIZE, as the ring buffer is split into "sub buffers" of one page
size, and an event can not be bigger than a sub buffer. This allows users
to change the size of a sub buffer by the order:

  echo 3 > /sys/kernel/tracing/buffer_subbuf_order

[
  last patch updates this to:

  echo 32 > /sys/kernel/tracing/buffer_subbuf_size_kb

]
  

Will make each sub buffer a size of 8 pages (32KB), allowing events to be
almost as big as 8 pages in size (sub buffers do have meta data on them as
well, keeping an event from reaching the same size as a sub buffer).

Changes since v4: 
https://lore.kernel.org/linux-trace-kernel/20231215175502.106587...@goodmis.org/

- Rebase on latest trace/core

- Fixed a kerneldoc issues reported by kernel test robot


Changes since v3: 
https://lore.kernel.org/all/20231213181737.791847...@goodmis.org/

- Rebase on trace/core:

  
https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/?h=trace/core&id=59c28cc95f0a9f5556561e2416af4ecf86184b71

Changes since v2: 
https://lore.kernel.org/all/20231213021914.361709...@goodmis.org/

- Fixed up the subbuf tests to ignore multiple spaces after before the
  'buf' string (same fix as was done for the trace_marker test).

- This time include Cc'ing linux-trace-kernel :-p

Changes since v1: 
https://lore.kernel.org/all/20231210040448.569462...@goodmis.org/

- Add the last patch that changes the ABI from a file called:

  buffer_subbuf_order  to   buffer_subbuf_size_kb

  That is, I kept the old interface the same, but then added the last
  patch that converts the interface into the one that makes more sense.
  I like keeping this in the git history, especially because of the
  way the implemantion is.

- I rebased on top of trace/core in the:

git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git

- I made the tests a bit more advanced. Still a smoke test, but it
  now checks if the string written is the same as the string read.

[PATCH v8 2/2] tracing: Allow user-space mapping of the ring-buffer

2023-12-19 Thread Vincent Donnefort

Currently, user-space extracts data from the ring-buffer via splice,
which is handy for storage or network sharing. However, due to splice
limitations, it is imposible to do real-time analysis without a copy.

A solution for that problem is to let the user-space map the ring-buffer
directly.

The mapping is exposed via the per-CPU file trace_pipe_raw. The first
element of the mapping is the meta-page. It is followed by each
subbuffer constituting the ring-buffer, ordered by their unique page ID:

  * Meta-page -- include/uapi/linux/trace_mmap.h for a description
  * Subbuf ID 0
  * Subbuf ID 1
 ...

It is therefore easy to translate a subbuf ID into an offset in the
mapping:

  reader_id = meta->reader->id;
  reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;

When new data is available, the mapper must call a newly introduced ioctl:
TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to
point to the next reader containing unread data.

Signed-off-by: Vincent Donnefort 

diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
index f950648b0ba9..8c49489c5867 100644
--- a/include/uapi/linux/trace_mmap.h
+++ b/include/uapi/linux/trace_mmap.h
@@ -26,4 +26,6 @@ struct trace_buffer_meta {
__u32   meta_struct_len;/* Len of this struct */
 };
 
+#define TRACE_MMAP_IOCTL_GET_READER_IO('T', 0x1)
+
 #endif /* _UAPI_TRACE_MMAP_H_ */
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index b35c85edbb49..8dee225cf977 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -8590,15 +8590,31 @@ tracing_buffers_splice_read(struct file *file, loff_t 
*ppos,
return ret;
 }
 
-/* An ioctl call with cmd 0 to the ring buffer file will wake up all waiters */
 static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, 
unsigned long arg)
 {
struct ftrace_buffer_info *info = file->private_data;
struct trace_iterator *iter = &info->iter;
+   int err;
 
-   if (cmd)
-   return -ENOIOCTLCMD;
+   if (cmd == TRACE_MMAP_IOCTL_GET_READER) {
+   if (!(file->f_flags & O_NONBLOCK)) {
+   err = ring_buffer_wait(iter->array_buffer->buffer,
+  iter->cpu_file,
+  iter->tr->buffer_percent);
+   if (err)
+   return err;
+   }
 
+   return ring_buffer_map_get_reader(iter->array_buffer->buffer,
+ iter->cpu_file);
+   } else if (cmd) {
+   return -ENOTTY;
+   }
+
+   /*
+* An ioctl call with cmd 0 to the ring buffer file will wake up all
+* waiters
+*/
mutex_lock(&trace_types_lock);
 
iter->wait_index++;
@@ -8611,6 +8627,62 @@ static long tracing_buffers_ioctl(struct file *file, 
unsigned int cmd, unsigned
return 0;
 }
 
+static vm_fault_t tracing_buffers_mmap_fault(struct vm_fault *vmf)
+{
+   struct ftrace_buffer_info *info = vmf->vma->vm_file->private_data;
+   struct trace_iterator *iter = &info->iter;
+   vm_fault_t ret = VM_FAULT_SIGBUS;
+   struct page *page;
+
+   page = ring_buffer_map_fault(iter->array_buffer->buffer, iter->cpu_file,
+vmf->pgoff);
+   if (!page)
+   return ret;
+
+   get_page(page);
+   vmf->page = page;
+   vmf->page->mapping = vmf->vma->vm_file->f_mapping;
+   vmf->page->index = vmf->pgoff;
+
+   return 0;
+}
+
+static void tracing_buffers_mmap_close(struct vm_area_struct *vma)
+{
+   struct ftrace_buffer_info *info = vma->vm_file->private_data;
+   struct trace_iterator *iter = &info->iter;
+
+   ring_buffer_unmap(iter->array_buffer->buffer, iter->cpu_file);
+}
+
+static void tracing_buffers_mmap_open(struct vm_area_struct *vma)
+{
+   struct ftrace_buffer_info *info = vma->vm_file->private_data;
+   struct trace_iterator *iter = &info->iter;
+
+   WARN_ON(ring_buffer_map(iter->array_buffer->buffer, iter->cpu_file));
+}
+
+static const struct vm_operations_struct tracing_buffers_vmops = {
+   .open   = tracing_buffers_mmap_open,
+   .close  = tracing_buffers_mmap_close,
+   .fault  = tracing_buffers_mmap_fault,
+};
+
+static int tracing_buffers_mmap(struct file *filp, struct vm_area_struct *vma)
+{
+   struct ftrace_buffer_info *info = filp->private_data;
+   struct trace_iterator *iter = &info->iter;
+
+   if (vma->vm_flags & VM_WRITE)
+   return -EPERM;
+
+   vm_flags_mod(vma, VM_DONTCOPY | VM_DONTDUMP, VM_MAYWRITE);
+   vma->vm_ops = &tracing_buffers_vmops;
+
+   return ring_buffer_map(iter->array_buffer->buffer, iter->cpu_file);
+}
+
 static const struct file_operations tracing_buffers_fops = {
.open   = tracing_buffers_open,
.rea

[PATCH v8 1/2] ring-buffer: Introducing ring-buffer mapping functions

2023-12-19 Thread Vincent Donnefort

In preparation for allowing the user-space to map a ring-buffer, add
a set of mapping functions:

  ring_buffer_{map,unmap}()
  ring_buffer_map_fault()

And controls on the ring-buffer:

  ring_buffer_map_get_reader()  /* swap reader and head */

Mapping the ring-buffer also involves:

  A unique ID for each subbuf of the ring-buffer, currently they are
  only identified through their in-kernel VA.

  A meta-page, where are stored ring-buffer statistics and a
  description for the current reader

The linear mapping exposes the meta-page, and each subbuf of the
ring-buffer, ordered following their unique ID, assigned during the
first mapping.

Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
size will remain unmodified and the splice enabling functions will in
reality simply memcpy the data instead of swapping subbufs.

Signed-off-by: Vincent Donnefort 

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index fa802db216f9..0841ba8bab14 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -6,6 +6,8 @@
 #include 
 #include 
 
+#include 
+
 struct trace_buffer;
 struct ring_buffer_iter;
 
@@ -221,4 +223,9 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct 
hlist_node *node);
 #define trace_rb_cpu_prepare   NULL
 #endif
 
+int ring_buffer_map(struct trace_buffer *buffer, int cpu);
+int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
+struct page *ring_buffer_map_fault(struct trace_buffer *buffer, int cpu,
+  unsigned long pgoff);
+int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
 #endif /* _LINUX_RING_BUFFER_H */
diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
new file mode 100644
index ..f950648b0ba9
--- /dev/null
+++ b/include/uapi/linux/trace_mmap.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_TRACE_MMAP_H_
+#define _UAPI_TRACE_MMAP_H_
+
+#include 
+
+struct trace_buffer_meta {
+   unsigned long   entries;
+   unsigned long   overrun;
+   unsigned long   read;
+
+   unsigned long   subbufs_touched;
+   unsigned long   subbufs_lost;
+   unsigned long   subbufs_read;
+
+   struct {
+   unsigned long   lost_events;/* Events lost at the time of 
the reader swap */
+   __u32   id; /* Reader subbuf ID from 0 to 
nr_subbufs - 1 */
+   __u32   read;   /* Number of bytes read on the 
reader subbuf */
+   } reader;
+
+   __u32   subbuf_size;/* Size of each subbuf 
including the header */
+   __u32   nr_subbufs; /* Number of subbufs in the 
ring-buffer */
+
+   __u32   meta_page_size; /* Size of the meta-page */
+   __u32   meta_struct_len;/* Len of this struct */
+};
+
+#endif /* _UAPI_TRACE_MMAP_H_ */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 9b95297339b6..ed788721e3c0 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -337,6 +337,7 @@ struct buffer_page {
local_t  entries;   /* entries on this page */
unsigned longreal_end;  /* real end of data */
unsigned order; /* order of the page */
+   u32  id;/* ID for external mapping */
struct buffer_data_page *page;  /* Actual data page */
 };
 
@@ -483,6 +484,12 @@ struct ring_buffer_per_cpu {
u64 read_stamp;
/* pages removed since last reset */
unsigned long   pages_removed;
+
+   int mapped;
+   struct mutexmapping_lock;
+   unsigned long   *subbuf_ids;/* ID to addr */
+   struct trace_buffer_meta*meta_page;
+
/* ring buffer pages to update, > 0 to add, < 0 to remove */
longnr_pages_to_update;
struct list_headnew_pages; /* new pages to add */
@@ -760,6 +767,22 @@ static __always_inline bool full_hit(struct trace_buffer 
*buffer, int cpu, int f
return (dirty * 100) > (full * nr_pages);
 }
 
+static void rb_update_meta_page(struct ring_buffer_per_cpu *cpu_buffer)
+{
+   if (unlikely(READ_ONCE(cpu_buffer->mapped))) {
+   /* Ensure the meta_page is ready */
+   smp_rmb();
+   WRITE_ONCE(cpu_buffer->meta_page->entries,
+  local_read(&cpu_buffer->entries));
+   WRITE_ONCE(cpu_buffer->meta_page->overrun,
+  local_read(&cpu_buffer->overrun));
+   WRITE_ONCE(cpu_buffer->meta_page->subbufs_touched,
+  local_read(&cpu_buffer->pages_touched));
+   WRITE_ONCE(cpu_buffer->meta_page->subbufs_lost,
+  local_read(&cpu_buf

[PATCH v8 0/2] Introducing trace buffer mapping by user-space

2023-12-19 Thread Vincent Donnefort

The tracing ring-buffers can be stored on disk or sent to network
without any copy via splice. However the later doesn't allow real time
processing of the traces. A solution is to give userspace direct access
to the ring-buffer pages via a mapping. An application can now become a
consumer of the ring-buffer, in a similar fashion to what trace_pipe
offers.

Attached to this cover letter an example of consuming read for a
ring-buffer, using libtracefs.

Vincent

v7 -> v8:
  * Drop the subbufs renaming into bpages
  * Use subbuf as a name when relevant

v6 -> v7:
  * Rebase onto lore.kernel.org/lkml/20231215175502.106587...@goodmis.org/
  * Support for subbufs
  * Rename subbufs into bpages

v5 -> v6:
  * Rebase on next-20230802.
  * (unsigned long) -> (void *) cast for virt_to_page().
  * Add a wait for the GET_READER_PAGE ioctl.
  * Move writer fields update (overrun/pages_lost/entries/pages_touched)
in the irq_work.
  * Rearrange id in struct buffer_page.
  * Rearrange the meta-page.
  * ring_buffer_meta_page -> trace_buffer_meta_page.
  * Add meta_struct_len into the meta-page.

v4 -> v5:
  * Trivial rebase onto 6.5-rc3 (previously 6.4-rc3)

v3 -> v4:
  * Add to the meta-page:
   - pages_lost / pages_read (allow to compute how full is the
 ring-buffer)
   - read (allow to compute how many entries can be read)
   - A reader_page struct.
  * Rename ring_buffer_meta_header -> ring_buffer_meta
  * Rename ring_buffer_get_reader_page -> ring_buffer_map_get_reader_page
  * Properly consume events on ring_buffer_map_get_reader_page() with
rb_advance_reader().

v2 -> v3:
  * Remove data page list (for non-consuming read)
** Implies removing order > 0 meta-page
  * Add a new meta page field ->read
  * Rename ring_buffer_meta_page_header into ring_buffer_meta_header

v1 -> v2:
  * Hide data_pages from the userspace struct
  * Fix META_PAGE_MAX_PAGES
  * Support for order > 0 meta-page
  * Add missing page->mapping.

---

/* Need to access private struct to save counters */
struct kbuffer {
unsigned long long  timestamp;
long long   lost_events;
unsigned long   flags;
void*subbuffer;
void*data;
unsigned intindex;
unsigned intcurr;
unsigned intnext;
unsigned intsize;
unsigned intstart;
unsigned intfirst;

unsigned int (*read_4)(void *ptr);
unsigned long long (*read_8)(void *ptr);
unsigned long long (*read_long)(struct kbuffer *kbuf, void *ptr);
int (*next_event)(struct kbuffer *kbuf);
};

struct trace_buffer_meta {
unsigned long   entries;
unsigned long   overrun;
unsigned long   read;

unsigned long   subbufs_touched;
unsigned long   subbufs_lost;
unsigned long   subbufs_read;

struct {
unsigned long   lost_events;/* Events lost at the time of 
the reader swap */
__u32   id; /* Reader subbuf ID from 0 to 
nr_subbufs - 1 */
__u32   read;   /* Number of bytes read on the 
reader subbuf */
} reader;

__u32   subbuf_size;
__u32   nr_subbufs; /* Number of subbufs in the 
ring-buffer */

__u32   meta_page_size; /* Size of the meta-page */
__u32   meta_struct_len;/* Len of this struct */
};

static char *argv0;
static bool exit_requested;

static char *get_this_name(void)
{
static char *this_name;
char *arg;
char *p;

if (this_name)
return this_name;

arg = argv0;
p = arg+strlen(arg);

while (p >= arg && *p != '/')
p--;
p++;

this_name = p;
return p;
}

static void __vdie(const char *fmt, va_list ap, int err)
{
int ret = errno;
char *p = get_this_name();

if (err && errno)
perror(p);
else
ret = -1;

fprintf(stderr, "  ");
vfprintf(stderr, fmt, ap);

fprintf(stderr, "\n");
exit(ret);
}

void pdie(const char *fmt, ...)
{
va_list ap;

va_start(ap, fmt);
__vdie(fmt, ap, 1);
va_end(ap);
}

static void read_subbuf(struct tep_handle *tep, struct kbuffer *kbuf)
{
static struct trace_seq seq;
struct tep_record record;

if (seq.buffer)
trace_seq_reset(&seq);
else
trace_seq_init(&seq);

while ((record.data = kbuffer_read_event(kbuf, &record.ts))) {
record.size = kbuffer_event_size(kbuf);
kbuffer_next_event(kbuf, NULL);
tep_print_event(tep, &seq, &record,
"%s-%d %9d\t%s: %s\n", TEP_PRINT_COMM,
TEP_

Re: [PATCH RFC 1/4] virtio-net: support transmit hash report

2023-12-19 Thread Michael S. Tsirkin

On Mon, Dec 18, 2023 at 12:37:08PM +0100, Steffen Trumtrar wrote:
> diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
> index cc65ef0f3c3e2..698a11f8c6ab9 100644
> --- a/include/uapi/linux/virtio_net.h
> +++ b/include/uapi/linux/virtio_net.h
> @@ -56,6 +56,7 @@
>  #define VIRTIO_NET_F_MQ  22  /* Device supports Receive Flow
>* Steering */
>  #define VIRTIO_NET_F_CTRL_MAC_ADDR 23/* Set MAC address */
> +#define VIRTIO_NET_F_TX_HASH   51/* Driver sends hash report */
>  #define VIRTIO_NET_F_VQ_NOTF_COAL 52 /* Device supports virtqueue 
> notification coalescing */
>  #define VIRTIO_NET_F_NOTF_COAL   53  /* Device supports 
> notifications coalescing */
>  #define VIRTIO_NET_F_GUEST_USO4  54  /* Guest can handle USOv4 in. */


Please make sure to register the feature bit with virtio tc
to avoid collisions.

-- 
MST

Re: [PATCH v5 4/4] vduse: Add LSM hook to check Virtio device type

2023-12-19 Thread Paul Moore

On Mon, Dec 18, 2023 at 12:21 PM Stephen Smalley
 wrote:
> On Tue, Dec 12, 2023 at 8:17 AM Maxime Coquelin
>  wrote:
> > This patch introduces a LSM hook for devices creation,
> > destruction (ioctl()) and opening (open()) operations,
> > checking the application is allowed to perform these
> > operations for the Virtio device type.
>
> Can you explain why the existing LSM hooks and SELinux implementation
> are not sufficient? We already control the ability to open device
> nodes via selinux_inode_permission() and selinux_file_open(), and can
> support fine-grained per-cmd ioctl checking via selinux_file_ioctl().
> And it should already be possible to label these nodes distinctly
> through existing mechanisms (file_contexts if udev-created/labeled,
> genfs_contexts if kernel-created). What exactly can't you do today
> that this hook enables?

I asked something similar back in the v4 patchset to see if there was
some special labeling concerns that required the use of a dedicated
hook and from what I can see there are none.

> Other items to consider:
> - If vduse devices are created using anonymous inodes, then SELinux
> grew a general facility for labeling and controlling the creation of
> those via selinux_inode_init_security_anon().

For the vduse folks, the associated LSM API function is
security_inode_init_security_anon(); please don't call into SELinux
directly.

> - You can encode information about the device into its SELinux type
> that then allows you to distinguish things like net vs block based on
> the device's SELinux security context rather than encoding that in the
> permission bits.
> - If you truly need new LSM hooks (which you need to prove first),
> then you should pass some usable information about the object in
> question to allow SELinux to find a security context for it. Like an
> inode associated with the device, for example.

I agree with Stephen and I still remain skeptical that these hooks are
needed.  Until I see a compelling case as to why the existing LSM
hooks are not sufficient I can't ACK these hooks.

-- 
paul-moore.com

[PATCH vhost v4 15/15] vdpa/mlx5: Add mkey leak detection

2023-12-19 Thread Dragos Tatulea

Track allocated mrs in a list and show warning when leaks are detected
on device free or reset.

Reviewed-by: Gal Pressman 
Acked-by: Eugenio Pérez 
Signed-off-by: Dragos Tatulea 
---
 drivers/vdpa/mlx5/core/mlx5_vdpa.h |  2 ++
 drivers/vdpa/mlx5/core/mr.c| 23 +++
 drivers/vdpa/mlx5/net/mlx5_vnet.c  |  2 ++
 3 files changed, 27 insertions(+)

diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h 
b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
index 1a0d27b6e09a..50aac8fe57ef 100644
--- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h
+++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
@@ -37,6 +37,7 @@ struct mlx5_vdpa_mr {
bool user_mr;
 
refcount_t refcount;
+   struct list_head mr_list;
 };
 
 struct mlx5_vdpa_resources {
@@ -95,6 +96,7 @@ struct mlx5_vdpa_dev {
u32 generation;
 
struct mlx5_vdpa_mr *mr[MLX5_VDPA_NUM_AS];
+   struct list_head mr_list_head;
/* serialize mr access */
struct mutex mr_mtx;
struct mlx5_control_vq cvq;
diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c
index c7dc8914354a..4758914ccf86 100644
--- a/drivers/vdpa/mlx5/core/mr.c
+++ b/drivers/vdpa/mlx5/core/mr.c
@@ -508,6 +508,8 @@ static void _mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev 
*mvdev, struct mlx5_vdpa_
 
vhost_iotlb_free(mr->iotlb);
 
+   list_del(&mr->mr_list);
+
kfree(mr);
 }
 
@@ -560,12 +562,31 @@ void mlx5_vdpa_update_mr(struct mlx5_vdpa_dev *mvdev,
mutex_unlock(&mvdev->mr_mtx);
 }
 
+static void mlx5_vdpa_show_mr_leaks(struct mlx5_vdpa_dev *mvdev)
+{
+   struct mlx5_vdpa_mr *mr;
+
+   mutex_lock(&mvdev->mr_mtx);
+
+   list_for_each_entry(mr, &mvdev->mr_list_head, mr_list) {
+
+   mlx5_vdpa_warn(mvdev, "mkey still alive after resource delete: "
+ "mr: %p, mkey: 0x%x, refcount: %u\n",
+  mr, mr->mkey, 
refcount_read(&mr->refcount));
+   }
+
+   mutex_unlock(&mvdev->mr_mtx);
+
+}
+
 void mlx5_vdpa_destroy_mr_resources(struct mlx5_vdpa_dev *mvdev)
 {
for (int i = 0; i < MLX5_VDPA_NUM_AS; i++)
mlx5_vdpa_update_mr(mvdev, NULL, i);
 
prune_iotlb(mvdev->cvq.iotlb);
+
+   mlx5_vdpa_show_mr_leaks(mvdev);
 }
 
 static int _mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev,
@@ -592,6 +613,8 @@ static int _mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev,
if (err)
goto err_iotlb;
 
+   list_add_tail(&mr->mr_list, &mvdev->mr_list_head);
+
return 0;
 
 err_iotlb:
diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index f81968b3f9cf..a783e8bd784d 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -3729,6 +3729,8 @@ static int mlx5_vdpa_dev_add(struct vdpa_mgmt_dev 
*v_mdev, const char *name,
if (err)
goto err_mpfs;
 
+   INIT_LIST_HEAD(&mvdev->mr_list_head);
+
if (MLX5_CAP_GEN(mvdev->mdev, umem_uid_0)) {
err = mlx5_vdpa_create_dma_mr(mvdev);
if (err)
-- 
2.43.0

[PATCH vhost v4 14/15] vdpa/mlx5: Introduce reference counting to mrs

2023-12-19 Thread Dragos Tatulea

Deleting the old mr during mr update (.set_map) and then modifying the
vqs with the new mr is not a good flow for firmware. The firmware
expects that mkeys are deleted after there are no more vqs referencing
them.

Introduce reference counting for mrs to fix this. It is the only way to
make sure that mkeys are not in use by vqs.

An mr reference is taken when the mr is associated to the mr asid table
and when the mr is linked to the vq on create/modify. The reference is
released when the mkey is unlinked from the vq (trough modify/destroy)
and from the mr asid table.

To make things consistent, get rid of mlx5_vdpa_destroy_mr and use
get/put semantics everywhere.

Reviewed-by: Gal Pressman 
Acked-by: Eugenio Pérez 
Signed-off-by: Dragos Tatulea 
---
 drivers/vdpa/mlx5/core/mlx5_vdpa.h |  8 +++--
 drivers/vdpa/mlx5/core/mr.c| 50 --
 drivers/vdpa/mlx5/net/mlx5_vnet.c  | 45 ++-
 3 files changed, 78 insertions(+), 25 deletions(-)

diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h 
b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
index 84547d998bcf..1a0d27b6e09a 100644
--- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h
+++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
@@ -35,6 +35,8 @@ struct mlx5_vdpa_mr {
struct vhost_iotlb *iotlb;
 
bool user_mr;
+
+   refcount_t refcount;
 };
 
 struct mlx5_vdpa_resources {
@@ -118,8 +120,10 @@ int mlx5_vdpa_destroy_mkey(struct mlx5_vdpa_dev *mvdev, 
u32 mkey);
 struct mlx5_vdpa_mr *mlx5_vdpa_create_mr(struct mlx5_vdpa_dev *mvdev,
 struct vhost_iotlb *iotlb);
 void mlx5_vdpa_destroy_mr_resources(struct mlx5_vdpa_dev *mvdev);
-void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev,
- struct mlx5_vdpa_mr *mr);
+void mlx5_vdpa_get_mr(struct mlx5_vdpa_dev *mvdev,
+ struct mlx5_vdpa_mr *mr);
+void mlx5_vdpa_put_mr(struct mlx5_vdpa_dev *mvdev,
+ struct mlx5_vdpa_mr *mr);
 void mlx5_vdpa_update_mr(struct mlx5_vdpa_dev *mvdev,
 struct mlx5_vdpa_mr *mr,
 unsigned int asid);
diff --git a/drivers/vdpa/mlx5/core/mr.c b/drivers/vdpa/mlx5/core/mr.c
index 2197c46e563a..c7dc8914354a 100644
--- a/drivers/vdpa/mlx5/core/mr.c
+++ b/drivers/vdpa/mlx5/core/mr.c
@@ -498,32 +498,52 @@ static void destroy_user_mr(struct mlx5_vdpa_dev *mvdev, 
struct mlx5_vdpa_mr *mr
 
 static void _mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev, struct 
mlx5_vdpa_mr *mr)
 {
+   if (WARN_ON(!mr))
+   return;
+
if (mr->user_mr)
destroy_user_mr(mvdev, mr);
else
destroy_dma_mr(mvdev, mr);
 
vhost_iotlb_free(mr->iotlb);
+
+   kfree(mr);
 }
 
-void mlx5_vdpa_destroy_mr(struct mlx5_vdpa_dev *mvdev,
- struct mlx5_vdpa_mr *mr)
+static void _mlx5_vdpa_put_mr(struct mlx5_vdpa_dev *mvdev,
+ struct mlx5_vdpa_mr *mr)
 {
if (!mr)
return;
 
+   if (refcount_dec_and_test(&mr->refcount))
+   _mlx5_vdpa_destroy_mr(mvdev, mr);
+}
+
+void mlx5_vdpa_put_mr(struct mlx5_vdpa_dev *mvdev,
+ struct mlx5_vdpa_mr *mr)
+{
mutex_lock(&mvdev->mr_mtx);
+   _mlx5_vdpa_put_mr(mvdev, mr);
+   mutex_unlock(&mvdev->mr_mtx);
+}
 
-   _mlx5_vdpa_destroy_mr(mvdev, mr);
+static void _mlx5_vdpa_get_mr(struct mlx5_vdpa_dev *mvdev,
+ struct mlx5_vdpa_mr *mr)
+{
+   if (!mr)
+   return;
 
-   for (int i = 0; i < MLX5_VDPA_NUM_AS; i++) {
-   if (mvdev->mr[i] == mr)
-   mvdev->mr[i] = NULL;
-   }
+   refcount_inc(&mr->refcount);
+}
 
+void mlx5_vdpa_get_mr(struct mlx5_vdpa_dev *mvdev,
+ struct mlx5_vdpa_mr *mr)
+{
+   mutex_lock(&mvdev->mr_mtx);
+   _mlx5_vdpa_get_mr(mvdev, mr);
mutex_unlock(&mvdev->mr_mtx);
-
-   kfree(mr);
 }
 
 void mlx5_vdpa_update_mr(struct mlx5_vdpa_dev *mvdev,
@@ -534,20 +554,16 @@ void mlx5_vdpa_update_mr(struct mlx5_vdpa_dev *mvdev,
 
mutex_lock(&mvdev->mr_mtx);
 
+   _mlx5_vdpa_put_mr(mvdev, old_mr);
mvdev->mr[asid] = new_mr;
-   if (old_mr) {
-   _mlx5_vdpa_destroy_mr(mvdev, old_mr);
-   kfree(old_mr);
-   }
 
mutex_unlock(&mvdev->mr_mtx);
-
 }
 
 void mlx5_vdpa_destroy_mr_resources(struct mlx5_vdpa_dev *mvdev)
 {
for (int i = 0; i < MLX5_VDPA_NUM_AS; i++)
-   mlx5_vdpa_destroy_mr(mvdev, mvdev->mr[i]);
+   mlx5_vdpa_update_mr(mvdev, NULL, i);
 
prune_iotlb(mvdev->cvq.iotlb);
 }
@@ -607,6 +623,8 @@ struct mlx5_vdpa_mr *mlx5_vdpa_create_mr(struct 
mlx5_vdpa_dev *mvdev,
if (err)
goto out_err;
 
+   refcount_set(&mr->refcount, 1);
+
return mr;
 
 out_err:
@@ -651,7 +669,7 @@ int mlx5_vdpa_reset_mr(struct mlx5_vdpa_dev *mvdev, 
unsigned int asid)
if (asid >= MLX5_VDPA_NUM

[PATCH vhost v4 13/15] vdpa/mlx5: Use vq suspend/resume during .set_map

2023-12-19 Thread Dragos Tatulea

Instead of tearing down and setting up vq resources, use vq
suspend/resume during .set_map to speed things up a bit.

The vq mr is updated with the new mapping while the vqs are suspended.

If the device doesn't support resumable vqs, do the old teardown and
setup dance.

Reviewed-by: Gal Pressman 
Acked-by: Eugenio Pérez 
Signed-off-by: Dragos Tatulea 
---
 drivers/vdpa/mlx5/net/mlx5_vnet.c  | 46 --
 include/linux/mlx5/mlx5_ifc_vdpa.h |  1 +
 2 files changed, 39 insertions(+), 8 deletions(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index b760005e2920..fcadbeede3e1 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1206,6 +1206,7 @@ static int modify_virtqueue(struct mlx5_vdpa_net *ndev,
 {
int inlen = MLX5_ST_SZ_BYTES(modify_virtio_net_q_in);
u32 out[MLX5_ST_SZ_DW(modify_virtio_net_q_out)] = {};
+   struct mlx5_vdpa_dev *mvdev = &ndev->mvdev;
bool state_change = false;
void *obj_context;
void *cmd_hdr;
@@ -1255,6 +1256,24 @@ static int modify_virtqueue(struct mlx5_vdpa_net *ndev,
if (mvq->modified_fields & MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_USED_IDX)
MLX5_SET(virtio_net_q_object, obj_context, hw_used_index, 
mvq->used_idx);
 
+   if (mvq->modified_fields & MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_MKEY) {
+   struct mlx5_vdpa_mr *mr = 
mvdev->mr[mvdev->group2asid[MLX5_VDPA_DATAVQ_GROUP]];
+
+   if (mr)
+   MLX5_SET(virtio_q, vq_ctx, virtio_q_mkey, mr->mkey);
+   else
+   mvq->modified_fields &= 
~MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_MKEY;
+   }
+
+   if (mvq->modified_fields & MLX5_VIRTQ_MODIFY_MASK_DESC_GROUP_MKEY) {
+   struct mlx5_vdpa_mr *mr = 
mvdev->mr[mvdev->group2asid[MLX5_VDPA_DATAVQ_DESC_GROUP]];
+
+   if (mr && MLX5_CAP_DEV_VDPA_EMULATION(mvdev->mdev, 
desc_group_mkey_supported))
+   MLX5_SET(virtio_q, vq_ctx, desc_group_mkey, mr->mkey);
+   else
+   mvq->modified_fields &= 
~MLX5_VIRTQ_MODIFY_MASK_DESC_GROUP_MKEY;
+   }
+
MLX5_SET64(virtio_net_q_object, obj_context, modify_field_select, 
mvq->modified_fields);
err = mlx5_cmd_exec(ndev->mvdev.mdev, in, inlen, out, sizeof(out));
if (err)
@@ -2791,24 +2810,35 @@ static int mlx5_vdpa_change_map(struct mlx5_vdpa_dev 
*mvdev,
unsigned int asid)
 {
struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev);
+   bool teardown = !is_resumable(ndev);
int err;
 
suspend_vqs(ndev);
-   err = save_channels_info(ndev);
-   if (err)
-   return err;
+   if (teardown) {
+   err = save_channels_info(ndev);
+   if (err)
+   return err;
 
-   teardown_driver(ndev);
+   teardown_driver(ndev);
+   }
 
mlx5_vdpa_update_mr(mvdev, new_mr, asid);
 
+   for (int i = 0; i < ndev->cur_num_vqs; i++)
+   ndev->vqs[i].modified_fields |= 
MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_MKEY |
+   
MLX5_VIRTQ_MODIFY_MASK_DESC_GROUP_MKEY;
+
if (!(mvdev->status & VIRTIO_CONFIG_S_DRIVER_OK) || mvdev->suspended)
return 0;
 
-   restore_channels_info(ndev);
-   err = setup_driver(mvdev);
-   if (err)
-   return err;
+   if (teardown) {
+   restore_channels_info(ndev);
+   err = setup_driver(mvdev);
+   if (err)
+   return err;
+   }
+
+   resume_vqs(ndev);
 
return 0;
 }
diff --git a/include/linux/mlx5/mlx5_ifc_vdpa.h 
b/include/linux/mlx5/mlx5_ifc_vdpa.h
index 32e712106e68..40371c916cf9 100644
--- a/include/linux/mlx5/mlx5_ifc_vdpa.h
+++ b/include/linux/mlx5/mlx5_ifc_vdpa.h
@@ -148,6 +148,7 @@ enum {
MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_ADDRS   = (u64)1 << 6,
MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_AVAIL_IDX   = (u64)1 << 7,
MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_USED_IDX= (u64)1 << 8,
+   MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_MKEY= (u64)1 << 11,
MLX5_VIRTQ_MODIFY_MASK_DESC_GROUP_MKEY  = (u64)1 << 14,
 };
 
-- 
2.43.0

[PATCH vhost v4 12/15] vdpa/mlx5: Mark vq state for modification in hw vq

2023-12-19 Thread Dragos Tatulea

.set_vq_state will set the indices and mark the fields to be modified in
the hw vq.

Advertise that the device supports changing the vq state when the device
is in DRIVER_OK state and suspended.

Reviewed-by: Gal Pressman 
Signed-off-by: Dragos Tatulea 
---
 drivers/vdpa/mlx5/net/mlx5_vnet.c  | 11 ++-
 include/linux/mlx5/mlx5_ifc_vdpa.h |  2 ++
 2 files changed, 12 insertions(+), 1 deletion(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 93812683c88c..b760005e2920 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1249,6 +1249,12 @@ static int modify_virtqueue(struct mlx5_vdpa_net *ndev,
MLX5_SET64(virtio_q, vq_ctx, available_addr, mvq->driver_addr);
}
 
+   if (mvq->modified_fields & MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_AVAIL_IDX)
+   MLX5_SET(virtio_net_q_object, obj_context, hw_available_index, 
mvq->avail_idx);
+
+   if (mvq->modified_fields & MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_USED_IDX)
+   MLX5_SET(virtio_net_q_object, obj_context, hw_used_index, 
mvq->used_idx);
+
MLX5_SET64(virtio_net_q_object, obj_context, modify_field_select, 
mvq->modified_fields);
err = mlx5_cmd_exec(ndev->mvdev.mdev, in, inlen, out, sizeof(out));
if (err)
@@ -2328,6 +2334,8 @@ static int mlx5_vdpa_set_vq_state(struct vdpa_device 
*vdev, u16 idx,
 
mvq->used_idx = state->split.avail_index;
mvq->avail_idx = state->split.avail_index;
+   mvq->modified_fields |= MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_AVAIL_IDX |
+   MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_USED_IDX;
return 0;
 }
 
@@ -2639,7 +2647,8 @@ static u64 mlx5_vdpa_get_backend_features(const struct 
vdpa_device *vdpa)
struct mlx5_vdpa_dev *mvdev = to_mvdev(vdpa);
 
if (MLX5_CAP_DEV_VDPA_EMULATION(mvdev->mdev, freeze_to_rdy_supported))
-   features |= 
BIT_ULL(VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND);
+   features |= 
BIT_ULL(VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND) |
+   
BIT_ULL(VHOST_BACKEND_F_CHANGEABLE_VQ_STATE_IN_SUSPEND);
 
return features;
 }
diff --git a/include/linux/mlx5/mlx5_ifc_vdpa.h 
b/include/linux/mlx5/mlx5_ifc_vdpa.h
index 9594ac405740..32e712106e68 100644
--- a/include/linux/mlx5/mlx5_ifc_vdpa.h
+++ b/include/linux/mlx5/mlx5_ifc_vdpa.h
@@ -146,6 +146,8 @@ enum {
MLX5_VIRTQ_MODIFY_MASK_DIRTY_BITMAP_PARAMS  = (u64)1 << 3,
MLX5_VIRTQ_MODIFY_MASK_DIRTY_BITMAP_DUMP_ENABLE = (u64)1 << 4,
MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_ADDRS   = (u64)1 << 6,
+   MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_AVAIL_IDX   = (u64)1 << 7,
+   MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_USED_IDX= (u64)1 << 8,
MLX5_VIRTQ_MODIFY_MASK_DESC_GROUP_MKEY  = (u64)1 << 14,
 };
 
-- 
2.43.0

[PATCH vhost v4 11/15] vdpa/mlx5: Mark vq addrs for modification in hw vq

2023-12-19 Thread Dragos Tatulea

Addresses get set by .set_vq_address. hw vq addresses will be updated on
next modify_virtqueue.

Advertise that the device supports changing the vq addresses when
device is in DRIVER_OK state and suspended.

Reviewed-by: Gal Pressman 
Signed-off-by: Dragos Tatulea 
---
 drivers/vdpa/mlx5/net/mlx5_vnet.c  | 17 -
 include/linux/mlx5/mlx5_ifc_vdpa.h |  1 +
 2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index f8f088cced50..93812683c88c 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1209,6 +1209,7 @@ static int modify_virtqueue(struct mlx5_vdpa_net *ndev,
bool state_change = false;
void *obj_context;
void *cmd_hdr;
+   void *vq_ctx;
void *in;
int err;
 
@@ -1230,6 +1231,7 @@ static int modify_virtqueue(struct mlx5_vdpa_net *ndev,
MLX5_SET(general_obj_in_cmd_hdr, cmd_hdr, uid, ndev->mvdev.res.uid);
 
obj_context = MLX5_ADDR_OF(modify_virtio_net_q_in, in, obj_context);
+   vq_ctx = MLX5_ADDR_OF(virtio_net_q_object, obj_context, 
virtio_q_context);
 
if (mvq->modified_fields & MLX5_VIRTQ_MODIFY_MASK_STATE) {
if (!is_valid_state_change(mvq->fw_state, state, 
is_resumable(ndev))) {
@@ -1241,6 +1243,12 @@ static int modify_virtqueue(struct mlx5_vdpa_net *ndev,
state_change = true;
}
 
+   if (mvq->modified_fields & MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_ADDRS) {
+   MLX5_SET64(virtio_q, vq_ctx, desc_addr, mvq->desc_addr);
+   MLX5_SET64(virtio_q, vq_ctx, used_addr, mvq->device_addr);
+   MLX5_SET64(virtio_q, vq_ctx, available_addr, mvq->driver_addr);
+   }
+
MLX5_SET64(virtio_net_q_object, obj_context, modify_field_select, 
mvq->modified_fields);
err = mlx5_cmd_exec(ndev->mvdev.mdev, in, inlen, out, sizeof(out));
if (err)
@@ -2202,6 +2210,7 @@ static int mlx5_vdpa_set_vq_address(struct vdpa_device 
*vdev, u16 idx, u64 desc_
mvq->desc_addr = desc_area;
mvq->device_addr = device_area;
mvq->driver_addr = driver_area;
+   mvq->modified_fields |= MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_ADDRS;
return 0;
 }
 
@@ -2626,7 +2635,13 @@ static void unregister_link_notifier(struct 
mlx5_vdpa_net *ndev)
 
 static u64 mlx5_vdpa_get_backend_features(const struct vdpa_device *vdpa)
 {
-   return BIT_ULL(VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK);
+   u64 features = BIT_ULL(VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK);
+   struct mlx5_vdpa_dev *mvdev = to_mvdev(vdpa);
+
+   if (MLX5_CAP_DEV_VDPA_EMULATION(mvdev->mdev, freeze_to_rdy_supported))
+   features |= 
BIT_ULL(VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND);
+
+   return features;
 }
 
 static int mlx5_vdpa_set_driver_features(struct vdpa_device *vdev, u64 
features)
diff --git a/include/linux/mlx5/mlx5_ifc_vdpa.h 
b/include/linux/mlx5/mlx5_ifc_vdpa.h
index b86d51a855f6..9594ac405740 100644
--- a/include/linux/mlx5/mlx5_ifc_vdpa.h
+++ b/include/linux/mlx5/mlx5_ifc_vdpa.h
@@ -145,6 +145,7 @@ enum {
MLX5_VIRTQ_MODIFY_MASK_STATE= (u64)1 << 0,
MLX5_VIRTQ_MODIFY_MASK_DIRTY_BITMAP_PARAMS  = (u64)1 << 3,
MLX5_VIRTQ_MODIFY_MASK_DIRTY_BITMAP_DUMP_ENABLE = (u64)1 << 4,
+   MLX5_VIRTQ_MODIFY_MASK_VIRTIO_Q_ADDRS   = (u64)1 << 6,
MLX5_VIRTQ_MODIFY_MASK_DESC_GROUP_MKEY  = (u64)1 << 14,
 };
 
-- 
2.43.0

[PATCH vhost v4 10/15] vdpa/mlx5: Introduce per vq and device resume

2023-12-19 Thread Dragos Tatulea

Implement vdpa vq and device resume if capability detected. Add support
for suspend -> ready state change.

Reviewed-by: Gal Pressman 
Acked-by: Eugenio Pérez 
Signed-off-by: Dragos Tatulea 
---
 drivers/vdpa/mlx5/net/mlx5_vnet.c | 69 +++
 1 file changed, 62 insertions(+), 7 deletions(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 1e08a8805640..f8f088cced50 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1170,7 +1170,12 @@ static int query_virtqueue(struct mlx5_vdpa_net *ndev, 
struct mlx5_vdpa_virtqueu
return err;
 }
 
-static bool is_valid_state_change(int oldstate, int newstate)
+static bool is_resumable(struct mlx5_vdpa_net *ndev)
+{
+   return ndev->mvdev.vdev.config->resume;
+}
+
+static bool is_valid_state_change(int oldstate, int newstate, bool resumable)
 {
switch (oldstate) {
case MLX5_VIRTIO_NET_Q_OBJECT_STATE_INIT:
@@ -1178,6 +1183,7 @@ static bool is_valid_state_change(int oldstate, int 
newstate)
case MLX5_VIRTIO_NET_Q_OBJECT_STATE_RDY:
return newstate == MLX5_VIRTIO_NET_Q_OBJECT_STATE_SUSPEND;
case MLX5_VIRTIO_NET_Q_OBJECT_STATE_SUSPEND:
+   return resumable ? newstate == 
MLX5_VIRTIO_NET_Q_OBJECT_STATE_RDY : false;
case MLX5_VIRTIO_NET_Q_OBJECT_STATE_ERR:
default:
return false;
@@ -1200,6 +1206,7 @@ static int modify_virtqueue(struct mlx5_vdpa_net *ndev,
 {
int inlen = MLX5_ST_SZ_BYTES(modify_virtio_net_q_in);
u32 out[MLX5_ST_SZ_DW(modify_virtio_net_q_out)] = {};
+   bool state_change = false;
void *obj_context;
void *cmd_hdr;
void *in;
@@ -1211,9 +1218,6 @@ static int modify_virtqueue(struct mlx5_vdpa_net *ndev,
if (!modifiable_virtqueue_fields(mvq))
return -EINVAL;
 
-   if (!is_valid_state_change(mvq->fw_state, state))
-   return -EINVAL;
-
in = kzalloc(inlen, GFP_KERNEL);
if (!in)
return -ENOMEM;
@@ -1226,17 +1230,29 @@ static int modify_virtqueue(struct mlx5_vdpa_net *ndev,
MLX5_SET(general_obj_in_cmd_hdr, cmd_hdr, uid, ndev->mvdev.res.uid);
 
obj_context = MLX5_ADDR_OF(modify_virtio_net_q_in, in, obj_context);
-   if (mvq->modified_fields & MLX5_VIRTQ_MODIFY_MASK_STATE)
+
+   if (mvq->modified_fields & MLX5_VIRTQ_MODIFY_MASK_STATE) {
+   if (!is_valid_state_change(mvq->fw_state, state, 
is_resumable(ndev))) {
+   err = -EINVAL;
+   goto done;
+   }
+
MLX5_SET(virtio_net_q_object, obj_context, state, state);
+   state_change = true;
+   }
 
MLX5_SET64(virtio_net_q_object, obj_context, modify_field_select, 
mvq->modified_fields);
err = mlx5_cmd_exec(ndev->mvdev.mdev, in, inlen, out, sizeof(out));
-   kfree(in);
-   if (!err)
+   if (err)
+   goto done;
+
+   if (state_change)
mvq->fw_state = state;
 
mvq->modified_fields = 0;
 
+done:
+   kfree(in);
return err;
 }
 
@@ -1430,6 +1446,24 @@ static void suspend_vqs(struct mlx5_vdpa_net *ndev)
suspend_vq(ndev, &ndev->vqs[i]);
 }
 
+static void resume_vq(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue 
*mvq)
+{
+   if (!mvq->initialized || !is_resumable(ndev))
+   return;
+
+   if (mvq->fw_state != MLX5_VIRTIO_NET_Q_OBJECT_STATE_SUSPEND)
+   return;
+
+   if (modify_virtqueue_state(ndev, mvq, 
MLX5_VIRTIO_NET_Q_OBJECT_STATE_RDY))
+   mlx5_vdpa_warn(&ndev->mvdev, "modify to resume failed for vq 
%u\n", mvq->index);
+}
+
+static void resume_vqs(struct mlx5_vdpa_net *ndev)
+{
+   for (int i = 0; i < ndev->mvdev.max_vqs; i++)
+   resume_vq(ndev, &ndev->vqs[i]);
+}
+
 static void teardown_vq(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue 
*mvq)
 {
if (!mvq->initialized)
@@ -3261,6 +3295,23 @@ static int mlx5_vdpa_suspend(struct vdpa_device *vdev)
return 0;
 }
 
+static int mlx5_vdpa_resume(struct vdpa_device *vdev)
+{
+   struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev);
+   struct mlx5_vdpa_net *ndev;
+
+   ndev = to_mlx5_vdpa_ndev(mvdev);
+
+   mlx5_vdpa_info(mvdev, "resuming device\n");
+
+   down_write(&ndev->reslock);
+   mvdev->suspended = false;
+   resume_vqs(ndev);
+   register_link_notifier(ndev);
+   up_write(&ndev->reslock);
+   return 0;
+}
+
 static int mlx5_set_group_asid(struct vdpa_device *vdev, u32 group,
   unsigned int asid)
 {
@@ -3317,6 +3368,7 @@ static const struct vdpa_config_ops mlx5_vdpa_ops = {
.get_vq_dma_dev = mlx5_get_vq_dma_dev,
.free = mlx5_vdpa_free,
.suspend = mlx5_vdpa_suspend,
+   .resume = mlx5_vdpa_resume, /* Op disabled if not supported. */
 };
 
 static int query_mtu(struct mlx5_

[PATCH vhost v4 08/15] vdpa: Block vq state change in DRIVER_OK unless device supports it

2023-12-19 Thread Dragos Tatulea

Virtqueue state change during DRIVE_OK is not supported by the virtio
standard. Allow this op in DRIVER_OK only for devices that support
changing the state during DRIVER_OK if the device is suspended.

Signed-off-by: Dragos Tatulea 
Suggested-by: Eugenio Pérez 
---
 drivers/vhost/vdpa.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 6bfa3391935a..77509440c723 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -641,6 +641,9 @@ static bool vhost_vdpa_vq_config_allowed(struct vhost_vdpa 
*v, unsigned int cmd)
case VHOST_SET_VRING_ADDR:
feature = VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND;
break;
+   case VHOST_SET_VRING_BASE:
+   feature = VHOST_BACKEND_F_CHANGEABLE_VQ_STATE_IN_SUSPEND;
+   break;
default:
return false;
}
@@ -737,6 +740,9 @@ static long vhost_vdpa_vring_ioctl(struct vhost_vdpa *v, 
unsigned int cmd,
break;
 
case VHOST_SET_VRING_BASE:
+   if (!vhost_vdpa_vq_config_allowed(v, cmd))
+   return -EOPNOTSUPP;
+
if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED)) {
vq_state.packed.last_avail_idx = vq->last_avail_idx & 
0x7fff;
vq_state.packed.last_avail_counter = 
!!(vq->last_avail_idx & 0x8000);
-- 
2.43.0

[PATCH vhost v4 09/15] vdpa/mlx5: Allow modifying multiple vq fields in one modify command

2023-12-19 Thread Dragos Tatulea

Add a bitmask variable that tracks hw vq field changes that
are supposed to be modified on next hw vq change command.

This will be useful to set multiple vq fields when resuming the vq.

Reviewed-by: Gal Pressman 
Acked-by: Eugenio Pérez 
Signed-off-by: Dragos Tatulea 
---
 drivers/vdpa/mlx5/net/mlx5_vnet.c | 48 +--
 1 file changed, 40 insertions(+), 8 deletions(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 26ba7da6b410..1e08a8805640 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -120,6 +120,9 @@ struct mlx5_vdpa_virtqueue {
u16 avail_idx;
u16 used_idx;
int fw_state;
+
+   u64 modified_fields;
+
struct msi_map map;
 
/* keep last in the struct */
@@ -1181,7 +1184,19 @@ static bool is_valid_state_change(int oldstate, int 
newstate)
}
 }
 
-static int modify_virtqueue(struct mlx5_vdpa_net *ndev, struct 
mlx5_vdpa_virtqueue *mvq, int state)
+static bool modifiable_virtqueue_fields(struct mlx5_vdpa_virtqueue *mvq)
+{
+   /* Only state is always modifiable */
+   if (mvq->modified_fields & ~MLX5_VIRTQ_MODIFY_MASK_STATE)
+   return mvq->fw_state == MLX5_VIRTIO_NET_Q_OBJECT_STATE_INIT ||
+  mvq->fw_state == MLX5_VIRTIO_NET_Q_OBJECT_STATE_SUSPEND;
+
+   return true;
+}
+
+static int modify_virtqueue(struct mlx5_vdpa_net *ndev,
+   struct mlx5_vdpa_virtqueue *mvq,
+   int state)
 {
int inlen = MLX5_ST_SZ_BYTES(modify_virtio_net_q_in);
u32 out[MLX5_ST_SZ_DW(modify_virtio_net_q_out)] = {};
@@ -1193,6 +1208,9 @@ static int modify_virtqueue(struct mlx5_vdpa_net *ndev, 
struct mlx5_vdpa_virtque
if (mvq->fw_state == MLX5_VIRTIO_NET_Q_OBJECT_NONE)
return 0;
 
+   if (!modifiable_virtqueue_fields(mvq))
+   return -EINVAL;
+
if (!is_valid_state_change(mvq->fw_state, state))
return -EINVAL;
 
@@ -1208,17 +1226,28 @@ static int modify_virtqueue(struct mlx5_vdpa_net *ndev, 
struct mlx5_vdpa_virtque
MLX5_SET(general_obj_in_cmd_hdr, cmd_hdr, uid, ndev->mvdev.res.uid);
 
obj_context = MLX5_ADDR_OF(modify_virtio_net_q_in, in, obj_context);
-   MLX5_SET64(virtio_net_q_object, obj_context, modify_field_select,
-  MLX5_VIRTQ_MODIFY_MASK_STATE);
-   MLX5_SET(virtio_net_q_object, obj_context, state, state);
+   if (mvq->modified_fields & MLX5_VIRTQ_MODIFY_MASK_STATE)
+   MLX5_SET(virtio_net_q_object, obj_context, state, state);
+
+   MLX5_SET64(virtio_net_q_object, obj_context, modify_field_select, 
mvq->modified_fields);
err = mlx5_cmd_exec(ndev->mvdev.mdev, in, inlen, out, sizeof(out));
kfree(in);
if (!err)
mvq->fw_state = state;
 
+   mvq->modified_fields = 0;
+
return err;
 }
 
+static int modify_virtqueue_state(struct mlx5_vdpa_net *ndev,
+ struct mlx5_vdpa_virtqueue *mvq,
+ unsigned int state)
+{
+   mvq->modified_fields |= MLX5_VIRTQ_MODIFY_MASK_STATE;
+   return modify_virtqueue(ndev, mvq, state);
+}
+
 static int counter_set_alloc(struct mlx5_vdpa_net *ndev, struct 
mlx5_vdpa_virtqueue *mvq)
 {
u32 in[MLX5_ST_SZ_DW(create_virtio_q_counters_in)] = {};
@@ -1347,7 +1376,7 @@ static int setup_vq(struct mlx5_vdpa_net *ndev, struct 
mlx5_vdpa_virtqueue *mvq)
goto err_vq;
 
if (mvq->ready) {
-   err = modify_virtqueue(ndev, mvq, 
MLX5_VIRTIO_NET_Q_OBJECT_STATE_RDY);
+   err = modify_virtqueue_state(ndev, mvq, 
MLX5_VIRTIO_NET_Q_OBJECT_STATE_RDY);
if (err) {
mlx5_vdpa_warn(&ndev->mvdev, "failed to modify to ready 
vq idx %d(%d)\n",
   idx, err);
@@ -1382,7 +1411,7 @@ static void suspend_vq(struct mlx5_vdpa_net *ndev, struct 
mlx5_vdpa_virtqueue *m
if (mvq->fw_state != MLX5_VIRTIO_NET_Q_OBJECT_STATE_RDY)
return;
 
-   if (modify_virtqueue(ndev, mvq, MLX5_VIRTIO_NET_Q_OBJECT_STATE_SUSPEND))
+   if (modify_virtqueue_state(ndev, mvq, 
MLX5_VIRTIO_NET_Q_OBJECT_STATE_SUSPEND))
mlx5_vdpa_warn(&ndev->mvdev, "modify to suspend failed\n");
 
if (query_virtqueue(ndev, mvq, &attr)) {
@@ -1407,6 +1436,7 @@ static void teardown_vq(struct mlx5_vdpa_net *ndev, 
struct mlx5_vdpa_virtqueue *
return;
 
suspend_vq(ndev, mvq);
+   mvq->modified_fields = 0;
destroy_virtqueue(ndev, mvq);
dealloc_vector(ndev, mvq);
counter_set_dealloc(ndev, mvq);
@@ -2207,7 +2237,7 @@ static void mlx5_vdpa_set_vq_ready(struct vdpa_device 
*vdev, u16 idx, bool ready
if (!ready) {
suspend_vq(ndev, mvq);
} else {
-   err = modify_virtqueue(ndev, mvq, 
MLX5_VIRTIO_NET_Q_OBJECT_STATE

[PATCH vhost v4 07/15] vdpa: Block vq address change in DRIVER_OK unless device supports it

2023-12-19 Thread Dragos Tatulea

Virtqueue address change during DRIVE_OK is not supported by the virtio
standard. Allow this op in DRIVER_OK only for devices that support
changing the address during DRIVER_OK if the device is suspended.

Signed-off-by: Dragos Tatulea 
Suggested-by: Eugenio Pérez 
---
 drivers/vhost/vdpa.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 00b4fa8e89f2..6bfa3391935a 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -625,6 +625,29 @@ static long vhost_vdpa_resume(struct vhost_vdpa *v)
return ret;
 }
 
+static bool vhost_vdpa_vq_config_allowed(struct vhost_vdpa *v, unsigned int 
cmd)
+{
+   struct vdpa_device *vdpa = v->vdpa;
+   const struct vdpa_config_ops *ops = vdpa->config;
+   int feature;
+
+   if (!(ops->get_status(vdpa) & VIRTIO_CONFIG_S_DRIVER_OK))
+   return true;
+
+   if (!v->suspended)
+   return false;
+
+   switch (cmd) {
+   case VHOST_SET_VRING_ADDR:
+   feature = VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND;
+   break;
+   default:
+   return false;
+   }
+
+   return v->vdev.vqs && vhost_backend_has_feature(v->vdev.vqs[0], 
feature);
+}
+
 static long vhost_vdpa_vring_ioctl(struct vhost_vdpa *v, unsigned int cmd,
   void __user *argp)
 {
@@ -703,6 +726,9 @@ static long vhost_vdpa_vring_ioctl(struct vhost_vdpa *v, 
unsigned int cmd,
 
switch (cmd) {
case VHOST_SET_VRING_ADDR:
+   if (!vhost_vdpa_vq_config_allowed(v, cmd))
+   return -EOPNOTSUPP;
+
if (ops->set_vq_address(vdpa, idx,
(u64)(uintptr_t)vq->desc,
(u64)(uintptr_t)vq->avail,
-- 
2.43.0

[PATCH vhost v4 05/15] vdpa: Accept VHOST_BACKEND_F_CHANGEABLE_VQ_STATE_IN_SUSPEND backend feature

2023-12-19 Thread Dragos Tatulea

If userland sets this feature, allow it.

Signed-off-by: Dragos Tatulea 
Suggested-by: Eugenio Pérez 
---
 drivers/vhost/vdpa.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 2250fcd90e5b..b4e8ddf86485 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -750,6 +750,7 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
 BIT_ULL(VHOST_BACKEND_F_SUSPEND) |
 BIT_ULL(VHOST_BACKEND_F_RESUME) |
 
BIT_ULL(VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND) |
+
BIT_ULL(VHOST_BACKEND_F_CHANGEABLE_VQ_STATE_IN_SUSPEND) |
 
BIT_ULL(VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK)))
return -EOPNOTSUPP;
if ((features & BIT_ULL(VHOST_BACKEND_F_SUSPEND)) &&
-- 
2.43.0

[PATCH vhost v4 06/15] vdpa: Track device suspended state

2023-12-19 Thread Dragos Tatulea

Set vdpa device suspended state on successful suspend. Clear it on
successful resume and reset.

The state will be locked by the vhost_vdpa mutex. The mutex is taken
during suspend, resume and reset in vhost_vdpa_unlocked_ioctl. The
exception is vhost_vdpa_open which does a device reset but that should
be safe because it can only happen before the other ops.

Signed-off-by: Dragos Tatulea 
Suggested-by: Eugenio Pérez 
---
 drivers/vhost/vdpa.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index b4e8ddf86485..00b4fa8e89f2 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -59,6 +59,7 @@ struct vhost_vdpa {
int in_batch;
struct vdpa_iova_range range;
u32 batch_asid;
+   bool suspended;
 };
 
 static DEFINE_IDA(vhost_vdpa_ida);
@@ -232,6 +233,8 @@ static int _compat_vdpa_reset(struct vhost_vdpa *v)
struct vdpa_device *vdpa = v->vdpa;
u32 flags = 0;
 
+   v->suspended = false;
+
if (v->vdev.vqs) {
flags |= !vhost_backend_has_feature(v->vdev.vqs[0],

VHOST_BACKEND_F_IOTLB_PERSIST) ?
@@ -590,11 +593,16 @@ static long vhost_vdpa_suspend(struct vhost_vdpa *v)
 {
struct vdpa_device *vdpa = v->vdpa;
const struct vdpa_config_ops *ops = vdpa->config;
+   int ret;
 
if (!ops->suspend)
return -EOPNOTSUPP;
 
-   return ops->suspend(vdpa);
+   ret = ops->suspend(vdpa);
+   if (!ret)
+   v->suspended = true;
+
+   return ret;
 }
 
 /* After a successful return of this ioctl the device resumes processing
@@ -605,11 +613,16 @@ static long vhost_vdpa_resume(struct vhost_vdpa *v)
 {
struct vdpa_device *vdpa = v->vdpa;
const struct vdpa_config_ops *ops = vdpa->config;
+   int ret;
 
if (!ops->resume)
return -EOPNOTSUPP;
 
-   return ops->resume(vdpa);
+   ret = ops->resume(vdpa);
+   if (!ret)
+   v->suspended = false;
+
+   return ret;
 }
 
 static long vhost_vdpa_vring_ioctl(struct vhost_vdpa *v, unsigned int cmd,
-- 
2.43.0

[PATCH vhost v4 04/15] vdpa: Accept VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND backend feature

2023-12-19 Thread Dragos Tatulea

If userland sets this feature, allow it.

Signed-off-by: Dragos Tatulea 
Suggested-by: Eugenio Pérez 
---
 drivers/vhost/vdpa.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index da7ec77cdaff..2250fcd90e5b 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -749,6 +749,7 @@ static long vhost_vdpa_unlocked_ioctl(struct file *filep,
 BIT_ULL(VHOST_BACKEND_F_IOTLB_PERSIST) |
 BIT_ULL(VHOST_BACKEND_F_SUSPEND) |
 BIT_ULL(VHOST_BACKEND_F_RESUME) |
+
BIT_ULL(VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND) |
 
BIT_ULL(VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK)))
return -EOPNOTSUPP;
if ((features & BIT_ULL(VHOST_BACKEND_F_SUSPEND)) &&
-- 
2.43.0

[PATCH vhost v4 03/15] vdpa: Add VHOST_BACKEND_F_CHANGEABLE_VQ_STATE_IN_SUSPEND flag

2023-12-19 Thread Dragos Tatulea

The virtio spec doesn't allow changing virtqueue state after
DRIVER_OK. Some devices do support this operation when the device is
suspended. The VHOST_BACKEND_F_CHANGEABLE_VQ_STATE_IN_SUSPEND flag
advertises this support as a backend features.

Signed-off-by: Dragos Tatulea 
Suggested-by: Eugenio Pérez 
---
 include/uapi/linux/vhost_types.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
index aacd067afc89..848dadc95ca1 100644
--- a/include/uapi/linux/vhost_types.h
+++ b/include/uapi/linux/vhost_types.h
@@ -196,5 +196,9 @@ struct vhost_vdpa_iova_range {
  * and is in state DRIVER_OK.
  */
 #define VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND  0x9
+/* Device supports changing virtqueue state when device is suspended
+ * and is in state DRIVER_OK.
+ */
+#define VHOST_BACKEND_F_CHANGEABLE_VQ_STATE_IN_SUSPEND  0x10
 
 #endif
-- 
2.43.0

[PATCH vhost v4 02/15] vdpa: Add VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND flag

2023-12-19 Thread Dragos Tatulea

The virtio spec doesn't allow changing virtqueue addresses after
DRIVER_OK. Some devices do support this operation when the device is
suspended. The VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND flag
advertises this support as a backend features.

Signed-off-by: Dragos Tatulea 
Suggested-by: Eugenio Pérez 
---
 include/uapi/linux/vhost_types.h | 4 
 1 file changed, 4 insertions(+)

diff --git a/include/uapi/linux/vhost_types.h b/include/uapi/linux/vhost_types.h
index d7656908f730..aacd067afc89 100644
--- a/include/uapi/linux/vhost_types.h
+++ b/include/uapi/linux/vhost_types.h
@@ -192,5 +192,9 @@ struct vhost_vdpa_iova_range {
 #define VHOST_BACKEND_F_DESC_ASID0x7
 /* IOTLB don't flush memory mapping across device reset */
 #define VHOST_BACKEND_F_IOTLB_PERSIST  0x8
+/* Device supports changing virtqueue addresses when device is suspended
+ * and is in state DRIVER_OK.
+ */
+#define VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND  0x9
 
 #endif
-- 
2.43.0

[PATCH mlx5-vhost v4 01/15] vdpa/mlx5: Expose resumable vq capability

2023-12-19 Thread Dragos Tatulea

Necessary for checking if resumable vqs are supported by the hardware.
Actual support will be added in a downstream patch.

Reviewed-by: Gal Pressman 
Acked-by: Eugenio Pérez 
Signed-off-by: Dragos Tatulea 
---
 include/linux/mlx5/mlx5_ifc.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index 6f3631425f38..9eaceaf6bcb0 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -1236,7 +1236,8 @@ struct mlx5_ifc_virtio_emulation_cap_bits {
 
u8 reserved_at_c0[0x13];
u8 desc_group_mkey_supported[0x1];
-   u8 reserved_at_d4[0xc];
+   u8 freeze_to_rdy_supported[0x1];
+   u8 reserved_at_d5[0xb];
 
u8 reserved_at_e0[0x20];
 
-- 
2.43.0

[PATCH vhost v4 00/15] vdpa/mlx5: Add support for resumable vqs

2023-12-19 Thread Dragos Tatulea

Add support for resumable vqs in the mlx5_vdpa driver. This is a
firmware feature that can be used for the following benefits:
- Full device .suspend/.resume.
- .set_map doesn't need to destroy and create new vqs anymore just to
  update the map. When resumable vqs are supported it is enough to
  suspend the vqs, set the new maps, and then resume the vqs.

The first patch exposes relevant bits for the feature in mlx5_ifc.h.
That means it needs to be applied to the mlx5-vhost tree [0] first. Once
applied there, the change has to be pulled from mlx5-vhost into the
vhost tree and only then the remaining patches can be applied. Same flow
as the vq descriptor mappings patchset [1].

The second part implements the vdpa backend feature support to allow
vq state and address changes when the device is in DRIVER_OK state and
suspended.

The third part adds support for seletively modifying vq parameters. This
is needed to be able to use resumable vqs.

Then the actual support for resumable vqs is added.

The last part of the series introduces reference counting for mrs which
is necessary to avoid freeing mkeys too early or leaking them.

* Changes in v4:
- Added vdpa backend feature support for changing vq properties in
  DRIVER_OK when device is suspended. Added support in the driver as
  well.
- Dropped Acked-by for the patches that had the tag mistakenly
  added.

* Changes in v3:
- Faulty version. Please ignore.

* Changes in v2:
- Added mr refcounting patches.
- Deleted unnecessary patch: "vdpa/mlx5: Split function into locked and
  unlocked variants"
- Small print improvement in "Introduce per vq and device resume"
  patch.
- Patch 1/7 has been applied to mlx5-vhost branch.


Dragos Tatulea (15):
  vdpa: Add VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND flag
  vdpa: Add VHOST_BACKEND_F_CHANGEABLE_VQ_STATE_IN_SUSPEND flag
  vdpa: Accept VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND backend
feature
  vdpa: Accept VHOST_BACKEND_F_CHANGEABLE_VQ_STATE_IN_SUSPEND backend
feature
  vdpa: Track device suspended state
  vdpa: Block vq address change in DRIVER_OK unless device supports it
  vdpa: Block vq state change in DRIVER_OK unless device supports it
  vdpa/mlx5: Expose resumable vq capability
  vdpa/mlx5: Allow modifying multiple vq fields in one modify command
  vdpa/mlx5: Introduce per vq and device resume
  vdpa/mlx5: Mark vq addrs for modification in hw vq
  vdpa/mlx5: Mark vq state for modification in hw vq
  vdpa/mlx5: Use vq suspend/resume during .set_map
  vdpa/mlx5: Introduce reference counting to mrs
  vdpa/mlx5: Add mkey leak detection

 drivers/vdpa/mlx5/core/mlx5_vdpa.h |  10 +-
 drivers/vdpa/mlx5/core/mr.c|  69 +++--
 drivers/vdpa/mlx5/net/mlx5_vnet.c  | 218 ++---
 drivers/vhost/vdpa.c   |  51 ++-
 include/linux/mlx5/mlx5_ifc.h  |   3 +-
 include/linux/mlx5/mlx5_ifc_vdpa.h |   4 +
 include/uapi/linux/vhost_types.h   |   8 ++
 7 files changed, 322 insertions(+), 41 deletions(-)

-- 
2.43.0

Re: [PATCH] ring-buffer: Fix slowpath of interrupted event

2023-12-19 Thread Steven Rostedt

On Tue, 19 Dec 2023 10:36:13 -0500
Steven Rostedt  wrote:

>  |-- interrupt event --|-- normal context event --|-- interrupt event --|
> 
>  ^^ ^
>  || |
> ts is before before_stamp | |
>our before_stamp |
> absolute value
> 
> We just need to make our delta not go beyond the absolute value. So:
> 
>   ts of first event + (absolute value - our before_stamp)
> 
> Should not be greater than the absolute value.

Hmm, this isn't good enough either. Because we could have had the interrupt
happen *after* the before_stamp update, and leave the write_stamp not equal
to the before_stamp (like the original patch is fixing).

That is, in the case of finding an absolute value, there's still no way to
know what delta to use without walking the sub-buffer. Hmm, since this is
so rare, we could do that :-/

Anyway, for now, it's just going to be delta = 0, and we could do the
sub-buffer walk if it becomes an issue (which it hasn't for the last decade
or more).

-- Steve

Re: [PATCH v5 06/34] function_graph: Allow multiple users to attach to function graph

2023-12-19 Thread Google

On Tue, 19 Dec 2023 14:23:37 +0100
Jiri Olsa  wrote:

> On Mon, Dec 18, 2023 at 10:12:45PM +0900, Masami Hiramatsu (Google) wrote:
> 
> SNIP
> 
> >  /* Both enabled by default (can be cleared by function_graph tracer flags 
> > */
> >  static bool fgraph_sleep_time = true;
> >  
> > @@ -126,9 +247,34 @@ ftrace_push_return_trace(unsigned long ret, unsigned 
> > long func,
> > calltime = trace_clock_local();
> >  
> > index = current->curr_ret_stack;
> > -   RET_STACK_INC(current->curr_ret_stack);
> > +   /* ret offset = 1 ; type = reserved */
> > +   current->ret_stack[index + FGRAPH_RET_INDEX] = 1;
> > ret_stack = RET_STACK(current, index);
> > +   ret_stack->ret = ret;
> > +   /*
> > +* The unwinders expect curr_ret_stack to point to either zero
> > +* or an index where to find the next ret_stack. Even though the
> > +* ret stack might be bogus, we want to write the ret and the
> > +* index to find the ret_stack before we increment the stack point.
> > +* If an interrupt comes in now before we increment the curr_ret_stack
> > +* it may blow away what we wrote. But that's fine, because the
> > +* index will still be correct (even though the 'ret' won't be).
> > +* What we worry about is the index being correct after we increment
> > +* the curr_ret_stack and before we update that index, as if an
> > +* interrupt comes in and does an unwind stack dump, it will need
> > +* at least a correct index!
> > +*/
> > barrier();
> > +   current->curr_ret_stack += FGRAPH_RET_INDEX + 1;
> > +   /*
> > +* This next barrier is to ensure that an interrupt coming in
> > +* will not corrupt what we are about to write.
> > +*/
> > +   barrier();
> > +
> > +   /* Still keep it reserved even if an interrupt came in */
> > +   current->ret_stack[index + FGRAPH_RET_INDEX] = 1;
> 
> seems like this was set already few lines above?

Yeah, there is a trick (trap) for interrupts between writes. We can not do
atomically write the last stack entry and increment stack index. But it must
be done for shadow unwinding insinde interrupts. Thus,

(1) write a reserve type entry on the new stack entry
(2) increment curr_ret_stack to the new stack entry
(3) rewrite the new stack entry again

If an interrupt happens between (1) and (2), stack unwinder can find the
correct latest shadow stack frame from the curr_ret_stack. This interrupts
can store their shadow stack so... wait something went wrong.

If the interrupt *overwrites* the shadow stack and (3) recovers it,
if another interrupt before (3), the shadow stack will be corrupted...

OK, I think we need a "rsrv_ret_stack" index. Basically new one will do;

(1) increment rsrv_ret_stack
(2) write a reserve type entry
(3) set curr_ret_stack = rsrv_ret_stack

And before those,

(0) if rsrv_ret_stack != curr_ret_stack, write a reserve type entry at
rsrv_ret_stack for the previous frame (which offset can be read
from curr_ret_stack)

Than it will never be broken.
(of course when decrement curr_ret_stack, rsrv_ret_stack is also decremented)

Thank you,

> 
> jirka
> 
> > +
> > ret_stack->ret = ret;
> > ret_stack->func = func;
> > ret_stack->calltime = calltime;
> > @@ -159,6 +305,12 @@ int function_graph_enter(unsigned long ret, unsigned 
> > long func,
> >  unsigned long frame_pointer, unsigned long *retp)
> >  {
> > struct ftrace_graph_ent trace;
> > +   int offset;
> > +   int start;
> > +   int type;
> > +   int val;
> > +   int cnt = 0;
> > +   int i;
> >  
> >  #ifndef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS
> > /*
> 
> SNIP
> 


-- 
Masami Hiramatsu (Google)

Re: [PATCH v2] virtio_balloon: stay awake while adjusting balloon

2023-12-19 Thread David Hildenbrand


On 19.12.23 15:37, David Stevens wrote:

On Mon, Dec 18, 2023 at 12:33 PM David Hildenbrand  wrote:


On 18.12.23 16:18, David Stevens wrote:

From: David Stevens 

A virtio_balloon's parent device may be configured so that a
configuration change interrupt is a wakeup event. Extend the processing
of such a wakeup event until the balloon finishes inflating or deflating
by calling pm_stay_awake/pm_relax in the virtio_balloon driver. Note
that these calls are no-ops if the parent device doesn't support wakeup
events or if the wakeup events are not enabled.

This change allows the guest to use system power states such as s2idle
without running the risk the virtio_balloon's cooperative memory
management becoming unresponsive to the host's requests.

Signed-off-by: David Stevens 
---
v1 -> v2:
   - Use adjustment_signal_pending flag instead of a sequence number
   - Call pm_stay_awake/pm_relax on parent device instead of adding a wake
 event to the virtio balloon device

   drivers/virtio/virtio_balloon.c | 57 +++--
   1 file changed, 47 insertions(+), 10 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 1fe93e93f5bc..a3c11159cbe0 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -119,6 +119,11 @@ struct virtio_balloon {
   /* Free page reporting device */
   struct virtqueue *reporting_vq;
   struct page_reporting_dev_info pr_dev_info;
+
+ /* State for keeping the wakeup_source active while adjusting the balloon 
*/
+ spinlock_t adjustment_lock;
+ bool adjustment_signal_pending;
+ bool adjustment_in_progress;
   };

   static const struct virtio_device_id id_table[] = {
@@ -437,6 +442,31 @@ static void virtio_balloon_queue_free_page_work(struct 
virtio_balloon *vb)
   queue_work(vb->balloon_wq, &vb->report_free_page_work);
   }

+static void start_update_balloon_size(struct virtio_balloon *vb)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&vb->adjustment_lock, flags);
+ vb->adjustment_signal_pending = true;
+ if (!vb->adjustment_in_progress) {
+ vb->adjustment_in_progress = true;
+ pm_stay_awake(vb->vdev->dev.parent);
+ }
+ spin_unlock_irqrestore(&vb->adjustment_lock, flags);
+
+ queue_work(system_freezable_wq, &vb->update_balloon_size_work);
+}
+
+static void end_update_balloon_size(struct virtio_balloon *vb)
+{
+ spin_lock(&vb->adjustment_lock);
+ if (!vb->adjustment_signal_pending && vb->adjustment_in_progress) {


How could vb->adjustment_in_progress ever not be set at this point?


+ vb->adjustment_in_progress = false;
+ pm_relax(vb->vdev->dev.parent);
+ }
+ spin_unlock(&vb->adjustment_lock);
+}
+


LGTM, although I wonder what happens when calling pm_stay_awake() etc.
on a parent device that is not wakeup-even-capable?


If the parent device is not wakeup capable or if wakeup isn't enabled,
then the vb->vdev->dev.parent->power.wakeup pointer will be NULL, so
the NULL checks in __pm_relax/__pm_stay_awake will immediately return.


Ah, I missed that NULL check, makes sense. Thanks!

--
Cheers,

David / dhildenb

Re: [PATCH] ring-buffer: Fix slowpath of interrupted event

2023-12-19 Thread Steven Rostedt

On Tue, 19 Dec 2023 10:10:27 -0500
Steven Rostedt  wrote:

>  1000 - interrupt event
>  2000 - normal context event
>  2100 - next normal context event
> 
> Where we see the delta between the interrupt event and the normal context
> event was 1000. But if we just had it be delta = 0, it would be:
> 
>  1000 - interrupt event
>  1000 - normal context event
>  2100 - next normal context event
> 

Also, let me show the rare case of where we do have delta=0. That would be
an interrupt event coming in before *and* after the normal context event
was allocated. That is, we have:

 |-- interrupt event --|-- normal context event --|-- interrupt event --|

And that could look like:

  1000 - interrupt event
  1000 - normal context event
  2100 - next interrupt event

It may look like the normal context ran for 1000 before the next interrupt
event, but in reality it happened sometime between the two.

Hmm, in this case, what we could possibly do is to read the absolute
timestamp *after* the interrupted event!

That is, we detected we are here:

 |-- interrupt event --|-- normal context event --|-- interrupt event --|

   ^
   |
   |

And we know that the write pointer is past this event. We can look at the
next event and see what its timestamp is. It is either a delta or a
absolute timestamp. It is a delta if the previous interrupt came in between
B and C, as that event would have made before_stamp == write_stamp. Or it
is an absolute delta if this event updated the before_stamp making it no
longer match the write_stamp of the previous event.

If it is an absolute timestamp, it means the interrupt came in before our
update of the before stamp, and we could make our delta:

absolute value of next timestamp - this event before_stamp

As we have:

 |-- interrupt event --|-- normal context event --|-- interrupt event --|

 ^^ ^
 || |
ts is before before_stamp | |
   our before_stamp |
absolute value

We just need to make our delta not go beyond the absolute value. So:

  ts of first event + (absolute value - our before_stamp)

Should not be greater than the absolute value.

If the next event has a delta, we could make this event delta equal to it,
and then update the next event to have a delta of zero, which would give us:

  1000 - interrupt event
  2100 - normal context event
  2100 - next interrupt event

Which is much more realistic to what happened.

But all this is for another time ;-)

-- Steve

Re: [PATCH v6 2/4] dax/bus: Use guard(device) in sysfs attribute helpers

2023-12-19 Thread Jonathan Cameron

On Thu, 14 Dec 2023 22:25:27 -0700
Vishal Verma  wrote:

> Use the guard(device) macro to lock a 'struct device', and unlock it
> automatically when going out of scope using Scope Based Resource
> Management semantics. A lot of the sysfs attribute writes in
> drivers/dax/bus.c benefit from a cleanup using these, so change these
> where applicable.
> 
> Cc: Joao Martins 
> Cc: Dan Williams 
> Signed-off-by: Vishal Verma 
Hi Vishal,

A few really minor suggestions inline if you happen to be doing a v7.
Either way
Reviewed-by: Jonathan Cameron 

>  
> @@ -481,12 +466,9 @@ static int __free_dev_dax_id(struct dev_dax *dev_dax)
>  static int free_dev_dax_id(struct dev_dax *dev_dax)
>  {
>   struct device *dev = &dev_dax->dev;
> - int rc;
>  
> - device_lock(dev);
> - rc = __free_dev_dax_id(dev_dax);
> - device_unlock(dev);
> - return rc;
> + guard(device)(dev);

guard(device)(&dev_dax->dev); /* Only one user now */

> + return __free_dev_dax_id(dev_dax);
>  }
>  
>  static int alloc_dev_dax_id(struct dev_dax *dev_dax)
> @@ -908,9 +890,8 @@ static ssize_t size_show(struct device *dev,
>   struct dev_dax *dev_dax = to_dev_dax(dev);
>   unsigned long long size;
>  
> - device_lock(dev);
> + guard(device)(dev);
>   size = dev_dax_size(dev_dax);
> - device_unlock(dev);
>  
>   return sprintf(buf, "%llu\n", size);
Might as well make this

guard(device)(dev);
return sprintf(buf, "%llu\n", dev_dax_size(to_dev_dax(dev));

>  }

Re: [PATCH vhost v2 4/8] vdpa/mlx5: Mark vq addrs for modification in hw vq

2023-12-19 Thread Dragos Tatulea

On Tue, 2023-12-19 at 15:02 +0100, Eugenio Perez Martin wrote:
> On Tue, Dec 19, 2023 at 12:16 PM Dragos Tatulea  wrote:
> > 
> > On Tue, 2023-12-19 at 08:24 +0100, Eugenio Perez Martin wrote:
> > > On Mon, Dec 18, 2023 at 2:58 PM Dragos Tatulea  
> > > wrote:
> > > > 
> > > > On Mon, 2023-12-18 at 13:06 +0100, Eugenio Perez Martin wrote:
> > > > > On Mon, Dec 18, 2023 at 11:52 AM Dragos Tatulea  
> > > > > wrote:
> > > > > > 
> > > > > > On Mon, 2023-12-18 at 11:16 +0100, Eugenio Perez Martin wrote:
> > > > > > > On Sat, Dec 16, 2023 at 12:03 PM Dragos Tatulea 
> > > > > > >  wrote:
> > > > > > > > 
> > > > > > > > On Fri, 2023-12-15 at 18:56 +0100, Eugenio Perez Martin wrote:
> > > > > > > > > On Fri, Dec 15, 2023 at 3:13 PM Dragos Tatulea 
> > > > > > > > >  wrote:
> > > > > > > > > > 
> > > > > > > > > > On Fri, 2023-12-15 at 12:35 +, Dragos Tatulea wrote:
> > > > > > > > > > > On Thu, 2023-12-14 at 19:30 +0100, Eugenio Perez Martin 
> > > > > > > > > > > wrote:
> > > > > > > > > > > > On Thu, Dec 14, 2023 at 4:51 PM Dragos Tatulea 
> > > > > > > > > > > >  wrote:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > On Thu, 2023-12-14 at 08:45 -0500, Michael S. Tsirkin 
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > On Thu, Dec 14, 2023 at 01:39:55PM +, Dragos 
> > > > > > > > > > > > > > Tatulea wrote:
> > > > > > > > > > > > > > > On Tue, 2023-12-12 at 15:44 -0800, Si-Wei Liu 
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > On 12/12/2023 11:21 AM, Eugenio Perez Martin 
> > > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > > On Tue, Dec 5, 2023 at 11:46 AM Dragos 
> > > > > > > > > > > > > > > > > Tatulea  wrote:
> > > > > > > > > > > > > > > > > > Addresses get set by .set_vq_address. hw vq 
> > > > > > > > > > > > > > > > > > addresses will be updated on
> > > > > > > > > > > > > > > > > > next modify_virtqueue.
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > Signed-off-by: Dragos Tatulea 
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > > Reviewed-by: Gal Pressman 
> > > > > > > > > > > > > > > > > > Acked-by: Eugenio Pérez 
> > > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > I'm kind of ok with this patch and the next 
> > > > > > > > > > > > > > > > > one about state, but I
> > > > > > > > > > > > > > > > > didn't ack them in the previous series.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > My main concern is that it is not valid to 
> > > > > > > > > > > > > > > > > change the vq address after
> > > > > > > > > > > > > > > > > DRIVER_OK in VirtIO, which vDPA follows. Only 
> > > > > > > > > > > > > > > > > memory maps are ok to
> > > > > > > > > > > > > > > > > change at this moment. I'm not sure about vq 
> > > > > > > > > > > > > > > > > state in vDPA, but vhost
> > > > > > > > > > > > > > > > > forbids changing it with an active backend.
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Suspend is not defined in VirtIO at this 
> > > > > > > > > > > > > > > > > moment though, so maybe it is
> > > > > > > > > > > > > > > > > ok to decide that all of these parameters may 
> > > > > > > > > > > > > > > > > change during suspend.
> > > > > > > > > > > > > > > > > Maybe the best thing is to protect this with 
> > > > > > > > > > > > > > > > > a vDPA feature flag.
> > > > > > > > > > > > > > > > I think protect with vDPA feature flag could 
> > > > > > > > > > > > > > > > work, while on the other
> > > > > > > > > > > > > > > > hand vDPA means vendor specific optimization is 
> > > > > > > > > > > > > > > > possible around suspend
> > > > > > > > > > > > > > > > and resume (in case it helps performance), 
> > > > > > > > > > > > > > > > which doesn't have to be
> > > > > > > > > > > > > > > > backed by virtio spec. Same applies to vhost 
> > > > > > > > > > > > > > > > user backend features,
> > > > > > > > > > > > > > > > variations there were not backed by spec 
> > > > > > > > > > > > > > > > either. Of course, we should
> > > > > > > > > > > > > > > > try best to make the default behavior backward 
> > > > > > > > > > > > > > > > compatible with
> > > > > > > > > > > > > > > > virtio-based backend, but that circles back to 
> > > > > > > > > > > > > > > > no suspend definition in
> > > > > > > > > > > > > > > > the current virtio spec, for which I hope we 
> > > > > > > > > > > > > > > > don't cease development on
> > > > > > > > > > > > > > > > vDPA indefinitely. After all, the virtio based 
> > > > > > > > > > > > > > > > vdap backend can well
> > > > > > > > > > > > > > > > define its own feature flag to describe (minor 
> > > > > > > > > > > > > > > > difference in) the
> > > > > > > > > > > > > > > > suspend behavior based on the later spec once 
> > > > > > > > > > > > > > > > it is formed in future.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > So what is the way forward h

Re: [PATCH] ring-buffer: Fix slowpath of interrupted event

2023-12-19 Thread Steven Rostedt

On Tue, 19 Dec 2023 23:37:10 +0900
Masami Hiramatsu (Google)  wrote:

> Yeah the above works, but my question is, do we really need this
> really slow path? I mean;
> 
> > if (w == write - event length) {
> > /* Nothing interrupted between A and C */
> >  /*E*/  write_stamp = ts;
> > delta = ts - after  
> 
>   } else {
>   /*
> Something interrupted between A and C, which should record
> a new entry before this reserved entry with newer timestamp.
> we reuse it.
>*/
>   ts = after = write_stamp;
>   delta = 0;
>   }
> 
> Isn't this enough?

I really like to avoid: delta = 0 when possible. It's basically what I do
when I have no other options. Why?

Because let's just say you are looking at the time of interrupt events. If
you just trace the entry of the interrupt, and that interrupt interrupted
an event being written to, we have this:

Time starts at ts 1000, and we are able to calculate the delta of the
interrupted event. And the trace will have:

 1000 - interrupt event
 2000 - normal context event
 2100 - next normal context event

Where we see the delta between the interrupt event and the normal context
event was 1000. But if we just had it be delta = 0, it would be:

 1000 - interrupt event
 1000 - normal context event
 2100 - next normal context event

It will look like the time between the interrupt event and the normal
context event was instant, and the time between the normal context event
and the next normal context event was 1100 when in reality it was just 100.

The above scenario is rather common. Perhaps it happens 1% of the time. The
case where we currently have delta = 0 only happens when the same event
gets interrupted twice. That is, two separate interrupts came in, one
before it allocated its space on the buffer, and one after it allocated.
That's a much more race occurrence (0.01% possibly, or less). In fact my
traces seldom even show it. Most of the time, even when doing function
tracing, I have a hard time seeing this rare race.

So, if we can have delta=0 only 0.01% or less of the time, I rather do that
then have it be delta=0 1% of the time.

Thanks for the review!

-- Steve

Re: [PATCH v5 28/34] fprobe: Rewrite fprobe on function-graph tracer

2023-12-19 Thread Jiri Olsa

On Mon, Dec 18, 2023 at 10:17:10PM +0900, Masami Hiramatsu (Google) wrote:

SNIP

> -static void fprobe_exit_handler(struct rethook_node *rh, void *data,
> - unsigned long ret_ip, struct pt_regs *regs)
> +static int fprobe_entry(unsigned long func, unsigned long ret_ip,
> + struct ftrace_regs *fregs, struct fgraph_ops *gops)
>  {
> - struct fprobe *fp = (struct fprobe *)data;
> - struct fprobe_rethook_node *fpr;
> - struct ftrace_regs *fregs = (struct ftrace_regs *)regs;
> - int bit;
> + struct fprobe_hlist_node *node, *first;
> + unsigned long *fgraph_data = NULL;
> + unsigned long header;
> + int reserved_words;
> + struct fprobe *fp;
> + int used, ret;
>  
> - if (!fp || fprobe_disabled(fp))
> - return;
> + if (WARN_ON_ONCE(!fregs))
> + return 0;
>  
> - fpr = container_of(rh, struct fprobe_rethook_node, node);
> + first = node = find_first_fprobe_node(func);
> + if (unlikely(!first))
> + return 0;
> +
> + reserved_words = 0;
> + hlist_for_each_entry_from_rcu(node, hlist) {
> + if (node->addr != func)
> + break;
> + fp = READ_ONCE(node->fp);
> + if (!fp || !fp->exit_handler)
> + continue;
> + /*
> +  * Since fprobe can be enabled until the next loop, we ignore 
> the
> +  * fprobe's disabled flag in this loop.
> +  */
> + reserved_words +=
> + DIV_ROUND_UP(fp->entry_data_size, sizeof(long)) + 1;
> + }
> + node = first;
> + if (reserved_words) {
> + fgraph_data = fgraph_reserve_data(gops->idx, reserved_words * 
> sizeof(long));
> + if (unlikely(!fgraph_data)) {
> + hlist_for_each_entry_from_rcu(node, hlist) {
> + if (node->addr != func)
> + break;
> + fp = READ_ONCE(node->fp);
> + if (fp && !fprobe_disabled(fp))
> + fp->nmissed++;
> + }
> + return 0;
> + }
> + }

this looks expensive compared to what we do now.. IIUC due to the graph
ops limitations (16 ctive ops), you have just single graph ops for fprobe
and each fprobe registration stores ips into hash which you need to search
in here to get registered callbacks..?

I wonder would it make sense to allow arbitrary number of active graph_ops
with the price some callback might fail because there's no stack space so
each fprobe instance would have its own graph_ops.. and we would get rid
of the code above (and below) ?

jirka

>  
>   /*
> -  * we need to assure no calls to traceable functions in-between the
> -  * end of fprobe_handler and the beginning of fprobe_exit_handler.
> +  * TODO: recursion detection has been done in the fgraph. Thus we need
> +  * to add a callback to increment missed counter.
>*/
> - bit = ftrace_test_recursion_trylock(fpr->entry_ip, 
> fpr->entry_parent_ip);
> - if (bit < 0) {
> - fp->nmissed++;
> + used = 0;
> + hlist_for_each_entry_from_rcu(node, hlist) {
> + void *data;
> +
> + if (node->addr != func)
> + break;
> + fp = READ_ONCE(node->fp);
> + if (!fp || fprobe_disabled(fp))
> + continue;
> +
> + if (fp->entry_data_size && fp->exit_handler)
> + data = fgraph_data + used + 1;
> + else
> + data = NULL;
> +
> + if (fprobe_shared_with_kprobes(fp))
> + ret = __fprobe_kprobe_handler(func, ret_ip, fp, fregs, 
> data);
> + else
> + ret = __fprobe_handler(func, ret_ip, fp, fregs, data);
> + /* If entry_handler returns !0, nmissed is not counted but 
> skips exit_handler. */
> + if (!ret && fp->exit_handler) {
> + int size_words = DIV_ROUND_UP(fp->entry_data_size, 
> sizeof(long));
> +
> + header = encode_fprobe_header(fp, size_words);
> + if (likely(header)) {
> + fgraph_data[used] = header;
> + used += size_words + 1;
> + }
> + }
> + }
> + if (used < reserved_words)
> + memset(fgraph_data + used, 0, reserved_words - used);
> +
> + /* If any exit_handler is set, data must be used. */
> + return used != 0;
> +}
> +NOKPROBE_SYMBOL(fprobe_entry);
> +
> +static void fprobe_return(unsigned long func, unsigned long ret_ip,
> +   struct ftrace_regs *fregs, struct fgraph_ops *gops)
> +{
> + unsigned long *fgraph_data = NULL;
> + unsigned long val;
> + struct fprobe *fp;
> +

Re: [PATCH v2] virtio_balloon: stay awake while adjusting balloon

2023-12-19 Thread David Stevens

On Mon, Dec 18, 2023 at 12:33 PM David Hildenbrand  wrote:
>
> On 18.12.23 16:18, David Stevens wrote:
> > From: David Stevens 
> >
> > A virtio_balloon's parent device may be configured so that a
> > configuration change interrupt is a wakeup event. Extend the processing
> > of such a wakeup event until the balloon finishes inflating or deflating
> > by calling pm_stay_awake/pm_relax in the virtio_balloon driver. Note
> > that these calls are no-ops if the parent device doesn't support wakeup
> > events or if the wakeup events are not enabled.
> >
> > This change allows the guest to use system power states such as s2idle
> > without running the risk the virtio_balloon's cooperative memory
> > management becoming unresponsive to the host's requests.
> >
> > Signed-off-by: David Stevens 
> > ---
> > v1 -> v2:
> >   - Use adjustment_signal_pending flag instead of a sequence number
> >   - Call pm_stay_awake/pm_relax on parent device instead of adding a wake
> > event to the virtio balloon device
> >
> >   drivers/virtio/virtio_balloon.c | 57 +++--
> >   1 file changed, 47 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/virtio/virtio_balloon.c 
> > b/drivers/virtio/virtio_balloon.c
> > index 1fe93e93f5bc..a3c11159cbe0 100644
> > --- a/drivers/virtio/virtio_balloon.c
> > +++ b/drivers/virtio/virtio_balloon.c
> > @@ -119,6 +119,11 @@ struct virtio_balloon {
> >   /* Free page reporting device */
> >   struct virtqueue *reporting_vq;
> >   struct page_reporting_dev_info pr_dev_info;
> > +
> > + /* State for keeping the wakeup_source active while adjusting the 
> > balloon */
> > + spinlock_t adjustment_lock;
> > + bool adjustment_signal_pending;
> > + bool adjustment_in_progress;
> >   };
> >
> >   static const struct virtio_device_id id_table[] = {
> > @@ -437,6 +442,31 @@ static void virtio_balloon_queue_free_page_work(struct 
> > virtio_balloon *vb)
> >   queue_work(vb->balloon_wq, &vb->report_free_page_work);
> >   }
> >
> > +static void start_update_balloon_size(struct virtio_balloon *vb)
> > +{
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&vb->adjustment_lock, flags);
> > + vb->adjustment_signal_pending = true;
> > + if (!vb->adjustment_in_progress) {
> > + vb->adjustment_in_progress = true;
> > + pm_stay_awake(vb->vdev->dev.parent);
> > + }
> > + spin_unlock_irqrestore(&vb->adjustment_lock, flags);
> > +
> > + queue_work(system_freezable_wq, &vb->update_balloon_size_work);
> > +}
> > +
> > +static void end_update_balloon_size(struct virtio_balloon *vb)
> > +{
> > + spin_lock(&vb->adjustment_lock);
> > + if (!vb->adjustment_signal_pending && vb->adjustment_in_progress) {
>
> How could vb->adjustment_in_progress ever not be set at this point?
>
> > + vb->adjustment_in_progress = false;
> > + pm_relax(vb->vdev->dev.parent);
> > + }
> > + spin_unlock(&vb->adjustment_lock);
> > +}
> > +
>
> LGTM, although I wonder what happens when calling pm_stay_awake() etc.
> on a parent device that is not wakeup-even-capable?

If the parent device is not wakeup capable or if wakeup isn't enabled,
then the vb->vdev->dev.parent->power.wakeup pointer will be NULL, so
the NULL checks in __pm_relax/__pm_stay_awake will immediately return.

-David

Re: [PATCH] ring-buffer: Fix slowpath of interrupted event

2023-12-19 Thread Google

On Mon, 18 Dec 2023 23:07:12 -0500
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> To synchronize the timestamps with the ring buffer reservation, there are
> two timestamps that are saved in the buffer meta data.
> 
> 1. before_stamp
> 2. write_stamp
> 
> When the two are equal, the write_stamp is considered valid, as in, it may
> be used to calculate the delta of the next event as the write_stamp is the
> timestamp of the previous reserved event on the buffer.
> 
> This is done by the following:
> 
>  /*A*/w = current position on the ring buffer
>   before = before_stamp
>   after = write_stamp
>   ts = read current timestamp
> 
>   if (before != after) {
>   write_stamp is not valid, force adding an absolute
>   timestamp.

(additional comment)
This happens if this caller interrupts another reservation process's B to E.
(thus the original one only updates "before_stamp", but "write_stamp" is old.)

>   }
> 
>  /*B*/before_stamp = ts
> 
>  /*C*/write = local_add_return(event length, position on ring buffer)
> 
>   if (w == write - event length) {
>   /* Nothing interrupted between A and C */
>  /*E*/write_stamp = ts;
>   delta = ts - after
>   /*
>* If nothing interrupted again,
>* before_stamp == write_stamp and write_stamp
>* can be used to calculate the delta for
>* events that come in after this one.
>*/

(additional comment)
If something interrupts between C and E, the write_stamp goes backward
because interrupted one updates the before_stamp and write_stamp with
new timestamp. But in this case, write_stamp(=ts) != before_stamp(=new ts).
Thus anyway the next entry will use absolute time stamp forcibly.

>   } else {
> 
>   /*
>* The slow path!
>* Was interrupted between A and C.
>*/
> 
> This is the place that there's a bug. We currently have:
> 
>   after = write_stamp
>   ts = read current timestamp
> 
>  /*F*/if (write == current position on the ring buffer &&
>   after < ts && cmpxchg(write_stamp, after, ts)) {
> 
>   delta = ts - after;
> 
>   } else {
>   delta = 0;
>   }
> 
> The assumption is that if the current position on the ring buffer hasn't
> moved between C and F, then it also was not interrupted, and that the last
> event written has a timestamp that matches the write_stamp. That is the
> write_stamp is valid.
> 
> But this may not be the case:
> 
> If a task context event was interrupted by softirq between B and C.
> 
> And the softirq wrote an event that got interrupted by a hard irq between
> C and E.
> 
> and the hard irq wrote an event (does not need to be interrupted)
> 
> We have:
> 
>  /*B*/ before_stamp = ts of normal context
> 
>---> interrupted by softirq
> 
>   /*B*/ before_stamp = ts of softirq context
> 
> ---> interrupted by hardirq
> 
>   /*B*/ before_stamp = ts of hard irq context
>   /*E*/ write_stamp = ts of hard irq context
> 
>   /* matches and write_stamp valid */
> <
> 
>   /*E*/ write_stamp = ts of softirq context
> 
>   /* No longer matches before_stamp, write_stamp is not valid! */
> 
><---
> 
>  w != write - length, go to slow path
> 
> // Right now the order of events in the ring buffer is:
> //
> // |-- softirq event --|-- hard irq event --|-- normal context event --|
> //
> 
>  after = write_stamp (this is the ts of softirq)
>  ts = read current timestamp
> 
>  if (write == current position on the ring buffer [true] &&
>  after < ts [true] && cmpxchg(write_stamp, after, ts) [true]) {
> 
>   delta = ts - after  [Wrong!]
> 
> The delta is to be between the hard irq event and the normal context
> event, but the above logic made the delta between the softirq event and
> the normal context event, where the hard irq event is between the two. This
> will shift all the remaining event timestamps on the sub-buffer
> incorrectly.
> 
> The write_stamp is only valid if it matches the before_stamp. The cmpxchg
> does nothing to help this.

That's right. Even if someone interrupts between A and C, we can write
the ts to write_stamp. In this case, write_stamp(old) != before_stamp(new)
so the next entry will forcibly record the absolute timestamp.
In this case, what we need to do here is using the absolute timestamp instead
of delta.

> 
> Instead, the following logic can be done to fix this:
> 
>   before = before_stamp
>   ts = read current timestamp
>   before_stamp = ts
> 
>   after = write_stamp
> 
>   if (write == current position on the ring buffer &&
>   after == before && after < ts) {
> 
>   delta = ts - after
> 
>   } else {
>

[PATCH] modules: wait do_free_init correctly

2023-12-19 Thread Changbin Du

The commit 1a7b7d922081 ("modules: Use vmalloc special flag") moves
do_free_init() into a global workqueue instead of call_rcu(). So now
we should wait it via flush_work().

Fixes: 1a7b7d922081 ("modules: Use vmalloc special flag")
Signed-off-by: Changbin Du 
Cc: Xiaoyi Su 
---
 include/linux/moduleloader.h | 2 ++
 init/main.c  | 5 +++--
 kernel/module/main.c | 5 +
 3 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/include/linux/moduleloader.h b/include/linux/moduleloader.h
index 001b2ce83832..f3d445d8ccd0 100644
--- a/include/linux/moduleloader.h
+++ b/include/linux/moduleloader.h
@@ -115,6 +115,8 @@ int module_finalize(const Elf_Ehdr *hdr,
const Elf_Shdr *sechdrs,
struct module *mod);
 
+void flush_module_init_free_work(void);
+
 /* Any cleanup needed when module leaves. */
 void module_arch_cleanup(struct module *mod);
 
diff --git a/init/main.c b/init/main.c
index e24b0780fdff..f0b7e21ac67f 100644
--- a/init/main.c
+++ b/init/main.c
@@ -99,6 +99,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -1402,11 +1403,11 @@ static void mark_readonly(void)
if (rodata_enabled) {
/*
 * load_module() results in W+X mappings, which are cleaned
-* up with call_rcu().  Let's make sure that queued work is
+* up with init_free_wq. Let's make sure that queued work is
 * flushed so that we don't hit false positives looking for
 * insecure pages which are W+X.
 */
-   rcu_barrier();
+   flush_module_init_free_work();
mark_rodata_ro();
rodata_test();
} else
diff --git a/kernel/module/main.c b/kernel/module/main.c
index 98fedfdb8db5..1943ccb7414f 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -2486,6 +2486,11 @@ static void do_free_init(struct work_struct *w)
}
 }
 
+void flush_module_init_free_work(void)
+{
+   flush_work(&init_free_wq);
+}
+
 #undef MODULE_PARAM_PREFIX
 #define MODULE_PARAM_PREFIX "module."
 /* Default value for module->async_probe_requested */
-- 
2.25.1

Re: [PATCH vhost v2 4/8] vdpa/mlx5: Mark vq addrs for modification in hw vq

2023-12-19 Thread Eugenio Perez Martin

On Tue, Dec 19, 2023 at 12:16 PM Dragos Tatulea  wrote:
>
> On Tue, 2023-12-19 at 08:24 +0100, Eugenio Perez Martin wrote:
> > On Mon, Dec 18, 2023 at 2:58 PM Dragos Tatulea  wrote:
> > >
> > > On Mon, 2023-12-18 at 13:06 +0100, Eugenio Perez Martin wrote:
> > > > On Mon, Dec 18, 2023 at 11:52 AM Dragos Tatulea  
> > > > wrote:
> > > > >
> > > > > On Mon, 2023-12-18 at 11:16 +0100, Eugenio Perez Martin wrote:
> > > > > > On Sat, Dec 16, 2023 at 12:03 PM Dragos Tatulea 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Fri, 2023-12-15 at 18:56 +0100, Eugenio Perez Martin wrote:
> > > > > > > > On Fri, Dec 15, 2023 at 3:13 PM Dragos Tatulea 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > On Fri, 2023-12-15 at 12:35 +, Dragos Tatulea wrote:
> > > > > > > > > > On Thu, 2023-12-14 at 19:30 +0100, Eugenio Perez Martin 
> > > > > > > > > > wrote:
> > > > > > > > > > > On Thu, Dec 14, 2023 at 4:51 PM Dragos Tatulea 
> > > > > > > > > > >  wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, 2023-12-14 at 08:45 -0500, Michael S. Tsirkin 
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > On Thu, Dec 14, 2023 at 01:39:55PM +, Dragos 
> > > > > > > > > > > > > Tatulea wrote:
> > > > > > > > > > > > > > On Tue, 2023-12-12 at 15:44 -0800, Si-Wei Liu wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On 12/12/2023 11:21 AM, Eugenio Perez Martin 
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > > On Tue, Dec 5, 2023 at 11:46 AM Dragos Tatulea 
> > > > > > > > > > > > > > > >  wrote:
> > > > > > > > > > > > > > > > > Addresses get set by .set_vq_address. hw vq 
> > > > > > > > > > > > > > > > > addresses will be updated on
> > > > > > > > > > > > > > > > > next modify_virtqueue.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Signed-off-by: Dragos Tatulea 
> > > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > > Reviewed-by: Gal Pressman 
> > > > > > > > > > > > > > > > > Acked-by: Eugenio Pérez 
> > > > > > > > > > > > > > > > I'm kind of ok with this patch and the next one 
> > > > > > > > > > > > > > > > about state, but I
> > > > > > > > > > > > > > > > didn't ack them in the previous series.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > My main concern is that it is not valid to 
> > > > > > > > > > > > > > > > change the vq address after
> > > > > > > > > > > > > > > > DRIVER_OK in VirtIO, which vDPA follows. Only 
> > > > > > > > > > > > > > > > memory maps are ok to
> > > > > > > > > > > > > > > > change at this moment. I'm not sure about vq 
> > > > > > > > > > > > > > > > state in vDPA, but vhost
> > > > > > > > > > > > > > > > forbids changing it with an active backend.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Suspend is not defined in VirtIO at this moment 
> > > > > > > > > > > > > > > > though, so maybe it is
> > > > > > > > > > > > > > > > ok to decide that all of these parameters may 
> > > > > > > > > > > > > > > > change during suspend.
> > > > > > > > > > > > > > > > Maybe the best thing is to protect this with a 
> > > > > > > > > > > > > > > > vDPA feature flag.
> > > > > > > > > > > > > > > I think protect with vDPA feature flag could 
> > > > > > > > > > > > > > > work, while on the other
> > > > > > > > > > > > > > > hand vDPA means vendor specific optimization is 
> > > > > > > > > > > > > > > possible around suspend
> > > > > > > > > > > > > > > and resume (in case it helps performance), which 
> > > > > > > > > > > > > > > doesn't have to be
> > > > > > > > > > > > > > > backed by virtio spec. Same applies to vhost user 
> > > > > > > > > > > > > > > backend features,
> > > > > > > > > > > > > > > variations there were not backed by spec either. 
> > > > > > > > > > > > > > > Of course, we should
> > > > > > > > > > > > > > > try best to make the default behavior backward 
> > > > > > > > > > > > > > > compatible with
> > > > > > > > > > > > > > > virtio-based backend, but that circles back to no 
> > > > > > > > > > > > > > > suspend definition in
> > > > > > > > > > > > > > > the current virtio spec, for which I hope we 
> > > > > > > > > > > > > > > don't cease development on
> > > > > > > > > > > > > > > vDPA indefinitely. After all, the virtio based 
> > > > > > > > > > > > > > > vdap backend can well
> > > > > > > > > > > > > > > define its own feature flag to describe (minor 
> > > > > > > > > > > > > > > difference in) the
> > > > > > > > > > > > > > > suspend behavior based on the later spec once it 
> > > > > > > > > > > > > > > is formed in future.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > So what is the way forward here? From what I 
> > > > > > > > > > > > > > understand the options are:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1) Add a vdpa feature flag for changing device 
> > > > > > > > > > > > > > properties while suspended.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 2) Drop these 2 patches f

Re: [PATCH v5 24/34] fprobe: Use ftrace_regs in fprobe entry handler

2023-12-19 Thread Jiri Olsa

On Mon, Dec 18, 2023 at 10:16:23PM +0900, Masami Hiramatsu (Google) wrote:

SNIP

> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index 84e8a0f6e4e0..d3f8745d8ead 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -2503,7 +2503,7 @@ static int __init bpf_event_init(void)
>  fs_initcall(bpf_event_init);
>  #endif /* CONFIG_MODULES */
>  
> -#ifdef CONFIG_FPROBE
> +#if defined(CONFIG_FPROBE) && defined(CONFIG_DYNAMIC_FTRACE_WITH_REGS)
>  struct bpf_kprobe_multi_link {
>   struct bpf_link link;
>   struct fprobe fp;
> @@ -2733,10 +2733,14 @@ kprobe_multi_link_prog_run(struct 
> bpf_kprobe_multi_link *link,
>  
>  static int
>  kprobe_multi_link_handler(struct fprobe *fp, unsigned long fentry_ip,
> -   unsigned long ret_ip, struct pt_regs *regs,
> +   unsigned long ret_ip, struct ftrace_regs *fregs,
> void *data)
>  {
>   struct bpf_kprobe_multi_link *link;
> + struct pt_regs *regs = ftrace_get_regs(fregs);
> +
> + if (!regs)
> + return 0;
>  
>   link = container_of(fp, struct bpf_kprobe_multi_link, fp);
>   kprobe_multi_link_prog_run(link, get_entry_ip(fentry_ip), regs);
> @@ -3008,7 +3012,7 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr 
> *attr, struct bpf_prog *pr
>   kvfree(cookies);
>   return err;
>  }
> -#else /* !CONFIG_FPROBE */
> +#else /* !CONFIG_FPROBE || !CONFIG_DYNAMIC_FTRACE_WITH_REGS */
>  int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog 
> *prog)
>  {
>   return -EOPNOTSUPP;
> diff --git a/kernel/trace/fprobe.c b/kernel/trace/fprobe.c
> index 6cd2a4e3afb8..f12569494d8a 100644
> --- a/kernel/trace/fprobe.c
> +++ b/kernel/trace/fprobe.c
> @@ -46,7 +46,7 @@ static inline void __fprobe_handler(unsigned long ip, 
> unsigned long parent_ip,
>   }
>  
>   if (fp->entry_handler)
> - ret = fp->entry_handler(fp, ip, parent_ip, 
> ftrace_get_regs(fregs), entry_data);
> + ret = fp->entry_handler(fp, ip, parent_ip, fregs, entry_data);
>  
>   /* If entry_handler returns !0, nmissed is not counted. */
>   if (rh) {
> @@ -182,7 +182,7 @@ static void fprobe_init(struct fprobe *fp)
>   fp->ops.func = fprobe_kprobe_handler;
>   else
>   fp->ops.func = fprobe_handler;
> - fp->ops.flags |= FTRACE_OPS_FL_SAVE_REGS;
> + fp->ops.flags |= FTRACE_OPS_FL_SAVE_ARGS;

so with this change you move to ftrace_caller trampoline,
but we need ftrace_regs_caller right?

otherwise the (!regs) check in kprobe_multi_link_handler
will be allways true IIUC

jirka

>  }
>  
>  static int fprobe_init_rethook(struct fprobe *fp, int num)
> diff --git a/kernel/trace/trace_fprobe.c b/kernel/trace/trace_fprobe.c
> index 7d2ddbcfa377..ef6b36fd05ae 100644
> --- a/kernel/trace/trace_fprobe.c
> +++ b/kernel/trace/trace_fprobe.c
> @@ -320,12 +320,16 @@ NOKPROBE_SYMBOL(fexit_perf_func);
>  #endif   /* CONFIG_PERF_EVENTS */
>  
>  static int fentry_dispatcher(struct fprobe *fp, unsigned long entry_ip,
> -  unsigned long ret_ip, struct pt_regs *regs,
> +  unsigned long ret_ip, struct ftrace_regs *fregs,
>void *entry_data)
>  {
>   struct trace_fprobe *tf = container_of(fp, struct trace_fprobe, fp);
> + struct pt_regs *regs = ftrace_get_regs(fregs);
>   int ret = 0;
>  
> + if (!regs)
> + return 0;
> +
>   if (trace_probe_test_flag(&tf->tp, TP_FLAG_TRACE))
>   fentry_trace_func(tf, entry_ip, regs);
>  #ifdef CONFIG_PERF_EVENTS
> diff --git a/lib/test_fprobe.c b/lib/test_fprobe.c
> index 24de0e5ff859..ff607babba18 100644
> --- a/lib/test_fprobe.c
> +++ b/lib/test_fprobe.c
> @@ -40,7 +40,7 @@ static noinline u32 fprobe_selftest_nest_target(u32 value, 
> u32 (*nest)(u32))
>  
>  static notrace int fp_entry_handler(struct fprobe *fp, unsigned long ip,
>   unsigned long ret_ip,
> - struct pt_regs *regs, void *data)
> + struct ftrace_regs *fregs, void *data)
>  {
>   KUNIT_EXPECT_FALSE(current_test, preemptible());
>   /* This can be called on the fprobe_selftest_target and the 
> fprobe_selftest_target2 */
> @@ -81,7 +81,7 @@ static notrace void fp_exit_handler(struct fprobe *fp, 
> unsigned long ip,
>  
>  static notrace int nest_entry_handler(struct fprobe *fp, unsigned long ip,
> unsigned long ret_ip,
> -   struct pt_regs *regs, void *data)
> +   struct ftrace_regs *fregs, void *data)
>  {
>   KUNIT_EXPECT_FALSE(current_test, preemptible());
>   return 0;
> diff --git a/samples/fprobe/fprobe_example.c b/samples/fprobe/fprobe_example.c
> index 64e715e7ed11..1545a1aac616 100644
> --- a/samples/fprobe/fprobe_example.c
> +++ b/samples/fprobe/fprobe_e

Re: [PATCH v5 06/34] function_graph: Allow multiple users to attach to function graph

2023-12-19 Thread Jiri Olsa

On Mon, Dec 18, 2023 at 10:12:45PM +0900, Masami Hiramatsu (Google) wrote:

SNIP

>  /* Both enabled by default (can be cleared by function_graph tracer flags */
>  static bool fgraph_sleep_time = true;
>  
> @@ -126,9 +247,34 @@ ftrace_push_return_trace(unsigned long ret, unsigned 
> long func,
>   calltime = trace_clock_local();
>  
>   index = current->curr_ret_stack;
> - RET_STACK_INC(current->curr_ret_stack);
> + /* ret offset = 1 ; type = reserved */
> + current->ret_stack[index + FGRAPH_RET_INDEX] = 1;
>   ret_stack = RET_STACK(current, index);
> + ret_stack->ret = ret;
> + /*
> +  * The unwinders expect curr_ret_stack to point to either zero
> +  * or an index where to find the next ret_stack. Even though the
> +  * ret stack might be bogus, we want to write the ret and the
> +  * index to find the ret_stack before we increment the stack point.
> +  * If an interrupt comes in now before we increment the curr_ret_stack
> +  * it may blow away what we wrote. But that's fine, because the
> +  * index will still be correct (even though the 'ret' won't be).
> +  * What we worry about is the index being correct after we increment
> +  * the curr_ret_stack and before we update that index, as if an
> +  * interrupt comes in and does an unwind stack dump, it will need
> +  * at least a correct index!
> +  */
>   barrier();
> + current->curr_ret_stack += FGRAPH_RET_INDEX + 1;
> + /*
> +  * This next barrier is to ensure that an interrupt coming in
> +  * will not corrupt what we are about to write.
> +  */
> + barrier();
> +
> + /* Still keep it reserved even if an interrupt came in */
> + current->ret_stack[index + FGRAPH_RET_INDEX] = 1;

seems like this was set already few lines above?

jirka

> +
>   ret_stack->ret = ret;
>   ret_stack->func = func;
>   ret_stack->calltime = calltime;
> @@ -159,6 +305,12 @@ int function_graph_enter(unsigned long ret, unsigned 
> long func,
>unsigned long frame_pointer, unsigned long *retp)
>  {
>   struct ftrace_graph_ent trace;
> + int offset;
> + int start;
> + int type;
> + int val;
> + int cnt = 0;
> + int i;
>  
>  #ifndef CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS
>   /*

SNIP

Re: [PATCH v5 24/34] fprobe: Use ftrace_regs in fprobe entry handler

2023-12-19 Thread Jiri Olsa

On Mon, Dec 18, 2023 at 10:16:23PM +0900, Masami Hiramatsu (Google) wrote:
> From: Masami Hiramatsu (Google) 
> 
> This allows fprobes to be available with CONFIG_DYNAMIC_FTRACE_WITH_ARGS
> instead of CONFIG_DYNAMIC_FTRACE_WITH_REGS, then we can enable fprobe
> on arm64.
> 
> Signed-off-by: Masami Hiramatsu (Google) 
> Acked-by: Florent Revest 

this change breaks kprobe multi bpf tests (crash below), which are
partially fixed by [1] later on, but I think we have to keep
bisecting crash free

it looks like the rethook will get wrong pointer.. I'm still trying
to digest the whole thing, so I might have some updates later ;-)

jirka


[1] fprobe: Rewrite fprobe on function-graph tracer
---
Dec 19 13:50:04 qemu kernel: BUG: kernel NULL pointer dereference, address: 
0098
Dec 19 13:50:04 qemu kernel: #PF: supervisor read access in kernel mode
Dec 19 13:50:04 qemu kernel: #PF: error_code(0x) - not-present page
Dec 19 13:50:04 qemu kernel: PGD 10955f067 P4D 10955f067 PUD 103113067 PMD 0 
Dec 19 13:50:04 qemu kernel: Oops:  [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN 
NOPTI
Dec 19 13:50:04 qemu kernel: CPU: 1 PID: 747 Comm: test_progs Tainted: GB   
   OE  6.7.0-rc3+ #194 85bc8297edbc7f21acfc743dabbd52cac073a6bf
Dec 19 13:50:04 qemu kernel: Hardware name: QEMU Standard PC (Q35 + ICH9, 
2009), BIOS 1.16.2-1.fc38 04/01/2014
Dec 19 13:50:04 qemu kernel: RIP: 0010:arch_rethook_prepare+0x18/0x60
Dec 19 13:50:04 qemu kernel: Code: 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 
90 90 90 41 55 41 54 55 48 89 f5 53 48 89 fb 48 8d be 98 00 00 00 e8 68 8f 59 >
Dec 19 13:50:04 qemu kernel: RSP: 0018:888125f97a88 EFLAGS: 00010286
Dec 19 13:50:04 qemu kernel: RAX: 0001 RBX: 88818a231410 RCX: 
812190b6
Dec 19 13:50:04 qemu kernel: RDX: fbfff0c42e95 RSI: 0008 RDI: 
862174a0
Dec 19 13:50:04 qemu kernel: RBP:  R08: 0001 R09: 
fbfff0c42e94
Dec 19 13:50:04 qemu kernel: R10: 862174a7 R11:  R12: 
88818a231420
Dec 19 13:50:04 qemu kernel: R13: 8283ee8e R14: 88818a231410 R15: 
fff7
Dec 19 13:50:04 qemu kernel: FS:  7ff8a16cfd00() 
GS:88842c60() knlGS:
Dec 19 13:50:05 qemu kernel: CS:  0010 DS:  ES:  CR0: 80050033
Dec 19 13:50:05 qemu kernel: CR2: 0098 CR3: 00010633c005 CR4: 
00770ef0
Dec 19 13:50:05 qemu kernel: PKRU: 5554
Dec 19 13:50:05 qemu kernel: Call Trace:
Dec 19 13:50:05 qemu kernel:  
Dec 19 13:50:05 qemu kernel:  ? __die+0x1f/0x70
Dec 19 13:50:05 qemu kernel:  ? page_fault_oops+0x215/0x620
Dec 19 13:50:05 qemu kernel:  ? rcu_is_watching+0x34/0x60
Dec 19 13:50:05 qemu kernel:  ? __pfx_page_fault_oops+0x10/0x10
Dec 19 13:50:05 qemu kernel:  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
Dec 19 13:50:05 qemu kernel:  ? do_user_addr_fault+0x4b3/0x910
Dec 19 13:50:05 qemu kernel:  ? exc_page_fault+0x77/0x130
Dec 19 13:50:05 qemu kernel:  ? asm_exc_page_fault+0x22/0x30
Dec 19 13:50:05 qemu kernel:  ? bpf_prog_test_run_tracing+0x1ce/0x2d0
Dec 19 13:50:05 qemu kernel:  ? add_taint+0x26/0x90
Dec 19 13:50:05 qemu kernel:  ? arch_rethook_prepare+0x18/0x60
Dec 19 13:50:05 qemu kernel:  ? arch_rethook_prepare+0x18/0x60
Dec 19 13:50:05 qemu kernel:  ? bpf_prog_test_run_tracing+0x1ce/0x2d0
Dec 19 13:50:05 qemu kernel:  rethook_hook+0x1e/0x50
Dec 19 13:50:05 qemu kernel:  ? __pfx_bpf_fentry_test1+0x10/0x10
Dec 19 13:50:05 qemu kernel:  ? bpf_prog_test_run_tracing+0x1ce/0x2d0
Dec 19 13:50:05 qemu kernel:  fprobe_handler+0x1ca/0x350
Dec 19 13:50:05 qemu kernel:  ? __pfx_bpf_fentry_test1+0x10/0x10
Dec 19 13:50:05 qemu kernel:  arch_ftrace_ops_list_func+0x143/0x2e0
Dec 19 13:50:05 qemu kernel:  ? bpf_prog_test_run_tracing+0x1ce/0x2d0
Dec 19 13:50:05 qemu kernel:  ftrace_call+0x5/0x44
Dec 19 13:50:05 qemu kernel:  ? __pfx_lock_release+0x10/0x10
Dec 19 13:50:05 qemu kernel:  ? rcu_is_watching+0x34/0x60
Dec 19 13:50:05 qemu kernel:  ? bpf_prog_test_run_tracing+0xcd/0x2d0
Dec 19 13:50:05 qemu kernel:  ? bpf_fentry_test1+0x5/0x10
Dec 19 13:50:05 qemu kernel:  ? rcu_is_watching+0x34/0x60
Dec 19 13:50:05 qemu kernel:  bpf_fentry_test1+0x5/0x10
Dec 19 13:50:05 qemu kernel:  bpf_prog_test_run_tracing+0x1ce/0x2d0
Dec 19 13:50:05 qemu kernel:  ? __pfx_lock_release+0x10/0x10
Dec 19 13:50:05 qemu kernel:  ? __pfx_bpf_prog_test_run_tracing+0x10/0x10
Dec 19 13:50:05 qemu kernel:  ? __pfx_lock_release+0x10/0x10
Dec 19 13:50:05 qemu kernel:  ? __fget_light+0xdf/0x100
Dec 19 13:50:05 qemu kernel:  ? __bpf_prog_get+0x107/0x150
Dec 19 13:50:05 qemu kernel:  __sys_bpf+0x552/0x2ef0
Dec 19 13:50:05 qemu kernel:  ? rcu_is_watching+0x34/0x60
Dec 19 13:50:05 qemu kernel:  ? __pfx___sys_bpf+0x10/0x10
Dec 19 13:50:05 qemu kernel:  ? __pfx_lock_release+0x10/0x10
Dec 19 13:50:05 qemu kernel:  ? vfs_write+0x1fa/0x740
Dec 19 13:50:05 qemu kernel:  ? rcu_is_watching+0x34/0x60
Dec 19 13:50:05 qemu kernel:  ? rcu_is_watching+0x34/0x60
Dec 19 13:50:05

[PATCH] ring-buffer: Check if absolute timestamp goes backwards

2023-12-19 Thread Steven Rostedt

From: "Steven Rostedt (Google)" 

The check_buffer() which checks the timestamps of the ring buffer
sub-buffer page, when enabled, only checks if the adding of deltas of the
events from the last absolute timestamp or the timestamp of the sub-buffer
page adds up to the current event.

What it does not check is if the absolute timestamp causes the time of the
events to go backwards, as that can cause issues elsewhere.

Test for the timestamp going backwards too.

This also fixes a slight issue where if the warning triggers at boot up
(because of the resetting of the tsc), it will disable all further checks,
even those that are after boot Have it continue checking if the warning
was ignored during boot up.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 46 +++---
 1 file changed, 28 insertions(+), 18 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index c0cc45482e1e..f7dc74e45ebf 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3309,6 +3309,23 @@ static void dump_buffer_page(struct buffer_data_page 
*bpage,
 static DEFINE_PER_CPU(atomic_t, checking);
 static atomic_t ts_dump;
 
+#define buffer_warn_return(fmt, ...)   \
+   do {\
+   /* If another report is happening, ignore this one */   \
+   if (atomic_inc_return(&ts_dump) != 1) { \
+   atomic_dec(&ts_dump);   \
+   goto out;   \
+   }   \
+   atomic_inc(&cpu_buffer->record_disabled);   \
+   pr_warn(fmt, ##__VA_ARGS__);\
+   dump_buffer_page(bpage, info, tail);\
+   atomic_dec(&ts_dump);   \
+   /* There's some cases in boot up that this can happen */ \
+   if (WARN_ON_ONCE(system_state != SYSTEM_BOOTING))   \
+   /* Do not re-enable checking */ \
+   return; \
+   } while (0)
+
 /*
  * Check if the current event time stamp matches the deltas on
  * the buffer page.
@@ -3362,7 +3379,12 @@ static void check_buffer(struct ring_buffer_per_cpu 
*cpu_buffer,
 
case RINGBUF_TYPE_TIME_STAMP:
delta = rb_event_time_stamp(event);
-   ts = rb_fix_abs_ts(delta, ts);
+   delta = rb_fix_abs_ts(delta, ts);
+   if (delta < ts) {
+   buffer_warn_return("[CPU: %d]ABSOLUTE TIME WENT 
BACKWARDS: last ts: %lld absolute ts: %lld\n",
+  cpu_buffer->cpu, ts, delta);
+   }
+   ts = delta;
break;
 
case RINGBUF_TYPE_PADDING:
@@ -3379,23 +3401,11 @@ static void check_buffer(struct ring_buffer_per_cpu 
*cpu_buffer,
}
if ((full && ts > info->ts) ||
(!full && ts + info->delta != info->ts)) {
-   /* If another report is happening, ignore this one */
-   if (atomic_inc_return(&ts_dump) != 1) {
-   atomic_dec(&ts_dump);
-   goto out;
-   }
-   atomic_inc(&cpu_buffer->record_disabled);
-   /* There's some cases in boot up that this can happen */
-   WARN_ON_ONCE(system_state != SYSTEM_BOOTING);
-   pr_warn("[CPU: %d]TIME DOES NOT MATCH expected:%lld actual:%lld 
delta:%lld before:%lld after:%lld%s context:%s\n",
-   cpu_buffer->cpu,
-   ts + info->delta, info->ts, info->delta,
-   info->before, info->after,
-   full ? " (full)" : "", show_interrupt_level());
-   dump_buffer_page(bpage, info, tail);
-   atomic_dec(&ts_dump);
-   /* Do not re-enable checking */
-   return;
+   buffer_warn_return("[CPU: %d]TIME DOES NOT MATCH expected:%lld 
actual:%lld delta:%lld before:%lld after:%lld%s context:%s\n",
+  cpu_buffer->cpu,
+  ts + info->delta, info->ts, info->delta,
+  info->before, info->after,
+  full ? " (full)" : "", 
show_interrupt_level());
}
 out:
atomic_dec(this_cpu_ptr(&checking));
-- 
2.42.0

[PATCH] ring-buffer: Add interrupt information to dump of data sub-buffer

2023-12-19 Thread Steven Rostedt

From: "Steven Rostedt (Google)" 

When the ring buffer timestamp verifier triggers, it dumps the content of
the sub-buffer. But currently it only dumps the timestamps and the offset
of the data as well as the deltas. It would be even more informative if
the event data also showed the interrupt context level it was in.

That is, if each event showed that the event was written in normal,
softirq, irq or NMI context. Then a better idea about how the events may
have been interrupted from each other.

As the payload of the ring buffer is really a black box of the ring
buffer, just assume that if the payload is larger than a trace entry, that
it is a trace entry. As trace entries have the interrupt context
information saved in a flags field, look at that location and report the
output of the flags.

If the payload is not a trace entry, there's no way to really know, and
the information will be garbage. But that's OK, because this is for
debugging only (this output is not used in production as the buffer check
that calls it causes a huge overhead to the tracing). This information,
when available, is crucial for debugging timestamp issues. If it's
garbage, it will also be pretty obvious that its garbage too.

As this output usually happens in kselftests of the tracing code, the user
will know what the payload is at the time.

Suggested-by: Joel Fernandes (Google) 
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 79 --
 1 file changed, 75 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index a4ada4f303c4..c0cc45482e1e 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -3185,6 +3185,76 @@ EXPORT_SYMBOL_GPL(ring_buffer_unlock_commit);
 #define CHECK_FULL_PAGE1L
 
 #ifdef CONFIG_RING_BUFFER_VALIDATE_TIME_DELTAS
+
+static const char *show_irq_str(int bits)
+{
+   const char *type[] = {
+   ".",// 0
+   "s",// 1
+   "h",// 2
+   "Hs",   // 3
+   "n",// 4
+   "Ns",   // 5
+   "Nh",   // 6
+   "NHs",  // 7
+   };
+
+   return type[bits];
+}
+
+/* Assume this is an trace event */
+static const char *show_flags(struct ring_buffer_event *event)
+{
+   struct trace_entry *entry;
+   int bits = 0;
+
+   if (rb_event_data_length(event) - RB_EVNT_HDR_SIZE < sizeof(*entry))
+   return "X";
+
+   entry = ring_buffer_event_data(event);
+
+   if (entry->flags & TRACE_FLAG_SOFTIRQ)
+   bits |= 1;
+
+   if (entry->flags & TRACE_FLAG_HARDIRQ)
+   bits |= 2;
+
+   if (entry->flags & TRACE_FLAG_NMI)
+   bits |= 4;
+
+   return show_irq_str(bits);
+}
+
+static const char *show_irq(struct ring_buffer_event *event)
+{
+   struct trace_entry *entry;
+
+   if (rb_event_data_length(event) - RB_EVNT_HDR_SIZE < sizeof(*entry))
+   return "";
+
+   entry = ring_buffer_event_data(event);
+   if (entry->flags & TRACE_FLAG_IRQS_OFF)
+   return "d";
+   return "";
+}
+
+static const char *show_interrupt_level(void)
+{
+   unsigned long pc = preempt_count();
+   unsigned char level = 0;
+
+   if (pc & SOFTIRQ_OFFSET)
+   level |= 1;
+
+   if (pc & HARDIRQ_MASK)
+   level |= 2;
+
+   if (pc & NMI_MASK)
+   level |= 4;
+
+   return show_irq_str(level);
+}
+
 static void dump_buffer_page(struct buffer_data_page *bpage,
 struct rb_event_info *info,
 unsigned long tail)
@@ -3224,8 +3294,9 @@ static void dump_buffer_page(struct buffer_data_page 
*bpage,
 
case RINGBUF_TYPE_DATA:
ts += event->time_delta;
-   pr_warn(" 0x%x:  [%lld] delta:%d\n",
-   e, ts, event->time_delta);
+   pr_warn(" 0x%x:  [%lld] delta:%d %s%s\n",
+   e, ts, event->time_delta,
+   show_flags(event), show_irq(event));
break;
 
default:
@@ -3316,11 +3387,11 @@ static void check_buffer(struct ring_buffer_per_cpu 
*cpu_buffer,
atomic_inc(&cpu_buffer->record_disabled);
/* There's some cases in boot up that this can happen */
WARN_ON_ONCE(system_state != SYSTEM_BOOTING);
-   pr_warn("[CPU: %d]TIME DOES NOT MATCH expected:%lld actual:%lld 
delta:%lld before:%lld after:%lld%s\n",
+   pr_warn("[CPU: %d]TIME DOES NOT MATCH expected:%lld actual:%lld 
delta:%lld before:%lld after:%lld%s context:%s\n",
cpu_buffer->cpu,
ts + info->delta, info->ts, info->delta,
info->before, info->after,
-   full ? " (full)" : "");
+

[PATCH v3] ring-buffer: Remove 32bit timestamp logic

2023-12-19 Thread Steven Rostedt

From: "Steven Rostedt (Google)" 

Each event has a 27 bit timestamp delta that is used to hold the delta
from the last event. If the time between events is greater than 2^27, then
a timestamp is added that holds a 59 bit absolute timestamp.

Until a389d86f7fd09 ("ring-buffer: Have nested events still record running
time stamp"), if an interrupt interrupted an event in progress, all the
events delta would be zero to not deal with the races that need to be
handled. The commit a389d86f7fd09 changed that to handle the races giving
all events, even those that preempt other events, still have an accurate
timestamp.

To handle those races requires performing 64-bit cmpxchg on the
timestamps. But doing 64-bit cmpxchg on 32-bit architectures is considered
very slow. To try to deal with this the timestamp logic was broken into
two and then three 32-bit cmpxchgs, with the thought that two (or three)
32-bit cmpxchgs are still faster than a single 64-bit cmpxchg on 32-bit
architectures.

Part of the problem with this is that I didn't have any 32-bit
architectures to test on. After hitting several subtle bugs in this code,
an effort was made to try and see if three 32-bit cmpxchgs are indeed
faster than a single 64-bit. After a few people brushed off the dust of
their old 32-bit machines, tests were done, and even though 32-bit cmpxchg
was faster than a single 64-bit, it was in the order of 50% at best, not
300%.

After some more refactoring of the code, all 4 64-bit cmpxchg were removed:

 
https://lore.kernel.org/linux-trace-kernel/2023124420.36dde...@gandalf.local.home
 
https://lore.kernel.org/linux-trace-kernel/20231214222921.19303...@gandalf.local.home
 
https://lore.kernel.org/linux-trace-kernel/20231215081810.1f4f3...@rorschach.local.home
 
https://lore.kernel.org/linux-trace-kernel/20231218230712.3a76b...@gandalf.local.home/

With all the 64-bit cmpxchg removed, the complex 32-bit workaround can also be
removed.

The 32-bit and 64-bit logic is now exactly the same.

Link: https://lore.kernel.org/all/20231213214632.15047...@gandalf.local.home/

Signed-off-by: Steven Rostedt (Google) 
---
Changes since v2: 
https://lkml.kernel.org/linux-trace-kernel/20231218194336.159193...@goodmis.org

- Refactored to reflect that all 64-bit cmpxchg has been removed

 kernel/trace/ring_buffer.c | 176 ++---
 1 file changed, 9 insertions(+), 167 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 16b640d824f9..a4ada4f303c4 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 
+#include 
 #include 
 
 /*
@@ -463,27 +464,9 @@ enum {
RB_CTX_MAX
 };
 
-#if BITS_PER_LONG == 32
-#define RB_TIME_32
-#endif
-
-/* To test on 64 bit machines */
-//#define RB_TIME_32
-
-#ifdef RB_TIME_32
-
-struct rb_time_struct {
-   local_t cnt;
-   local_t top;
-   local_t bottom;
-   local_t msb;
-};
-#else
-#include 
 struct rb_time_struct {
local64_t   time;
 };
-#endif
 typedef struct rb_time_struct rb_time_t;
 
 #define MAX_NEST   5
@@ -573,147 +556,14 @@ struct ring_buffer_iter {
int missed_events;
 };
 
-#ifdef RB_TIME_32
-
-/*
- * On 32 bit machines, local64_t is very expensive. As the ring
- * buffer doesn't need all the features of a true 64 bit atomic,
- * on 32 bit, it uses these functions (64 still uses local64_t).
- *
- * For the ring buffer, 64 bit required operations for the time is
- * the following:
- *
- *  - Reads may fail if it interrupted a modification of the time stamp.
- *  It will succeed if it did not interrupt another write even if
- *  the read itself is interrupted by a write.
- *  It returns whether it was successful or not.
- *
- *  - Writes always succeed and will overwrite other writes and writes
- *  that were done by events interrupting the current write.
- *
- *  - A write followed by a read of the same time stamp will always succeed,
- *  but may not contain the same value.
- *
- *  - A cmpxchg will fail if it interrupted another write or cmpxchg.
- *  Other than that, it acts like a normal cmpxchg.
- *
- * The 60 bit time stamp is broken up by 30 bits in a top and bottom half
- *  (bottom being the least significant 30 bits of the 60 bit time stamp).
- *
- * The two most significant bits of each half holds a 2 bit counter (0-3).
- * Each update will increment this counter by one.
- * When reading the top and bottom, if the two counter bits match then the
- *  top and bottom together make a valid 60 bit number.
- */
-#define RB_TIME_SHIFT  30
-#define RB_TIME_VAL_MASK ((1 << RB_TIME_SHIFT) - 1)
-#define RB_TIME_MSB_SHIFT   60
-
-static inline int rb_time_cnt(unsigned long val)
-{
-   return (val >> RB_TIME_SHIFT) & 3;
-}
-
-static inline u64 rb_time_val(unsigned long top, unsigned long bottom)
-{
-   u64 val;
-
-   val = top & RB_TIME_VAL_MA

Re: [PATCH vhost v2 4/8] vdpa/mlx5: Mark vq addrs for modification in hw vq

2023-12-19 Thread Dragos Tatulea

On Tue, 2023-12-19 at 08:24 +0100, Eugenio Perez Martin wrote:
> On Mon, Dec 18, 2023 at 2:58 PM Dragos Tatulea  wrote:
> > 
> > On Mon, 2023-12-18 at 13:06 +0100, Eugenio Perez Martin wrote:
> > > On Mon, Dec 18, 2023 at 11:52 AM Dragos Tatulea  
> > > wrote:
> > > > 
> > > > On Mon, 2023-12-18 at 11:16 +0100, Eugenio Perez Martin wrote:
> > > > > On Sat, Dec 16, 2023 at 12:03 PM Dragos Tatulea  
> > > > > wrote:
> > > > > > 
> > > > > > On Fri, 2023-12-15 at 18:56 +0100, Eugenio Perez Martin wrote:
> > > > > > > On Fri, Dec 15, 2023 at 3:13 PM Dragos Tatulea 
> > > > > > >  wrote:
> > > > > > > > 
> > > > > > > > On Fri, 2023-12-15 at 12:35 +, Dragos Tatulea wrote:
> > > > > > > > > On Thu, 2023-12-14 at 19:30 +0100, Eugenio Perez Martin wrote:
> > > > > > > > > > On Thu, Dec 14, 2023 at 4:51 PM Dragos Tatulea 
> > > > > > > > > >  wrote:
> > > > > > > > > > > 
> > > > > > > > > > > On Thu, 2023-12-14 at 08:45 -0500, Michael S. Tsirkin 
> > > > > > > > > > > wrote:
> > > > > > > > > > > > On Thu, Dec 14, 2023 at 01:39:55PM +, Dragos 
> > > > > > > > > > > > Tatulea wrote:
> > > > > > > > > > > > > On Tue, 2023-12-12 at 15:44 -0800, Si-Wei Liu wrote:
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > On 12/12/2023 11:21 AM, Eugenio Perez Martin wrote:
> > > > > > > > > > > > > > > On Tue, Dec 5, 2023 at 11:46 AM Dragos Tatulea 
> > > > > > > > > > > > > > >  wrote:
> > > > > > > > > > > > > > > > Addresses get set by .set_vq_address. hw vq 
> > > > > > > > > > > > > > > > addresses will be updated on
> > > > > > > > > > > > > > > > next modify_virtqueue.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Signed-off-by: Dragos Tatulea 
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > Reviewed-by: Gal Pressman 
> > > > > > > > > > > > > > > > Acked-by: Eugenio Pérez 
> > > > > > > > > > > > > > > I'm kind of ok with this patch and the next one 
> > > > > > > > > > > > > > > about state, but I
> > > > > > > > > > > > > > > didn't ack them in the previous series.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > My main concern is that it is not valid to change 
> > > > > > > > > > > > > > > the vq address after
> > > > > > > > > > > > > > > DRIVER_OK in VirtIO, which vDPA follows. Only 
> > > > > > > > > > > > > > > memory maps are ok to
> > > > > > > > > > > > > > > change at this moment. I'm not sure about vq 
> > > > > > > > > > > > > > > state in vDPA, but vhost
> > > > > > > > > > > > > > > forbids changing it with an active backend.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > Suspend is not defined in VirtIO at this moment 
> > > > > > > > > > > > > > > though, so maybe it is
> > > > > > > > > > > > > > > ok to decide that all of these parameters may 
> > > > > > > > > > > > > > > change during suspend.
> > > > > > > > > > > > > > > Maybe the best thing is to protect this with a 
> > > > > > > > > > > > > > > vDPA feature flag.
> > > > > > > > > > > > > > I think protect with vDPA feature flag could work, 
> > > > > > > > > > > > > > while on the other
> > > > > > > > > > > > > > hand vDPA means vendor specific optimization is 
> > > > > > > > > > > > > > possible around suspend
> > > > > > > > > > > > > > and resume (in case it helps performance), which 
> > > > > > > > > > > > > > doesn't have to be
> > > > > > > > > > > > > > backed by virtio spec. Same applies to vhost user 
> > > > > > > > > > > > > > backend features,
> > > > > > > > > > > > > > variations there were not backed by spec either. Of 
> > > > > > > > > > > > > > course, we should
> > > > > > > > > > > > > > try best to make the default behavior backward 
> > > > > > > > > > > > > > compatible with
> > > > > > > > > > > > > > virtio-based backend, but that circles back to no 
> > > > > > > > > > > > > > suspend definition in
> > > > > > > > > > > > > > the current virtio spec, for which I hope we don't 
> > > > > > > > > > > > > > cease development on
> > > > > > > > > > > > > > vDPA indefinitely. After all, the virtio based vdap 
> > > > > > > > > > > > > > backend can well
> > > > > > > > > > > > > > define its own feature flag to describe (minor 
> > > > > > > > > > > > > > difference in) the
> > > > > > > > > > > > > > suspend behavior based on the later spec once it is 
> > > > > > > > > > > > > > formed in future.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > So what is the way forward here? From what I 
> > > > > > > > > > > > > understand the options are:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 1) Add a vdpa feature flag for changing device 
> > > > > > > > > > > > > properties while suspended.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 2) Drop these 2 patches from the series for now. Not 
> > > > > > > > > > > > > sure if this makes sense as
> > > > > > > > > > > > > this. But then Si-Wei's qemu device suspend/resume 
> > > > > > > > > > > > > poc [0] that exercises this
> > > > > > > > > > > > > code won't work anymore. Th

Re: [PATCH v4 2/2] leds: qcom-lpg: Add PM660L configuration and compatible

2023-12-19 Thread Marijn Suijten

On 2023-12-19 10:33:25, Johan Hovold wrote:
> On Tue, Dec 19, 2023 at 10:17:16AM +0100, Marijn Suijten wrote:
> 
> > Note that I have one more unmerged leds patch around, that hasn't been 
> > looked
> > at either.  Would it help to send this once again, perhaps with more 
> > reviewers/
> > testing (Johan, would you mind taking a look too)?
> > 
> > https://lore.kernel.org/linux-leds/20220719213034.1664056-1-marijn.suij...@somainline.org/
> 
> Yes, I suggest you resend that one too so that it ends up in Lee's
> inbox.

I will rebase, test and resend it too.  Just asking if you notice any glaring
issues with this patch, as it won't be the first time it has been resent after
not being looked at for some time.

- Marijn

Re: [PATCH v4 2/2] leds: qcom-lpg: Add PM660L configuration and compatible

2023-12-19 Thread Marijn Suijten

Hi Johan and Lee,

On 2023-12-19 09:22:43, Johan Hovold wrote:
> Hi Marijn and Lee,
> 
> On Tue, Jul 19, 2022 at 11:18:48PM +0200, Marijn Suijten wrote:
> > Inherit PM660L PMIC LPG/triled block configuration from downstream
> > drivers and DT sources, consisting of a triled block with automatic
> > trickle charge control and source selection, three colored led channels
> > belonging to the synchronized triled block and one loose PWM channel.
> > 
> > Signed-off-by: Marijn Suijten 
> > Reviewed-by: Bjorn Andersson 
> > ---
> > 
> > Changes since v3:
> > - Rebased on -next;
> > - (series) dropped DTS patches that have been applied through the
> >   Qualcomm DTS tree, leaving only leds changes (driver and
> >   accompanying dt-bindings).
> 
> > diff --git a/drivers/leds/rgb/leds-qcom-lpg.c 
> > b/drivers/leds/rgb/leds-qcom-lpg.c
> > index 02f51cc61837..102ab0c33887 100644
> > --- a/drivers/leds/rgb/leds-qcom-lpg.c
> > +++ b/drivers/leds/rgb/leds-qcom-lpg.c
> > @@ -1304,6 +1304,23 @@ static int lpg_remove(struct platform_device *pdev)
> > return 0;
> >  }
> >  
> > +static const struct lpg_data pm660l_lpg_data = {
> > +   .lut_base = 0xb000,
> > +   .lut_size = 49,
> > +
> > +   .triled_base = 0xd000,
> > +   .triled_has_atc_ctl = true,
> > +   .triled_has_src_sel = true,
> > +
> > +   .num_channels = 4,
> > +   .channels = (const struct lpg_channel_data[]) {
> > +   { .base = 0xb100, .triled_mask = BIT(5) },
> > +   { .base = 0xb200, .triled_mask = BIT(6) },
> > +   { .base = 0xb300, .triled_mask = BIT(7) },
> > +   { .base = 0xb400 },
> > +   },
> > +};
> > +
> >  static const struct lpg_data pm8916_pwm_data = {
> > .num_channels = 1,
> > .channels = (const struct lpg_channel_data[]) {
> > @@ -1424,6 +1441,7 @@ static const struct lpg_data pm8350c_pwm_data = {
> >  };
> >  
> >  static const struct of_device_id lpg_of_table[] = {
> > +   { .compatible = "qcom,pm660l-lpg", .data = &pm660l_lpg_data },
> > { .compatible = "qcom,pm8150b-lpg", .data = &pm8150b_lpg_data },
> > { .compatible = "qcom,pm8150l-lpg", .data = &pm8150l_lpg_data },
> > { .compatible = "qcom,pm8350c-pwm", .data = &pm8350c_pwm_data },
> 
> When reviewing the Qualcomm SPMI PMIC bindings I noticed that this patch
> was never picked up by the LEDs maintainer, while the binding and dtsi
> changes made it in:
> 
>   
> https://lore.kernel.org/r/20220719211848.1653920-2-marijn.suij...@somainline.org
> 
> Looks like it may still apply cleanly, but otherwise, would you mind
> rebasing and resending so that Lee can pick this one up?
> 
> Johan

Coincidentally I haven't touched this device/platform for months... until last
weekend where I noticed the same.  It does not apply cleanly and I had to solve
some conflicts:

https://github.com/SoMainline/linux/commit/8ec5d02eaffcec24fcab6a989ab117a5b72b96b6

I'll gladly resend this!

Note that I have one more unmerged leds patch around, that hasn't been looked
at either.  Would it help to send this once again, perhaps with more reviewers/
testing (Johan, would you mind taking a look too)?

https://lore.kernel.org/linux-leds/20220719213034.1664056-1-marijn.suij...@somainline.org/

Thanks!

- Marijn

Re: [PATCH v4 2/2] leds: qcom-lpg: Add PM660L configuration and compatible

2023-12-19 Thread Johan Hovold

On Tue, Dec 19, 2023 at 10:17:16AM +0100, Marijn Suijten wrote:

> Note that I have one more unmerged leds patch around, that hasn't been looked
> at either.  Would it help to send this once again, perhaps with more 
> reviewers/
> testing (Johan, would you mind taking a look too)?
> 
> https://lore.kernel.org/linux-leds/20220719213034.1664056-1-marijn.suij...@somainline.org/

Yes, I suggest you resend that one too so that it ends up in Lee's
inbox.

Johan

Re: [PATCH 0/2] kmod /usr support

2023-12-19 Thread Masahiro Yamada

On Thu, Dec 7, 2023 at 3:37 AM Lucas De Marchi  wrote:
>
> On Fri, Nov 10, 2023 at 01:13:53PM +0100, Michal Suchanek wrote:
> >Hello,
> >
> >This is resend of the last patch in the series that adds prefix support
> >to kernel module location together with additional patch for validating
> >the user supplied input to options that are interpreted as directories.
> >
> >Thanks
>
> applied, thanks
>
> Lucas De Marchi

If I understood this correctly, MODULE_DIRECTORY is determined
by "configure --with-module-directory=...", and there is no
way to change it after that.

If so, how to work with cross-building?

Cross-building is typical when building embedded Linux systems.

Consider this scenario:

- Your build machine adopts
MODULE_DIRECTORY=/usr/lib/modules
- The target embedded system adopts
MODULE_DIRECTORY=/lib/modules

(or vice a versa)

depmod is used also for cross-building because
it is executed as a part of "make module_install".

The counterpart patch set for Kbuild provides
KERNEL_MODULE_DIRECTORY, which only changes
the destination directory to which *.ko are copied.

You cannot change the directory where the
depmod searches for modules, as it is fixed
at the compile-time of kmod.

In this case, what we can do is to build another
instance of kmod configured for the target system,
and use it for modules_install:

1. In the kmod source directory
./configure --with=module-directory=/lib/modules
make

2. make modules_install INSTALL_MOD_PATH=
 KERNEL_MODULE_DIRECTORY=/lib/modules
 DEPMOD=

If you use OpenEmbedded etc., this is what you do
because host tools are built from sources.

But, should it be required all the time?
Even when the target embedded system uses
busybox-based modprobe instead of kmod?

depmod provides --basedir option, which changes
the prefix part, but there is no way to override
the stem part, MODULE_DIRECTRY.

In the review of the counter patch set,
I am suggesting an option to override MODULE_DIRECTRY
(let's say --moduledir) at least for depmod.

(Perhaps modinfo too, as it also supports --basedir)

Then, we can change scripts/depmod.sh so that
Kbuild can propagate KERNEL_MODULE_DIRECTORY
to depmod.

if  ; then
set -- "$@"  --moduledir "${KERNEL_MODULE_DIRECTORY}"
fi

Does it make sense?

-- 
Best Regards
Masahiro Yamada

Re: [PATCH v4 2/2] leds: qcom-lpg: Add PM660L configuration and compatible

2023-12-19 Thread Johan Hovold

Hi Marijn and Lee,

On Tue, Jul 19, 2022 at 11:18:48PM +0200, Marijn Suijten wrote:
> Inherit PM660L PMIC LPG/triled block configuration from downstream
> drivers and DT sources, consisting of a triled block with automatic
> trickle charge control and source selection, three colored led channels
> belonging to the synchronized triled block and one loose PWM channel.
> 
> Signed-off-by: Marijn Suijten 
> Reviewed-by: Bjorn Andersson 
> ---
> 
> Changes since v3:
> - Rebased on -next;
> - (series) dropped DTS patches that have been applied through the
>   Qualcomm DTS tree, leaving only leds changes (driver and
>   accompanying dt-bindings).

> diff --git a/drivers/leds/rgb/leds-qcom-lpg.c 
> b/drivers/leds/rgb/leds-qcom-lpg.c
> index 02f51cc61837..102ab0c33887 100644
> --- a/drivers/leds/rgb/leds-qcom-lpg.c
> +++ b/drivers/leds/rgb/leds-qcom-lpg.c
> @@ -1304,6 +1304,23 @@ static int lpg_remove(struct platform_device *pdev)
>   return 0;
>  }
>  
> +static const struct lpg_data pm660l_lpg_data = {
> + .lut_base = 0xb000,
> + .lut_size = 49,
> +
> + .triled_base = 0xd000,
> + .triled_has_atc_ctl = true,
> + .triled_has_src_sel = true,
> +
> + .num_channels = 4,
> + .channels = (const struct lpg_channel_data[]) {
> + { .base = 0xb100, .triled_mask = BIT(5) },
> + { .base = 0xb200, .triled_mask = BIT(6) },
> + { .base = 0xb300, .triled_mask = BIT(7) },
> + { .base = 0xb400 },
> + },
> +};
> +
>  static const struct lpg_data pm8916_pwm_data = {
>   .num_channels = 1,
>   .channels = (const struct lpg_channel_data[]) {
> @@ -1424,6 +1441,7 @@ static const struct lpg_data pm8350c_pwm_data = {
>  };
>  
>  static const struct of_device_id lpg_of_table[] = {
> + { .compatible = "qcom,pm660l-lpg", .data = &pm660l_lpg_data },
>   { .compatible = "qcom,pm8150b-lpg", .data = &pm8150b_lpg_data },
>   { .compatible = "qcom,pm8150l-lpg", .data = &pm8150l_lpg_data },
>   { .compatible = "qcom,pm8350c-pwm", .data = &pm8350c_pwm_data },

When reviewing the Qualcomm SPMI PMIC bindings I noticed that this patch
was never picked up by the LEDs maintainer, while the binding and dtsi
changes made it in:


https://lore.kernel.org/r/20220719211848.1653920-2-marijn.suij...@somainline.org

Looks like it may still apply cleanly, but otherwise, would you mind
rebasing and resending so that Lee can pick this one up?

Johan

95 matches

Mail list logo