Re: [PATCH v4] virtio_pmem: support feature SHMEM_REGION

2023-12-26 Thread Jason Wang
On Thu, Dec 21, 2023 at 4:49 AM Changyuan Lyu  wrote:
>
> Thanks Michael for the feedback!
>
> On Tue, Dec 19, 2023 at 11:44 PM Michael S. Tsirkin  wrote:
> >
> > > On Tue, Dec 19, 2023 at 11:32:27PM -0800, Changyuan Lyu wrote:
> > >
> > > +   if (!have_shm) {
> > > +   dev_err(&vdev->dev, "failed to get shared memory region %d\n",
> > > +           VIRTIO_PMEM_SHMEM_REGION_ID);
> > > +   err = -ENXIO;
> > > +   goto out_vq;
> > > +   }
> >
> > Maybe additionally, add a validate callback and clear
> > VIRTIO_PMEM_F_SHMEM_REGION if VIRTIO_PMEM_SHMEM_REGION_ID is not there.
>
> Done.
>
> > > +/* Feature bits */
> > > +#define VIRTIO_PMEM_F_SHMEM_REGION 0   /* guest physical address range will be
> > > +                                        * indicated as shared memory region 0
> > > +                                        */
> >
> > Either make this comment shorter to fit in one line, or put the
> > multi-line comment before the define.
>
> Done.
>
> ---8<---
>
> This patch adds the support for feature VIRTIO_PMEM_F_SHMEM_REGION
> (virtio spec v1.2 section 5.19.5.2 [1]).
>
> During feature negotiation, if VIRTIO_PMEM_F_SHMEM_REGION is offered
> by the device, the driver looks for a shared memory region of id 0.
> If it is found, this feature is understood. Otherwise, this feature
> bit is cleared.
>
> During probe, if VIRTIO_PMEM_F_SHMEM_REGION has been negotiated,
> virtio pmem ignores the `start` and `size` fields in device config
> and uses the physical address range of shared memory region 0.
>
> [1] https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html#x1-6480002
>
> Signed-off-by: Changyuan Lyu 

Acked-by: Jason Wang 

Thanks

> ---
> v4:
>   * added virtio_pmem_validate callback.
> v3:
>   * updated the patch description.
> v2:
>   * renamed VIRTIO_PMEM_SHMCAP_ID to VIRTIO_PMEM_SHMEM_REGION_ID
>   * fixed the error handling when region 0 does not exist
> ---
>  drivers/nvdimm/virtio_pmem.c | 36 
>  include/uapi/linux/virtio_pmem.h |  7 +++
>  2 files changed, 39 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
> index a92eb172f0e7..4ceced5cefcf 100644
> --- a/drivers/nvdimm/virtio_pmem.c
> +++ b/drivers/nvdimm/virtio_pmem.c
> @@ -29,12 +29,27 @@ static int init_vq(struct virtio_pmem *vpmem)
> return 0;
>  };
>
> +static int virtio_pmem_validate(struct virtio_device *vdev)
> +{
> +   struct virtio_shm_region shm_reg;
> +
> +   if (virtio_has_feature(vdev, VIRTIO_PMEM_F_SHMEM_REGION) &&
> +       !virtio_get_shm_region(vdev, &shm_reg, (u8)VIRTIO_PMEM_SHMEM_REGION_ID)
> +   ) {
> +           dev_notice(&vdev->dev, "failed to get shared memory region %d\n",
> +                      VIRTIO_PMEM_SHMEM_REGION_ID);
> +   __virtio_clear_bit(vdev, VIRTIO_PMEM_F_SHMEM_REGION);
> +   }
> +   return 0;
> +}
> +
>  static int virtio_pmem_probe(struct virtio_device *vdev)
>  {
> struct nd_region_desc ndr_desc = {};
> struct nd_region *nd_region;
> struct virtio_pmem *vpmem;
> struct resource res;
> +   struct virtio_shm_region shm_reg;
> int err = 0;
>
> if (!vdev->config->get) {
> @@ -57,10 +72,16 @@ static int virtio_pmem_probe(struct virtio_device *vdev)
> goto out_err;
> }
>
> -   virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
> -                   start, &vpmem->start);
> -   virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
> -                   size, &vpmem->size);
> +   if (virtio_has_feature(vdev, VIRTIO_PMEM_F_SHMEM_REGION)) {
> +   virtio_get_shm_region(vdev, &shm_reg, (u8)VIRTIO_PMEM_SHMEM_REGION_ID);
> +   vpmem->start = shm_reg.addr;
> +   vpmem->size = shm_reg.len;
> +   } else {
> +   virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
> +                   start, &vpmem->start);
> +   virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
> +                   size, &vpmem->size);
> +   }
>
> res.start = vpmem->start;
> res.end   = vpmem->start + vpmem->size - 1;
> @@ -122,10 +143,17 @@ static void virtio_pmem_remove(struct virtio_device *vdev)
> virtio_reset_device(vdev);
>  }
>
> +static unsigned int features[] = {
> +   VIRTIO_PMEM_F_SHMEM_REGION,
> +};
> +
>  static struct virtio_driver virtio_pmem_driver = {
> +   .feature_table  = features,
> +   .feature_table_size = ARRAY_SIZE(features),
> .driver.name= KBUILD_MODNAME,
> .driver.owner   = THIS_MODULE,
> .id_table   = id_table,
> +   .validate   = virtio_pmem_validate,
> .probe  = virtio_pmem_probe,
> .remove = virtio_pmem_remove,

Re: [PATCH V1] net: qrtr: ns: Ignore ENODEV failures in ns

2023-12-26 Thread Chris Lew




On 12/23/2023 5:56 AM, Simon Horman wrote:

[Dropped bjorn.anders...@kernel.org, as the correct address seems
  to be anders...@kernel.org, which is already in the CC list.
  kernel.org rejected sending this email without that update.]

On Thu, Dec 21, 2023 at 03:36:50PM +0530, Sarannya S wrote:

From: Chris Lew 

Ignore the ENODEV failures returned by kernel_sendmsg(). These errors
indicate that either the local port has been closed or the remote has
gone down. Neither of these scenarios is fatal; both will eventually be
handled through packets that are later queued on the control port.

Signed-off-by: Chris Lew 
Signed-off-by: Sarannya Sasikumar 
---
  net/qrtr/ns.c | 11 +++
  1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/qrtr/ns.c b/net/qrtr/ns.c
index abb0c70..8234339 100644
--- a/net/qrtr/ns.c
+++ b/net/qrtr/ns.c
@@ -157,7 +157,7 @@ static int service_announce_del(struct sockaddr_qrtr *dest,
msg.msg_namelen = sizeof(*dest);
  
  	ret = kernel_sendmsg(qrtr_ns.sock, &msg, &iv, 1, sizeof(pkt));

-   if (ret < 0)
+   if (ret < 0 && ret != -ENODEV)
pr_err("failed to announce del service\n");
  
  	return ret;


Hi,

The caller of service_announce_del() ignores its return value.
So the only action on error is the pr_err() call above, and so
with this patch -ENODEV is indeed ignored.

However, I wonder if it would make things clearer to the reader (me?)
if the return type of service_announce_del was updated to void. Because
as things stand -ENODEV may be returned, which implies something might
handle that, even though nothing does.

The above notwithstanding, this change looks good to me.

Reviewed-by: Simon Horman 

...


Hi Simon, thanks for the review and suggestion. We weren't sure whether
we should change the function prototype in these patches, on the chance
that something will listen for and handle this error in the future. I
think it's a good idea to change it to void, and we can change it back
if such a use case appears later.
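
For reference, a rough sketch of what that follow-up might look like
(signature and tail of the function only, body elided; not an actual patch):

	static void service_announce_del(struct sockaddr_qrtr *dest,
					 struct qrtr_server *srv)
	{
		/* ... build and send the QRTR_TYPE_DEL_SERVER packet as today ... */

		ret = kernel_sendmsg(qrtr_ns.sock, &msg, &iv, 1, sizeof(pkt));
		if (ret < 0 && ret != -ENODEV)
			pr_err("failed to announce del service\n");
		/* no return statement: the only caller ignores the result anyway */
	}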




Re: [PATCH] ring-buffer: Fix wake ups when buffer_percent is set to 100

2023-12-26 Thread Google
On Tue, 26 Dec 2023 12:59:02 -0500
Steven Rostedt  wrote:

> From: "Steven Rostedt (Google)" 
> 
> The tracefs file "buffer_percent" is to allow user space to set a
> water-mark on how much of the tracing ring buffer needs to be filled in
> order to wake up a blocked reader.
> 
>  0 - is to wait until any data is in the buffer
>  1 - is to wait for 1% of the sub buffers to be filled
> >  50 - is to wait until half of the sub buffers are filled with data
>  100 - is not to wake the waiter until the ring buffer is completely full
> 
> Unfortunately the test for being full was:
> 
>   dirty = ring_buffer_nr_dirty_pages(buffer, cpu);
>   return (dirty * 100) > (full * nr_pages);
> 
> Where "full" is the value for "buffer_percent".
> 
> > There are two issues with the above when full == 100.
> 
> 1. dirty * 100 > 100 * nr_pages will never be true
>That is, the above is basically saying that if the user sets
>buffer_percent to 100, more pages need to be dirty than exist in the
>ring buffer!
> 
> 2. The page that the writer is on is never considered dirty, as dirty
>pages are only those that are full. When the writer goes to a new
>sub-buffer, it clears the contents of that sub-buffer.
> 
> > That is, even if the check were ">=" it would still never trigger, as the
> > maximum number of pages that can be considered "dirty" is nr_pages - 1.
> 
> To fix this, add one to dirty and use ">=" in the compare.
> 
> Cc: sta...@vger.kernel.org
> Fixes: 03329f9939781 ("tracing: Add tracefs file buffer_percentage")
> Signed-off-by: Steven Rostedt (Google) 
> ---
>  kernel/trace/ring_buffer.c | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index 83eab547f1d1..32c0dd2fd1c3 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -881,9 +881,14 @@ static __always_inline bool full_hit(struct trace_buffer *buffer, int cpu, int f
>   if (!nr_pages || !full)
>   return true;
>  
> - dirty = ring_buffer_nr_dirty_pages(buffer, cpu);
> + /*
> +  * Add one as dirty will never equal nr_pages, as the sub-buffer
> +  * that the writer is on is not counted as dirty.
> +  * This is needed if "buffer_percent" is set to 100.
> +  */
> + dirty = ring_buffer_nr_dirty_pages(buffer, cpu) + 1;

Is this "+ 1" required? If we have 200 pages and 1 buffer is dirty,
it is 0.5% dirty. Consider @full = 1%.

@dirty = 1 + 1 = 2 and @dirty * 100 == 200, but
@full * @nr_pages = 1 * 200 = 200.
Thus it hits (200 >= 200 is true) even though only 0.5% of the pages are dirty.
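
For concreteness, a throwaway userspace model of the two comparisons (not
the kernel function itself), evaluated with the numbers above and with the
full == 100 case the patch targets:

	#include <stdbool.h>
	#include <stdio.h>

	/* Pre-patch check: strict ">" and no "+ 1". */
	static bool hit_old(unsigned long dirty, unsigned long full, unsigned long nr_pages)
	{
		return (dirty * 100) > (full * nr_pages);
	}

	/* Patched check: "+ 1" on dirty and ">=". */
	static bool hit_new(unsigned long dirty, unsigned long full, unsigned long nr_pages)
	{
		return ((dirty + 1) * 100) >= (full * nr_pages);
	}

	int main(void)
	{
		/* full == 100: the old check can never fire, dirty tops out at nr_pages - 1 */
		printf("full=100, dirty=199/200: old=%d new=%d\n",
		       hit_old(199, 100, 200), hit_new(199, 100, 200));
		/* full == 1: with the "+ 1", a single dirty page (0.5%) already fires */
		printf("full=1,   dirty=1/200:   old=%d new=%d\n",
		       hit_old(1, 1, 200), hit_new(1, 1, 200));
		return 0;
	}

So the "+ 1" is what allows full == 100 to ever trigger; the cost is that a
fraction of a sub-buffer is rounded up to a whole one, which is exactly the
behavior being questioned here.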

>  
> - return (dirty * 100) > (full * nr_pages);
> + return (dirty * 100) >= (full * nr_pages);

Thank you,

>  }
>  
>  /*
> -- 
> 2.42.0
> 


-- 
Masami Hiramatsu (Google) 



Re: [PATCH v2] vhost-vdpa: account iommu allocations

2023-12-26 Thread David Rientjes
On Tue, 26 Dec 2023, Pasha Tatashin wrote:

> iommu allocations should be accounted in order to allow admins to
> monitor and limit the amount of iommu memory.
> 
> Signed-off-by: Pasha Tatashin 
> Acked-by: Michael S. Tsirkin 

Acked-by: David Rientjes 



[PATCH v2] vhost-vdpa: account iommu allocations

2023-12-26 Thread Pasha Tatashin
iommu allocations should be accounted in order to allow admins to
monitor and limit the amount of iommu memory.

Signed-off-by: Pasha Tatashin 
Acked-by: Michael S. Tsirkin 
---
 drivers/vhost/vdpa.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Changelog:

v1:
This patch is spun off from the series:
https://lore.kernel.org/all/20231128204938.1453583-1-pasha.tatas...@soleen.com

v2:
- Synced with v6.7-rc7
- Added Acked-by Michael S. Tsirkin.

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index da7ec77cdaff..a51c69c078d9 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -968,7 +968,8 @@ static int vhost_vdpa_map(struct vhost_vdpa *v, struct vhost_iotlb *iotlb,
r = ops->set_map(vdpa, asid, iotlb);
} else {
r = iommu_map(v->domain, iova, pa, size,
- perm_to_iommu_flags(perm), GFP_KERNEL);
+ perm_to_iommu_flags(perm),
+ GFP_KERNEL_ACCOUNT);
}
if (r) {
vhost_iotlb_del_range(iotlb, iova, iova + size - 1);
-- 
2.43.0.472.g3155946c3a-goog
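
For context, GFP_KERNEL_ACCOUNT is GFP_KERNEL plus __GFP_ACCOUNT: allocations
made with it are charged to the calling task's memory cgroup, which is what
makes the "monitor and limit" part above possible. A minimal illustration of
the general pattern (hypothetical helper, not part of this patch):

	#include <linux/slab.h>

	/*
	 * Illustration only: an allocation that userspace can trigger repeatedly
	 * is made with GFP_KERNEL_ACCOUNT so it is charged to the caller's memcg
	 * and can be bounded with cgroup memory limits.
	 */
	static void *user_triggered_alloc(size_t len)
	{
		return kzalloc(len, GFP_KERNEL_ACCOUNT);
	}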




[PATCH] ring-buffer: Fix wake ups when buffer_percent is set to 100

2023-12-26 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The tracefs file "buffer_percent" is to allow user space to set a
water-mark on how much of the tracing ring buffer needs to be filled in
order to wake up a blocked reader.

 0 - is to wait until any data is in the buffer
 1 - is to wait for 1% of the sub buffers to be filled
 50 - is to wait until half of the sub buffers are filled with data
 100 - is not to wake the waiter until the ring buffer is completely full

Unfortunately the test for being full was:

dirty = ring_buffer_nr_dirty_pages(buffer, cpu);
return (dirty * 100) > (full * nr_pages);

Where "full" is the value for "buffer_percent".

There are two issues with the above when full == 100.

1. dirty * 100 > 100 * nr_pages will never be true
   That is, the above is basically saying that if the user sets
   buffer_percent to 100, more pages need to be dirty than exist in the
   ring buffer!

2. The page that the writer is on is never considered dirty, as dirty
   pages are only those that are full. When the writer goes to a new
   sub-buffer, it clears the contents of that sub-buffer.

That is, even if the check were ">=" it would still never trigger, as the
maximum number of pages that can be considered "dirty" is nr_pages - 1.

To fix this, add one to dirty and use ">=" in the compare.

Cc: sta...@vger.kernel.org
Fixes: 03329f9939781 ("tracing: Add tracefs file buffer_percentage")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index 83eab547f1d1..32c0dd2fd1c3 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -881,9 +881,14 @@ static __always_inline bool full_hit(struct trace_buffer *buffer, int cpu, int f
if (!nr_pages || !full)
return true;
 
-   dirty = ring_buffer_nr_dirty_pages(buffer, cpu);
+   /*
+* Add one as dirty will never equal nr_pages, as the sub-buffer
+* that the writer is on is not counted as dirty.
+* This is needed if "buffer_percent" is set to 100.
+*/
+   dirty = ring_buffer_nr_dirty_pages(buffer, cpu) + 1;
 
-   return (dirty * 100) > (full * nr_pages);
+   return (dirty * 100) >= (full * nr_pages);
 }
 
 /*
-- 
2.42.0




Re: [PATCH v5 06/34] function_graph: Allow multiple users to attach to function graph

2023-12-26 Thread Google
Hi,

On Wed, 20 Dec 2023 00:45:40 +0900
Masami Hiramatsu (Google)  wrote:

> OK, I think we need a "rsrv_ret_stack" index. Basically the new one will do:
> 
> (1) increment rsrv_ret_stack
> (2) write a reserve type entry
> (3) set curr_ret_stack = rsrv_ret_stack
> 
> And before those,
> 
> (0) if rsrv_ret_stack != curr_ret_stack, write a reserve type entry at
> rsrv_ret_stack for the previous frame (whose offset can be read
> from curr_ret_stack)
> 
> Then it will never be broken.
> (of course, when curr_ret_stack is decremented, rsrv_ret_stack is also decremented)
> 

So here is an additional patch for this issue. I'll make v6 with this.

Thanks,

From 4da1ec7b679052a131ecdeebd2e1a9db767c5c24 Mon Sep 17 00:00:00 2001
From: Masami Hiramatsu 
Date: Wed, 27 Dec 2023 00:09:09 +0900
Subject: [PATCH] function_graph: Improve push operation for several interrupts

Improve the push and data-reserve operations on the shadow stack for
several sequential interrupts.

To push a ret_stack or data entry on the shadow stack, we need to
prepare an index (offset) entry before updating the stack pointer
(curr_ret_stack) so that an unwinder running from interrupts can find
the next return address on the shadow stack. Currently we write the
index, update curr_ret_stack, and rewrite the index again. But that is
not enough for the case where two interrupts happen and the first one corrupts it.
For example,

 1. write reserved index entry at ret_stack[new_index - 1] and ret addr.
 2. interrupt comes.
2.1. push new index and ret addr on ret_stack.
2.2. pop it. (corrupt entries on new_index - 1)
 3. return from interrupt.
 4. update curr_ret_stack = new_index
 5. interrupt comes again.
5.1. unwind <-- may not work.

To avoid this issue, this introduces a new rsrv_ret_stack stack
reservation pointer and a new push code path (slow path) that forcibly
commits the previously reserved entry.

 0. update rsrv_ret_stack = new_index.
 1. write reserved index entry at ret_stack[new_index - 1] and ret addr.
 2. interrupt comes.
2.0. if rsrv_ret_stack != curr_ret_stack, add a reserved index
entry at ret_stack[rsrv_ret_stack - 1] pointing to the previous
ret_stack entry referenced by ret_stack[curr_ret_stack - 1], and
update curr_ret_stack = rsrv_ret_stack.
2.1. push new index and ret addr on ret_stack.
2.2. pop it. (corrupt entries on new_index - 1)
 3. return from interrupt.
 4. update curr_ret_stack = new_index
 5. interrupt comes again.
5.1. unwind works, because curr_ret_stack points to the previously
saved ret_stack.
5.2. this can do push/pop operations too.
6. return from interrupt.
7. rewrite reserved index entry at ret_stack[new_index] again.

This may be a bit heavier, but it is safer.

Signed-off-by: Masami Hiramatsu (Google) 
---
 include/linux/sched.h |   1 +
 kernel/trace/fgraph.c | 133 ++
 2 files changed, 97 insertions(+), 37 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4dab30f00211..fda551e1aade 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1387,6 +1387,7 @@ struct task_struct {
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
/* Index of current stored address in ret_stack: */
int curr_ret_stack;
+   int rsrv_ret_stack;
int curr_ret_depth;
 
/* Stack of return addresses for return function tracing: */
diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index 47f389834b50..bf7a6eebff75 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -298,31 +298,47 @@ void *fgraph_reserve_data(int idx, int size_bytes)
unsigned long val;
void *data;
int curr_ret_stack = current->curr_ret_stack;
+   int rsrv_ret_stack = current->rsrv_ret_stack;
int data_size;
 
if (size_bytes > FGRAPH_MAX_DATA_SIZE)
return NULL;
 
+   /*
+* Since this API is used after pushing ret_stack, curr_ret_stack
+* should be synchronized with rsrv_ret_stack.
+*/
+   if (WARN_ON_ONCE(curr_ret_stack != rsrv_ret_stack))
+   return NULL;
+
/* Convert to number of longs + data word */
data_size = DIV_ROUND_UP(size_bytes, sizeof(long));
 
val = get_fgraph_entry(current, curr_ret_stack - 1);
data = &current->ret_stack[curr_ret_stack];
 
-   curr_ret_stack += data_size + 1;
-   if (unlikely(curr_ret_stack >= SHADOW_STACK_MAX_INDEX))
+   rsrv_ret_stack += data_size + 1;
+   if (unlikely(rsrv_ret_stack >= SHADOW_STACK_MAX_INDEX))
return NULL;
 
val = make_fgraph_data(idx, data_size, __get_index(val) + data_size + 1);
 
-   /* Set the last word to be reserved */
-   current->ret_stack[curr_ret_stack - 1] = val;
-
-   /* Make sure interrupts see this */
+   /* Extend the reserved-ret_stack at first */
+   current->rsrv_ret_stack = rsrv_ret_stack;
+   /* And sync
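
For orientation, a rough condensation of the reserve-then-commit push
ordering described in the commit message above (illustrative helper names
only, not the actual fgraph code):

	/* Sketch of the push ordering (steps 0, 1 and 4 above), plus the
	 * slow-path commit (step 2.0) that runs when an interrupt finds an
	 * uncommitted reservation left by the interrupted context.
	 */
	if (current->rsrv_ret_stack != current->curr_ret_stack) {
		/* 2.0: publish the interrupted context's reservation first */
		fixup_reserved_entry(current);		/* hypothetical helper */
		current->curr_ret_stack = current->rsrv_ret_stack;
	}

	/* 0: extend the reservation before touching the entries */
	current->rsrv_ret_stack = new_index;
	/* 1: write the reserved index entry and the return address */
	write_reserved_entry(current, new_index);	/* hypothetical helper */
	barrier();
	/* 4: only now make the new frame visible to unwinders */
	current->curr_ret_stack = new_index;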