Re: [PATCH vhost v9 05/12] virtio_ring: split: virtqueue_add_split() support premapped

2023-05-17 Thread Jason Wang
On Wed, May 17, 2023 at 10:23 AM Xuan Zhuo  wrote:
>
> virtqueue_add_split() currently only supports virtual addresses; the DMA
> mapping is done inside virtqueue_add_split() itself.
>
> In some scenarios (such as the AF_XDP scenario), the memory is allocated
> and DMA is completed in advance, so it is necessary for us to support
> passing the DMA address to virtqueue_add_split().
>
> Record this information in desc_state so that we can skip the unmap for
> those descriptors when executing the DMA unmap path.

I would also suggest documenting why per-descriptor metadata is
needed instead of a per-virtqueue flag.

>
> Signed-off-by: Xuan Zhuo 
> ---
>  drivers/virtio/virtio_ring.c | 38 +++-
>  1 file changed, 29 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index e2fc50c05bec..bd5e84afab37 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -70,6 +70,7 @@
>  struct vring_desc_state_split {
> void *data; /* Data for callback. */
> struct vring_desc *indir_desc;  /* Indirect descriptor, if any. */
> +   bool premapped; /* DMA mapping is done by driver. */

Going back to the original discussion around where this should be
placed: I wonder if we can find a common place to store this, since it
has nothing to do with the virtqueue layout. Maybe desc_extra? And it
would be even better if we can avoid stressing the cache like above.
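
E.g. something along these lines, as a rough sketch of the idea (untested;
the existing vring_desc_extra fields plus the new flag):

	struct vring_desc_extra {
		dma_addr_t addr;	/* Descriptor DMA addr. */
		u32 len;		/* Descriptor length. */
		u16 flags;		/* Descriptor flags. */
		u16 next;		/* The next desc state in a list. */
		bool premapped;		/* DMA mapping is done by the driver. */
	};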

>  };
>
>  struct vring_desc_state_packed {
> @@ -356,8 +357,14 @@ static struct device *vring_dma_dev(const struct vring_virtqueue *vq)
>
>  /* Map one sg entry. */
>  static int vring_map_one_sg(const struct vring_virtqueue *vq, struct scatterlist *sg,
> -   enum dma_data_direction direction, dma_addr_t *addr)
> +   enum dma_data_direction direction,
> +   bool premapped, dma_addr_t *addr)

having things like:

int func(bool enabled)
{
        if (!enabled)
                return 0;
        ...
}

is a hint that the check needs to be done by the caller?
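
E.g. a rough sketch of doing the check in the caller instead, reusing
sg_dma_address() from this patch:

	if (premapped)
		addr = sg_dma_address(sg);
	else if (vring_map_one_sg(vq, sg, DMA_TO_DEVICE, &addr))
		goto unmap_release;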

And this change should work for both packed and split. I think we need
to squash the packed changes here.

Looking at how packed virtqueue uses this in this patch, I don't think
this patch can even be built. I will wait for a new version and
continue the review from there.

Thanks



>  {
> +   if (premapped) {
> +   *addr = sg_dma_address(sg);
> +   return 0;
> +   }
> +
> if (!vq->use_dma_api) {
> /*
>  * If DMA is not used, KMSAN doesn't know that the scatterlist
> @@ -445,7 +452,7 @@ static void vring_unmap_one_split_indirect(const struct vring_virtqueue *vq,
>  }
>
>  static unsigned int vring_unmap_one_split(const struct vring_virtqueue *vq,
> - unsigned int i)
> + unsigned int i, bool premapped)
>  {
> struct vring_desc_extra *extra = vq->split.desc_extra;
> u16 flags;
> @@ -462,6 +469,9 @@ static unsigned int vring_unmap_one_split(const struct vring_virtqueue *vq,
>  (flags & VRING_DESC_F_WRITE) ?
>  DMA_FROM_DEVICE : DMA_TO_DEVICE);
> } else {
> +   if (premapped)
> +   goto out;
> +
> dma_unmap_page(vring_dma_dev(vq),
>extra[i].addr,
>extra[i].len,
> @@ -532,6 +542,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>   unsigned int in_sgs,
>   void *data,
>   void *ctx,
> + bool premapped,
>   gfp_t gfp)
>  {
> struct vring_virtqueue *vq = to_vvq(_vq);
> @@ -595,7 +606,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> for (sg = sgs[n]; sg; sg = sg_next(sg)) {
> dma_addr_t addr;
>
> -   if (vring_map_one_sg(vq, sg, DMA_TO_DEVICE, &addr))
> +   if (vring_map_one_sg(vq, sg, DMA_TO_DEVICE, premapped, &addr))
> goto unmap_release;
>
> prev = i;
> @@ -611,7 +622,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> for (sg = sgs[n]; sg; sg = sg_next(sg)) {
> dma_addr_t addr;
>
> -   if (vring_map_one_sg(vq, sg, DMA_FROM_DEVICE, &addr))
> +   if (vring_map_one_sg(vq, sg, DMA_FROM_DEVICE, premapped, &addr))
> goto unmap_release;
>
> prev = i;
> @@ -657,6 +668,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>
> /* Store token and indirect buffer state. */
> vq->split.

Re: [PATCH vhost v9 04/12] virtio_ring: virtqueue_add() support premapped

2023-05-17 Thread Jason Wang
On Wed, May 17, 2023 at 10:23 AM Xuan Zhuo  wrote:
>
> virtqueue_add() adds the parameter premapped.

I wonder if this patch is oversimplified. Maybe it can be squashed
with the patch that implements the premapped logic.

Thanks


>
> Signed-off-by: Xuan Zhuo 
> ---
>  drivers/virtio/virtio_ring.c | 9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 1ffab1eb40c0..e2fc50c05bec 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -2135,6 +2135,7 @@ static inline int virtqueue_add(struct virtqueue *_vq,
> unsigned int in_sgs,
> void *data,
> void *ctx,
> +   bool premapped,
> gfp_t gfp)
>  {
> struct vring_virtqueue *vq = to_vvq(_vq);
> @@ -2176,7 +2177,7 @@ int virtqueue_add_sgs(struct virtqueue *_vq,
> total_sg++;
> }
> return virtqueue_add(_vq, sgs, total_sg, out_sgs, in_sgs,
> -data, NULL, gfp);
> +data, NULL, false, gfp);
>  }
>  EXPORT_SYMBOL_GPL(virtqueue_add_sgs);
>
> @@ -2198,7 +2199,7 @@ int virtqueue_add_outbuf(struct virtqueue *vq,
>  void *data,
>  gfp_t gfp)
>  {
> -   return virtqueue_add(vq, &sg, num, 1, 0, data, NULL, gfp);
> +   return virtqueue_add(vq, &sg, num, 1, 0, data, NULL, false, gfp);
>  }
>  EXPORT_SYMBOL_GPL(virtqueue_add_outbuf);
>
> @@ -2220,7 +2221,7 @@ int virtqueue_add_inbuf(struct virtqueue *vq,
> void *data,
> gfp_t gfp)
>  {
> -   return virtqueue_add(vq, &sg, num, 0, 1, data, NULL, gfp);
> +   return virtqueue_add(vq, &sg, num, 0, 1, data, NULL, false, gfp);
>  }
>  EXPORT_SYMBOL_GPL(virtqueue_add_inbuf);
>
> @@ -2244,7 +2245,7 @@ int virtqueue_add_inbuf_ctx(struct virtqueue *vq,
> void *ctx,
> gfp_t gfp)
>  {
> -   return virtqueue_add(vq, &sg, num, 0, 1, data, ctx, gfp);
> +   return virtqueue_add(vq, &sg, num, 0, 1, data, ctx, false, gfp);
>  }
>  EXPORT_SYMBOL_GPL(virtqueue_add_inbuf_ctx);
>
> --
> 2.32.0.3.g01195cf9f
>


Re: [PATCH vhost v9 03/12] virtio_ring: check use_dma_api before unmap desc for indirect

2023-05-17 Thread Jason Wang
On Wed, May 17, 2023 at 10:23 AM Xuan Zhuo  wrote:
>
> Inside detach_buf_split(), if use_dma_api is false,
> vring_unmap_one_split_indirect() will be called many times but actually
> does nothing. So this patch checks use_dma_api first.
>
> Signed-off-by: Xuan Zhuo 

Acked-by: Jason Wang 

Thanks


> ---
>  drivers/virtio/virtio_ring.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 479203346c36..1ffab1eb40c0 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -783,8 +783,10 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
> VRING_DESC_F_INDIRECT));
> BUG_ON(len == 0 || len % sizeof(struct vring_desc));
>
> -   for (j = 0; j < len / sizeof(struct vring_desc); j++)
> -   vring_unmap_one_split_indirect(vq, &indir_desc[j]);
> +   if (vq->use_dma_api) {
> +   for (j = 0; j < len / sizeof(struct vring_desc); j++)
> +   vring_unmap_one_split_indirect(vq, &indir_desc[j]);
> +   }
>
> kfree(indir_desc);
> state->indir_desc = NULL;
> --
> 2.32.0.3.g01195cf9f
>


Re: [PATCH vhost v9 02/12] virtio_ring: simplify the reference of desc state inside detach_buf_split()

2023-05-17 Thread Jason Wang
On Wed, May 17, 2023 at 10:23 AM Xuan Zhuo  wrote:
>
> The purpose of this is to simplify the reference to the state. It is
> convenient for subsequent commits.

It's better to be verbose, e.g. explain how it simplifies the following patches.

Thanks


>
> Signed-off-by: Xuan Zhuo 
> ---
>  drivers/virtio/virtio_ring.c | 12 +++-
>  1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index c563215be6b9..479203346c36 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -744,11 +744,14 @@ static bool virtqueue_kick_prepare_split(struct virtqueue *_vq)
>  static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
>  void **ctx)
>  {
> +   struct vring_desc_state_split *state;
> unsigned int i, j;
> __virtio16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT);
>
> +   state = &vq->split.desc_state[head];
> +
> /* Clear data ptr. */
> -   vq->split.desc_state[head].data = NULL;
> +   state->data = NULL;
>
> /* Put back on free list: unmap first-level descriptors and find end */
> i = head;
> @@ -767,8 +770,7 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
> vq->vq.num_free++;
>
> if (vq->indirect) {
> -   struct vring_desc *indir_desc =
> -   vq->split.desc_state[head].indir_desc;
> +   struct vring_desc *indir_desc = state->indir_desc;
> u32 len;
>
> /* Free the indirect table, if any, now that it's unmapped. */
> @@ -785,9 +787,9 @@ static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head,
> vring_unmap_one_split_indirect(vq, &indir_desc[j]);
>
> kfree(indir_desc);
> -   vq->split.desc_state[head].indir_desc = NULL;
> +   state->indir_desc = NULL;
> } else if (ctx) {
> -   *ctx = vq->split.desc_state[head].indir_desc;
> +   *ctx = state->indir_desc;
> }
>  }
>
> --
> 2.32.0.3.g01195cf9f
>


Re: [PATCH vhost v9 01/12] virtio_ring: put mapping error check in vring_map_one_sg

2023-05-17 Thread Jason Wang
On Wed, May 17, 2023 at 10:23 AM Xuan Zhuo  wrote:
>
> This patch puts the DMA address error check in vring_map_one_sg().
>
> The benefits of doing this:
>
> 1. makes vring_map_one_sg() simpler, as callers no longer need to call
>    vring_mapping_error() to check the return value.
> 2. removes one check of vq->use_dma_api.

Code looks fine, but it's better to explain how it relates to the rest
of this series.

Thanks


>
> Signed-off-by: Xuan Zhuo 
> ---
>  drivers/virtio/virtio_ring.c | 37 +---
>  1 file changed, 22 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index c5310eaf8b46..c563215be6b9 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -355,9 +355,8 @@ static struct device *vring_dma_dev(const struct vring_virtqueue *vq)
>  }
>
>  /* Map one sg entry. */
> -static dma_addr_t vring_map_one_sg(const struct vring_virtqueue *vq,
> -  struct scatterlist *sg,
> -  enum dma_data_direction direction)
> +static int vring_map_one_sg(const struct vring_virtqueue *vq, struct scatterlist *sg,
> +   enum dma_data_direction direction, dma_addr_t *addr)
>  {
> if (!vq->use_dma_api) {
> /*
> @@ -366,7 +365,8 @@ static dma_addr_t vring_map_one_sg(const struct vring_virtqueue *vq,
>  * depending on the direction.
>  */
> kmsan_handle_dma(sg_page(sg), sg->offset, sg->length, direction);
> -   return (dma_addr_t)sg_phys(sg);
> +   *addr = (dma_addr_t)sg_phys(sg);
> +   return 0;
> }
>
> /*
> @@ -374,9 +374,14 @@ static dma_addr_t vring_map_one_sg(const struct vring_virtqueue *vq,
>  * the way it expects (we don't guarantee that the scatterlist
>  * will exist for the lifetime of the mapping).
>  */
> -   return dma_map_page(vring_dma_dev(vq),
> +   *addr = dma_map_page(vring_dma_dev(vq),
> sg_page(sg), sg->offset, sg->length,
> direction);
> +
> +   if (dma_mapping_error(vring_dma_dev(vq), *addr))
> +   return -ENOMEM;
> +
> +   return 0;
>  }
>
>  static dma_addr_t vring_map_single(const struct vring_virtqueue *vq,
> @@ -588,8 +593,9 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
>
> for (n = 0; n < out_sgs; n++) {
> for (sg = sgs[n]; sg; sg = sg_next(sg)) {
> -   dma_addr_t addr = vring_map_one_sg(vq, sg, DMA_TO_DEVICE);
> -   if (vring_mapping_error(vq, addr))
> +   dma_addr_t addr;
> +
> +   if (vring_map_one_sg(vq, sg, DMA_TO_DEVICE, &addr))
> goto unmap_release;
>
> prev = i;
> @@ -603,8 +609,9 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
> }
> for (; n < (out_sgs + in_sgs); n++) {
> for (sg = sgs[n]; sg; sg = sg_next(sg)) {
> -   dma_addr_t addr = vring_map_one_sg(vq, sg, DMA_FROM_DEVICE);
> -   if (vring_mapping_error(vq, addr))
> +   dma_addr_t addr;
> +
> +   if (vring_map_one_sg(vq, sg, DMA_FROM_DEVICE, &addr))
> goto unmap_release;
>
> prev = i;
> @@ -1279,9 +1286,8 @@ static int virtqueue_add_indirect_packed(struct vring_virtqueue *vq,
>
> for (n = 0; n < out_sgs + in_sgs; n++) {
> for (sg = sgs[n]; sg; sg = sg_next(sg)) {
> -   addr = vring_map_one_sg(vq, sg, n < out_sgs ?
> -   DMA_TO_DEVICE : DMA_FROM_DEVICE);
> -   if (vring_mapping_error(vq, addr))
> +   if (vring_map_one_sg(vq, sg, n < out_sgs ?
> +DMA_TO_DEVICE : DMA_FROM_DEVICE, &addr))
> goto unmap_release;
>
> desc[i].flags = cpu_to_le16(n < out_sgs ?
> @@ -1426,9 +1432,10 @@ static inline int virtqueue_add_packed(struct virtqueue *_vq,
> c = 0;
> for (n = 0; n < out_sgs + in_sgs; n++) {
> for (sg = sgs[n]; sg; sg = sg_next(sg)) {
> -   dma_addr_t addr = vring_map_one_sg(vq, sg, n < out_sgs ?
> -   DMA_TO_DEVICE : DMA_FROM_DEVICE);
> -   if (vring_mapping_error(vq, addr))
> +   dma_addr_t addr;
> +
> +   if (vring_map_one_sg(vq, sg, n < out_sgs ?
> +DMA_TO_DEVICE : DMA_FROM_DEVICE, &addr))
> goto unmap_release;
>
> flags = cpu_to_le16(vq->packed.av

Re: [PATCH v2 2/3] vhost: support PACKED when setting-getting vring_base

2023-05-17 Thread Jason Wang
On Wed, May 17, 2023 at 3:00 PM Stefano Garzarella  wrote:
>
> On Wed, May 17, 2023 at 7:26 AM Jason Wang  wrote:
> >
> > On Wed, May 17, 2023 at 2:26 AM Shannon Nelson  
> > wrote:
> > >
> > > On 5/16/23 12:49 AM, Stefano Garzarella wrote:
> > > > On Mon, May 15, 2023 at 01:41:12PM -0700, Shannon Nelson wrote:
> > > >> On 5/9/23 1:46 AM, Stefano Garzarella wrote:
> > > >>> On Mon, Apr 24, 2023 at 03:50:30PM -0700, Shannon Nelson via
> > > >>> Virtualization wrote:
> > >  Use the right structs for PACKED or split vqs when setting and
> > >  getting the vring base.
> > > 
> > >  Signed-off-by: Shannon Nelson 
> > >  ---
> > >  drivers/vhost/vhost.c | 18 +-
> > >  drivers/vhost/vhost.h |  8 ++--
> > >  2 files changed, 19 insertions(+), 7 deletions(-)
> > > 
> > >  diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > >  index f11bdbe4c2c5..f64efda48f21 100644
> > >  --- a/drivers/vhost/vhost.c
> > >  +++ b/drivers/vhost/vhost.c
> > >  @@ -1633,17 +1633,25 @@ long vhost_vring_ioctl(struct vhost_dev
> > >  *d, unsigned int ioctl, void __user *arg
> > >  		r = -EFAULT;
> > >  		break;
> > >  	}
> > >  -	if (s.num > 0xffff) {
> > >  -		r = -EINVAL;
> > >  -		break;
> > >  +	if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED)) {
> > >  +		vq->last_avail_idx = s.num & 0xffff;
> > >  +		vq->last_used_idx = (s.num >> 16) & 0xffff;
> > >  +	} else {
> > >  +		if (s.num > 0xffff) {
> > >  +			r = -EINVAL;
> > >  +			break;
> > >  +		}
> > >  +		vq->last_avail_idx = s.num;
> > >  	}
> > >  -	vq->last_avail_idx = s.num;
> > >  	/* Forget the cached index value. */
> > >  	vq->avail_idx = vq->last_avail_idx;
> > >  	break;
> > >  case VHOST_GET_VRING_BASE:
> > >  	s.index = idx;
> > >  -	s.num = vq->last_avail_idx;
> > >  +	if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
> > >  +		s.num = (u32)vq->last_avail_idx | ((u32)vq->last_used_idx << 16);
> > >  +	else
> > >  +		s.num = vq->last_avail_idx;
> > > >>>
> > > >>> The changes LGTM, but since we are changing the UAPI, should we
> > > >>> update the documentation of VHOST_SET_VRING_BASE and
> > > >>> VHOST_GET_VRING_BASE in include/uapi/linux/vhost.h?
> > > >>
> > > >> Correct me if I'm wrong, but I don't think we're changing anything in
> > > >> the UAPI here, just fixing code to work correctly with what is already
> > > >> happening.
> > > >
> > > > IIUC before this patch VHOST_GET_VRING_BASE and VHOST_SET_VRING_BASE
> > > > never worked with packed virtqueue, since we were only handling
> > > > last_avail_idx. Now we are supporting packed virtqueue, handling
> > > > in vhost_vring_state.num both last_avail_idx and last_used_idx (with
> > > > wrap counters).
> > > >
> > > > For example, for VHOST_GET_VRING_BASE, where is it documented that the first
> > > > 15 bits are last_avail_idx, the 16th is the avail_wrap_counter, and the
> > > > others are last_used_idx and used_wrap_counter?
> > > >
> > > > Maybe I missed something, but since this is UAPI, IMHO we should
> > > > document the parameters of ioctls at least in
> > > > include/uapi/linux/vhost.h.
> > >
> > > Perhaps Jason already has something written up that could be put in here
> > > from when he first added the wrap_counter a couple of years ago?
> >
> > If you meant the virtio driver support for packed, I think it's
> > different from the context which is vhost here.
> >
> > I agree with Stefano that we need to update the comments around
> > GET_VRING_BASE and SET_VRING_BASE, then we are fine.
>
> I'm thinking if we should also add a new VHOST_BACKEND_F_RING_PACKED
> feature (or something similar) to inform the user space that now we
> are able to handle packed virtqueue through vhost IOCTLs, otherwise
> how can the userspace know if it is supported or not?

I probably understand this, but I think it should be done via
VHOST_GET_FEATURES. It would be a burden if we maintain duplicated
features.

Thanks

>
> Thanks,
> Stefano
>
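
For reference, a sketch of the s.num encoding being discussed (the bit
layout follows Stefano's description above; it is not spelled out in the
patch itself):

	/* VHOST_GET_VRING_BASE with VIRTIO_F_RING_PACKED:
	 *   bits  0..14  last_avail_idx
	 *   bit      15  avail_wrap_counter
	 *   bits 16..30  last_used_idx
	 *   bit      31  used_wrap_counter
	 */
	s.num = (u32)vq->last_avail_idx | ((u32)vq->last_used_idx << 16);

	/* Unpacking on the user space side: */
	u16  avail_idx  = s.num & 0x7fff;
	bool avail_wrap = s.num & 0x8000;
	u16  used_idx   = (s.num >> 16) & 0x7fff;
	bool used_wrap  = s.num & 0x80000000;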


Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set

2023-05-17 Thread Eric W. Biederman


Long story short.

In the patch below the first hunk is a noop.

The code you are bypassing was added to ensure that process termination
(aka SIGKILL) is processed before any other signals.  Other than signal
processing order there are not any substantive differences in the two
code paths.  With all signals except SIGSTOP == 19 and SIGKILL == 9
blocked, SIGKILL should always be processed before SIGSTOP.

Can you try the patch with just the last hunk that does
s/PF_IO_WORKER/PF_USER_WORKER/ and see if that is enough?

I have no objections to the final hunk.

Mike Christie  writes:

> This has us dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is
> set when we are dealing with PF_USER_WORKER tasks.
>
> When a vhost_task gets a SIGKILL, we could have outstanding IO in flight.
> We can easily stop new work/IO from being queued to the vhost_task, but
> for IO that's already been sent to something like the block layer we
> need to wait for the response then process it. These types of IO
> completions use the vhost_task to process the completion so we can't
> exit immediately.
>
> We need to wait for and then handle those completions from the
> vhost_task, but when we have a SIGKILL pending, functions like
> schedule() return immediately so we can't wait like normal. Functions
> like vhost_worker() degrade to just a while(1); loop.
>
> This patch has get_signal drop down to the normal code path when
> SIGNAL_GROUP_EXIT/group_exec_task is set so the caller can still detect
> there is a SIGKILL but still perform some blocking cleanup.
>
> Note that the chunk I'm now bypassing does:
>
> sigdelset(&current->pending.signal, SIGKILL);
>
> we look to be ok, because in the places we set SIGNAL_GROUP_EXIT/
> group_exec_task we are already doing that on the threads in the
> group.
>
> Signed-off-by: Mike Christie 
> ---
>  kernel/signal.c | 19 ++-
>  1 file changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 8f6330f0e9ca..ae4972eea5db 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2705,9 +2705,18 @@ bool get_signal(struct ksignal *ksig)
>   struct k_sigaction *ka;
>   enum pid_type type;
>  
> - /* Has this task already been marked for death? */
> - if ((signal->flags & SIGNAL_GROUP_EXIT) ||
> -  signal->group_exec_task) {
> + /*
> +  * Has this task already been marked for death?
> +  *
> +  * If this is a PF_USER_WORKER then the task may need to do
> +  * extra work that requires waiting on running work, so we want
> +  * to dequeue the signal below and tell the caller it's time to
> +  * start its exit procedure. When the work has completed then
> +  * the task will exit.
> +  */
> + if (!(current->flags & PF_USER_WORKER) &&
> + ((signal->flags & SIGNAL_GROUP_EXIT) ||
> +  signal->group_exec_task)) {
>   clear_siginfo(&ksig->info);
>   ksig->info.si_signo = signr = SIGKILL;
>   sigdelset(&current->pending.signal, SIGKILL);

This hunk is a confusing no-op.

> @@ -2861,11 +2870,11 @@ bool get_signal(struct ksignal *ksig)
>   }
>  
>   /*
> -  * PF_IO_WORKER threads will catch and exit on fatal signals
> +  * PF_USER_WORKER threads will catch and exit on fatal signals
>* themselves. They have cleanup that must be performed, so
>* we cannot call do_exit() on their behalf.
>*/
> - if (current->flags & PF_IO_WORKER)
> + if (current->flags & PF_USER_WORKER)
>   goto out;
>  
>   /*

This hunk is good and makes sense.

Eric


Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set

2023-05-17 Thread Eric W. Biederman
Mike Christie  writes:

> This has us dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is
> set when we are dealing with PF_USER_WORKER tasks.

> When a vhost_task gets a SIGKILL, we could have outstanding IO in flight.
> We can easily stop new work/IO from being queued to the vhost_task, but
> for IO that's already been sent to something like the block layer we
> need to wait for the response then process it. These types of IO
> completions use the vhost_task to process the completion so we can't
> exit immediately.


I understand the concern.

> We need to wait for and then handle those completions from the
> vhost_task, but when we have a SIGKILL pending, functions like
> schedule() return immediately so we can't wait like normal. Functions
> like vhost_worker() degrade to just a while(1); loop.
>
> This patch has get_signal drop down to the normal code path when
> SIGNAL_GROUP_EXIT/group_exec_task is set so the caller can still detect
> there is a SIGKILL but still perform some blocking cleanup.
>
> Note that the chunk I'm now bypassing does:
>
> sigdelset(&current->pending.signal, SIGKILL);
>
> we look to be ok, because in the places we set SIGNAL_GROUP_EXIT/
> group_exec_task we are already doing that on the threads in the
> group.

What you are doing does not make any sense to me.

First there is the semantic nonsense of queuing something that
is not a signal.  The per-task SIGKILL bit is used as a flag with
essentially the same meaning as SIGNAL_GROUP_EXIT, reporting that
the task has been scheduled for exit.

More so is what happens afterwards.

As I read your patch it is roughly equivalent to doing:

	if ((current->flags & PF_USER_WORKER) &&
	    fatal_signal_pending(current)) {
		sigdelset(&current->pending.signal, SIGKILL);
		clear_siginfo(&ksig->info);
		ksig->info.si_signo = SIGKILL;
		ksig->info.si_code = SI_USER;
		recalc_sigpending();
		trace_signal_deliver(SIGKILL, &ksig->info,
				     &sighand->action[SIGKILL - 1]);
		goto fatal;
	}

Before the "(SIGNAL_GROUP_EXIT || signal->group_exec_task)" test.

To get that code I stripped the active statements out of the
dequeue_signal path the code executes after your change below.

I don't get why you are making it though because the code you
are opting out of does:

	/* Has this task already been marked for death? */
	if ((signal->flags & SIGNAL_GROUP_EXIT) ||
	     signal->group_exec_task) {
		clear_siginfo(&ksig->info);
		ksig->info.si_signo = signr = SIGKILL;
		sigdelset(&current->pending.signal, SIGKILL);
		trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
				     &sighand->action[SIGKILL - 1]);
		recalc_sigpending();
		goto fatal;
	}

I don't see what in practice changes, other than the fact that by going
through the ordinary dequeue_signal path, other signals can be
processed after a SIGKILL has arrived.  Of course those signals should
all be blocked.




The trailing bit that expands the PF_IO_WORKER test to be PF_USER_WORKER
appears reasonable, and possibly needed.

Eric


> Signed-off-by: Mike Christie 
> ---
>  kernel/signal.c | 19 ++-
>  1 file changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 8f6330f0e9ca..ae4972eea5db 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -2705,9 +2705,18 @@ bool get_signal(struct ksignal *ksig)
>   struct k_sigaction *ka;
>   enum pid_type type;
>  
> - /* Has this task already been marked for death? */
> - if ((signal->flags & SIGNAL_GROUP_EXIT) ||
> -  signal->group_exec_task) {
> + /*
> +  * Has this task already been marked for death?
> +  *
> +  * If this is a PF_USER_WORKER then the task may need to do
> +  * extra work that requires waiting on running work, so we want
> +  * to dequeue the signal below and tell the caller it's time to
> +  * start its exit procedure. When the work has completed then
> +  * the task will exit.
> +  */
> + if (!(current->flags & PF_USER_WORKER) &&
> + ((signal->flags & SIGNAL_GROUP_EXIT) ||
> +  signal->group_exec_task)) {
>   clear_siginfo(&ksig->info);
>   ksig->info.si_signo = signr = SIGKILL;
>   sigdelset(&current->pending.signal, SIGKILL);
> @@ -2861,11 +2870,11 @@ bool get_signal(struct ksignal *ksig)
>   }
>  
>   /*
> -  * PF_IO_WORKER threads will catch and exit on fatal signals
> +  * PF_USER_WORKER threads will catch and

Re: [RFC PATCH 8/8] fork/vhost_task: remove no_files

2023-05-17 Thread Mike Christie
On 5/17/23 7:09 PM, Mike Christie wrote:
> +   CLONE_THREAD | CLONE_FILES, CLONE_SIGHAND,

Sorry. I tried to throw this one in at the last second so we could see
that we can now use CLONE_FILES like io_uring does. It will of course
not compile.


Re: [RFC PATCH 2/8] vhost/vhost_task: Hook vhost layer into signal handler

2023-05-17 Thread Mike Christie
On 5/17/23 7:16 PM, Linus Torvalds wrote:
> On Wed, May 17, 2023 at 5:09 PM Mike Christie
>  wrote:
>>
>> +   __set_current_state(TASK_RUNNING);
>> +   rc = get_signal(&ksig);
>> +   set_current_state(TASK_INTERRUPTIBLE);
>> +   return rc;
> 
> The games with current_state seem nonsensical.
> 
> What are they all about? get_signal() shouldn't care, and no other
> caller does this thing. This just seems completely random.

Sorry. It's a leftover.

I was originally calling this from vhost_task_should_stop, where before
calling that function we do a:

set_current_state(TASK_INTERRUPTIBLE);

So, I was hitting get_signal->try_to_freeze->might_sleep->__might_sleep
and was getting the "do not call blocking ops when !TASK_RUNNING"
warnings.

Re: Re: Re: [PATCH v2 1/2] virtio: abstract virtqueue related methods

2023-05-17 Thread zhenwei pi via Virtualization

On 5/17/23 18:39, Michael S. Tsirkin wrote:

On Wed, May 17, 2023 at 04:35:55PM +0800, zhenwei pi wrote:



On 5/17/23 15:46, Christoph Hellwig wrote:

On Wed, May 17, 2023 at 03:43:03PM +0800, zhenwei pi wrote:

I have a plan to introduce 'Virtio Over Fabrics' (TCP & RDMA) as a Virtio
transport, as mentioned in the cover letter of this series:
3 weeks ago, I posted a proposal 'Virtio Over Fabrics':
https://lists.oasis-open.org/archives/virtio-comment/202304/msg00442.html


Just don't do it.  Please define your own protocols over RDMA or TCP
for exactly the operations you need (for many they will already exist)
instead of piggybacking on virtio and making everyone else pay the
price.



Hi

1, `virtqueue_add_inbuf` in current version:
static inline int virtqueue_add_inbuf(struct virtqueue *vq,
				      struct scatterlist *sg,
				      unsigned int num,
				      void *data,
				      gfp_t gfp)
{
	if (likely(!vq->abstract))
		return vring_virtqueue_add_sgs(vq, &sg, num, 0, 1, data,
					       NULL, gfp);

	return vq->add_sgs(vq, &sg, num, 0, 1, data, NULL, gfp);
}

And disassemble 'virtinput_queue_evtbuf':
static void virtinput_queue_evtbuf(struct virtio_input *vi,
				   struct virtio_input_event *evtbuf)
{
	struct scatterlist sg[1];

	sg_init_one(sg, evtbuf, sizeof(*evtbuf));
	virtqueue_add_inbuf(vi->evt, sg, 1, evtbuf, GFP_ATOMIC);
}

I notice that two instructions are newly added for vring like:
  24d:	80 78 35 00	cmpb   $0x0,0x35(%rax)
  251:	75 3f		jne    292

Is it an expensive price...


Can we somehow only override the kick method?
Then take the ring and send it over ...



Could you please take a look at this code?
https://github.com/pizhenwei/linux/blob/virtio-of-github/drivers/virtio/virtio_fabrics.c#LL861C13-L861C23
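
For comparison, a rough sketch of the kick-only override idea (the
fabrics_notify hook below is illustrative, not taken from that code;
vring_create_virtqueue() already accepts a notify callback):

	static bool fabrics_notify(struct virtqueue *vq)
	{
		/* Instead of writing a doorbell, push the new avail
		 * entries over the TCP/RDMA connection here. */
		return true;
	}

	vq = vring_create_virtqueue(index, num, SMP_CACHE_BYTES, vdev,
				    true /* weak_barriers */,
				    true /* may_reduce_num */,
				    false /* ctx */,
				    fabrics_notify, callback, name);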




2, Storage/FS-specific remote protocols are quite popular, but I'm not
familiar with other device protocols. For example, I need a remote crypto
device to accelerate HTTPS... With Virtio Over Fabrics, I have a chance to
attach a virtio-crypto device to do this work.

--
zhenwei pi






Re: [RFC PATCH 2/8] vhost/vhost_task: Hook vhost layer into signal handler

2023-05-17 Thread Linus Torvalds
On Wed, May 17, 2023 at 5:09 PM Mike Christie
 wrote:
>
> +   __set_current_state(TASK_RUNNING);
> +   rc = get_signal(&ksig);
> +   set_current_state(TASK_INTERRUPTIBLE);
> +   return rc;

The games with current_state seem nonsensical.

What are they all about? get_signal() shouldn't care, and no other
caller does this thing. This just seems completely random.

  Linus

[RFC PATCH 8/8] fork/vhost_task: remove no_files

2023-05-17 Thread Mike Christie
The vhost_task can now support the worker being freed from under the
device when we get a SIGKILL or the process exits without closing
devices. We no longer need no_files so this removes it.

Signed-off-by: Mike Christie 
---
 include/linux/sched/task.h |  1 -
 kernel/fork.c  | 10 ++
 kernel/vhost_task.c|  3 +--
 3 files changed, 3 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 249a5ece9def..342fe297ffd4 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -28,7 +28,6 @@ struct kernel_clone_args {
u32 kthread:1;
u32 io_thread:1;
u32 user_worker:1;
-   u32 no_files:1;
u32 block_signals:1;
unsigned long stack;
unsigned long stack_size;
diff --git a/kernel/fork.c b/kernel/fork.c
index 9e04ab5c3946..f2c081c15efb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1769,8 +1769,7 @@ static int copy_fs(unsigned long clone_flags, struct 
task_struct *tsk)
return 0;
 }
 
-static int copy_files(unsigned long clone_flags, struct task_struct *tsk,
- int no_files)
+static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
 {
struct files_struct *oldf, *newf;
int error = 0;
@@ -1782,11 +1781,6 @@ static int copy_files(unsigned long clone_flags, struct 
task_struct *tsk,
if (!oldf)
goto out;
 
-   if (no_files) {
-   tsk->files = NULL;
-   goto out;
-   }
-
if (clone_flags & CLONE_FILES) {
atomic_inc(&oldf->count);
goto out;
@@ -2488,7 +2482,7 @@ __latent_entropy struct task_struct *copy_process(
retval = copy_semundo(clone_flags, p);
if (retval)
goto bad_fork_cleanup_security;
-   retval = copy_files(clone_flags, p, args->no_files);
+   retval = copy_files(clone_flags, p);
if (retval)
goto bad_fork_cleanup_semundo;
retval = copy_fs(clone_flags, p);
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index a11f036290cc..642047765190 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -96,12 +96,11 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), 
void *arg,
 {
struct kernel_clone_args args = {
.flags  = CLONE_FS | CLONE_UNTRACED | CLONE_VM |
-			  CLONE_THREAD | CLONE_SIGHAND,
+			  CLONE_THREAD | CLONE_FILES, CLONE_SIGHAND,
.exit_signal= 0,
.fn = vhost_task_fn,
.name   = name,
.user_worker= 1,
-   .no_files   = 1,
.block_signals  = 1,
};
struct vhost_task *vtsk;
-- 
2.25.1



[RFC PATCH 6/8] vhost-scsi: Add callback to stop and wait on works

2023-05-17 Thread Mike Christie
This moves the scsi code we use to stop new works from being queued
and wait on running works to a helper which is used by the vhost layer
when the vhost_task is being killed by a SIGKILL.

Signed-off-by: Mike Christie 
---
 drivers/vhost/scsi.c | 23 +++
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 40f9135e1a62..a0f2588270f2 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1768,6 +1768,19 @@ static int vhost_scsi_set_features(struct vhost_scsi 
*vs, u64 features)
return 0;
 }
 
+static void vhost_scsi_stop_dev_work(struct vhost_dev *dev)
+{
+   struct vhost_scsi *vs = container_of(dev, struct vhost_scsi, dev);
+   struct vhost_scsi_target t;
+
+   mutex_lock(&vs->dev.mutex);
+   memcpy(t.vhost_wwpn, vs->vs_vhost_wwpn, sizeof(t.vhost_wwpn));
+   mutex_unlock(&vs->dev.mutex);
+   vhost_scsi_clear_endpoint(vs, &t);
+   vhost_dev_stop(&vs->dev);
+   vhost_dev_cleanup(&vs->dev);
+}
+
 static int vhost_scsi_open(struct inode *inode, struct file *f)
 {
struct vhost_scsi *vs;
@@ -1821,7 +1834,7 @@ static int vhost_scsi_open(struct inode *inode, struct 
file *f)
vs->vqs[i].vq.handle_kick = vhost_scsi_handle_kick;
}
vhost_dev_init(&vs->dev, vqs, nvqs, UIO_MAXIOV, VHOST_SCSI_WEIGHT, 0,
-		       true, NULL, NULL);
+		       true, NULL, vhost_scsi_stop_dev_work);
 
vhost_scsi_init_inflight(vs, NULL);
 
@@ -1843,14 +1856,8 @@ static int vhost_scsi_open(struct inode *inode, struct 
file *f)
 static int vhost_scsi_release(struct inode *inode, struct file *f)
 {
struct vhost_scsi *vs = f->private_data;
-   struct vhost_scsi_target t;
 
-   mutex_lock(&vs->dev.mutex);
-   memcpy(t.vhost_wwpn, vs->vs_vhost_wwpn, sizeof(t.vhost_wwpn));
-   mutex_unlock(&vs->dev.mutex);
-   vhost_scsi_clear_endpoint(vs, &t);
-   vhost_dev_stop(&vs->dev);
-   vhost_dev_cleanup(&vs->dev);
+   vhost_dev_stop_work(&vs->dev);
kfree(vs->dev.vqs);
kfree(vs->vqs);
kfree(vs->old_inflight);
-- 
2.25.1



[RFC PATCH 7/8] vhost-net: Add callback to stop and wait on works

2023-05-17 Thread Mike Christie
This moves the net code we use to stop new works from being queued
and wait on running works to a helper which is used by the vhost layer
when the vhost_task is being killed by a SIGKILL.

Signed-off-by: Mike Christie 
---
 drivers/vhost/net.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 90c25127b3f8..f8a5527b15ba 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1325,9 +1325,9 @@ static void vhost_net_flush(struct vhost_net *n)
}
 }
 
-static int vhost_net_release(struct inode *inode, struct file *f)
+static void vhost_net_stop_dev_work(struct vhost_dev *dev)
 {
-   struct vhost_net *n = f->private_data;
+   struct vhost_net *n = container_of(dev, struct vhost_net, dev);
struct socket *tx_sock;
struct socket *rx_sock;
 
@@ -1345,6 +1345,13 @@ static int vhost_net_release(struct inode *inode, struct 
file *f)
/* We do an extra flush before freeing memory,
 * since jobs can re-queue themselves. */
vhost_net_flush(n);
+}
+
+static int vhost_net_release(struct inode *inode, struct file *f)
+{
+   struct vhost_net *n = f->private_data;
+
+   vhost_dev_stop_work(&n->dev);
kfree(n->vqs[VHOST_NET_VQ_RX].rxq.queue);
kfree(n->vqs[VHOST_NET_VQ_TX].xdp);
kfree(n->dev.vqs);
@@ -1409,7 +1416,7 @@ static int vhost_net_open(struct inode *inode, struct 
file *f)
vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,
   UIO_MAXIOV + VHOST_NET_BATCH,
   VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,
-  NULL, NULL);
+  NULL, vhost_net_stop_dev_work);
 
vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, 
dev);
vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev);
-- 
2.25.1



[RFC PATCH 5/8] vhost: Add callback that stops new work and waits on running ones

2023-05-17 Thread Mike Christie
When the vhost_task gets a SIGKILL we want to stop new work from being
queued and also wait for and handle completions for running work. For the
latter, we still need to use the vhost_task to handle the completing work
so we can't just exit right away. But, this has us kick off the stopping
and flushing/stopping of the device/vhost_task/worker to the system work
queue while the vhost_task handles completions. When all completions are
done we will then do vhost_task_stop and we will exit.

Signed-off-by: Mike Christie 
---
 drivers/vhost/net.c   |  2 +-
 drivers/vhost/scsi.c  |  4 ++--
 drivers/vhost/test.c  |  3 ++-
 drivers/vhost/vdpa.c  |  2 +-
 drivers/vhost/vhost.c | 48 ---
 drivers/vhost/vhost.h | 10 -
 drivers/vhost/vsock.c |  4 ++--
 7 files changed, 58 insertions(+), 15 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8557072ff05e..90c25127b3f8 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1409,7 +1409,7 @@ static int vhost_net_open(struct inode *inode, struct 
file *f)
vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,
   UIO_MAXIOV + VHOST_NET_BATCH,
   VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,
-  NULL);
+  NULL, NULL);
 
vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, 
dev);
vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev);
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index bb10fa4bb4f6..40f9135e1a62 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1820,8 +1820,8 @@ static int vhost_scsi_open(struct inode *inode, struct 
file *f)
vqs[i] = &vs->vqs[i].vq;
vs->vqs[i].vq.handle_kick = vhost_scsi_handle_kick;
}
-   vhost_dev_init(&vs->dev, vqs, nvqs, UIO_MAXIOV,
-		       VHOST_SCSI_WEIGHT, 0, true, NULL);
+	vhost_dev_init(&vs->dev, vqs, nvqs, UIO_MAXIOV, VHOST_SCSI_WEIGHT, 0,
+		       true, NULL, NULL);
 
vhost_scsi_init_inflight(vs, NULL);
 
diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index 42c955a5b211..11a2823d7532 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -120,7 +120,8 @@ static int vhost_test_open(struct inode *inode, struct file 
*f)
vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ];
n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV,
-		       VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT, true, NULL);
+		       VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT, true, NULL,
+		       NULL);
 
f->private_data = n;
 
diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 8c1aefc865f0..de9a83ecb70d 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -1279,7 +1279,7 @@ static int vhost_vdpa_open(struct inode *inode, struct 
file *filep)
vqs[i]->handle_kick = handle_vq_kick;
}
vhost_dev_init(dev, vqs, nvqs, 0, 0, 0, false,
-		       vhost_vdpa_process_iotlb_msg);
+		       vhost_vdpa_process_iotlb_msg, NULL);
 
r = vhost_vdpa_alloc_domain(v);
if (r)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 1ba9e068b2ab..4163c86db50c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -336,6 +336,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 static int vhost_worker(void *data)
 {
struct vhost_worker *worker = data;
+   struct vhost_dev *dev = worker->dev;
struct vhost_work *work, *work_next;
struct llist_node *node;
 
@@ -352,12 +353,13 @@ static int vhost_worker(void *data)
if (!node) {
schedule();
/*
-* When we get a SIGKILL our release function will
-* be called. That will stop new IOs from being queued
-* and check for outstanding cmd responses. It will then
-* call vhost_task_stop to exit us.
+* When we get a SIGKILL we kick off a work to
+* run the driver's helper to stop new work and
+* handle completions. When they are done they will
+* call vhost_task_stop to tell us to exit.
 */
-   vhost_task_get_signal();
+   if (vhost_task_get_signal())
+   schedule_work(&dev->destroy_worker);
}
 
node = llist_reverse_order(node);
@@ -376,6 +378,33 @@ static int vhost_worker(void *data)
return 0;
 }
 
+static void __vhost_dev_stop_work(struct vhost_dev *dev)
+{
+   mutex_lock(&dev->stop_work_mutex);
+   if (dev->work_stopped)
+   goto done;
+
+   if (dev->stop_dev_wor

[RFC PATCH 4/8] vhost-net: Move vhost_net_open

2023-05-17 Thread Mike Christie
This moves vhost_net_open so in the next patches we can pass
vhost_dev_init a new helper which will use the stop/flush functions.
There is no functionality changes in this patch.

Signed-off-by: Mike Christie 
---
 drivers/vhost/net.c | 134 ++--
 1 file changed, 67 insertions(+), 67 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 07181cd8d52e..8557072ff05e 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1285,73 +1285,6 @@ static void handle_rx_net(struct vhost_work *work)
handle_rx(net);
 }
 
-static int vhost_net_open(struct inode *inode, struct file *f)
-{
-   struct vhost_net *n;
-   struct vhost_dev *dev;
-   struct vhost_virtqueue **vqs;
-   void **queue;
-   struct xdp_buff *xdp;
-   int i;
-
-   n = kvmalloc(sizeof *n, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
-   if (!n)
-   return -ENOMEM;
-   vqs = kmalloc_array(VHOST_NET_VQ_MAX, sizeof(*vqs), GFP_KERNEL);
-   if (!vqs) {
-   kvfree(n);
-   return -ENOMEM;
-   }
-
-   queue = kmalloc_array(VHOST_NET_BATCH, sizeof(void *),
- GFP_KERNEL);
-   if (!queue) {
-   kfree(vqs);
-   kvfree(n);
-   return -ENOMEM;
-   }
-   n->vqs[VHOST_NET_VQ_RX].rxq.queue = queue;
-
-   xdp = kmalloc_array(VHOST_NET_BATCH, sizeof(*xdp), GFP_KERNEL);
-   if (!xdp) {
-   kfree(vqs);
-   kvfree(n);
-   kfree(queue);
-   return -ENOMEM;
-   }
-   n->vqs[VHOST_NET_VQ_TX].xdp = xdp;
-
-   dev = &n->dev;
-   vqs[VHOST_NET_VQ_TX] = &n->vqs[VHOST_NET_VQ_TX].vq;
-   vqs[VHOST_NET_VQ_RX] = &n->vqs[VHOST_NET_VQ_RX].vq;
-   n->vqs[VHOST_NET_VQ_TX].vq.handle_kick = handle_tx_kick;
-   n->vqs[VHOST_NET_VQ_RX].vq.handle_kick = handle_rx_kick;
-   for (i = 0; i < VHOST_NET_VQ_MAX; i++) {
-   n->vqs[i].ubufs = NULL;
-   n->vqs[i].ubuf_info = NULL;
-   n->vqs[i].upend_idx = 0;
-   n->vqs[i].done_idx = 0;
-   n->vqs[i].batched_xdp = 0;
-   n->vqs[i].vhost_hlen = 0;
-   n->vqs[i].sock_hlen = 0;
-   n->vqs[i].rx_ring = NULL;
-   vhost_net_buf_init(&n->vqs[i].rxq);
-   }
-   vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,
-		       UIO_MAXIOV + VHOST_NET_BATCH,
-		       VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,
-		       NULL);
-
-   vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, 
dev);
-   vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev);
-
-   f->private_data = n;
-   n->page_frag.page = NULL;
-   n->refcnt_bias = 0;
-
-   return 0;
-}
-
 static struct socket *vhost_net_stop_vq(struct vhost_net *n,
struct vhost_virtqueue *vq)
 {
@@ -1421,6 +1354,73 @@ static int vhost_net_release(struct inode *inode, struct 
file *f)
return 0;
 }
 
+static int vhost_net_open(struct inode *inode, struct file *f)
+{
+   struct vhost_net *n;
+   struct vhost_dev *dev;
+   struct vhost_virtqueue **vqs;
+   void **queue;
+   struct xdp_buff *xdp;
+   int i;
+
+   n = kvmalloc(sizeof *n, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+   if (!n)
+   return -ENOMEM;
+   vqs = kmalloc_array(VHOST_NET_VQ_MAX, sizeof(*vqs), GFP_KERNEL);
+   if (!vqs) {
+   kvfree(n);
+   return -ENOMEM;
+   }
+
+   queue = kmalloc_array(VHOST_NET_BATCH, sizeof(void *),
+ GFP_KERNEL);
+   if (!queue) {
+   kfree(vqs);
+   kvfree(n);
+   return -ENOMEM;
+   }
+   n->vqs[VHOST_NET_VQ_RX].rxq.queue = queue;
+
+   xdp = kmalloc_array(VHOST_NET_BATCH, sizeof(*xdp), GFP_KERNEL);
+   if (!xdp) {
+   kfree(vqs);
+   kvfree(n);
+   kfree(queue);
+   return -ENOMEM;
+   }
+   n->vqs[VHOST_NET_VQ_TX].xdp = xdp;
+
+   dev = &n->dev;
+   vqs[VHOST_NET_VQ_TX] = &n->vqs[VHOST_NET_VQ_TX].vq;
+   vqs[VHOST_NET_VQ_RX] = &n->vqs[VHOST_NET_VQ_RX].vq;
+   n->vqs[VHOST_NET_VQ_TX].vq.handle_kick = handle_tx_kick;
+   n->vqs[VHOST_NET_VQ_RX].vq.handle_kick = handle_rx_kick;
+   for (i = 0; i < VHOST_NET_VQ_MAX; i++) {
+   n->vqs[i].ubufs = NULL;
+   n->vqs[i].ubuf_info = NULL;
+   n->vqs[i].upend_idx = 0;
+   n->vqs[i].done_idx = 0;
+   n->vqs[i].batched_xdp = 0;
+   n->vqs[i].vhost_hlen = 0;
+   n->vqs[i].sock_hlen = 0;
+   n->vqs[i].rx_ring = NULL;
+   vhost_net_buf_init(&n->vqs[i].rxq);
+   }
+   vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,
+		       UIO_MAXIOV + VHOST_NET_BATCH,
+  VHOST_N

[RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set

2023-05-17 Thread Mike Christie
This has us dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is
set when we are dealing with PF_USER_WORKER tasks.

When a vhost_task gets a SIGKILL, we could have outstanding IO in flight.
We can easily stop new work/IO from being queued to the vhost_task, but
for IO that's already been sent to something like the block layer we
need to wait for the response then process it. These types of IO
completions use the vhost_task to process the completion so we can't
exit immediately.

We need to wait for and then handle those completions from the
vhost_task, but when we have a SIGKILL pending, functions like
schedule() return immediately so we can't wait like normal. Functions
like vhost_worker() degrade to just a while(1); loop.

This patch has get_signal drop down to the normal code path when
SIGNAL_GROUP_EXIT/group_exec_task is set so the caller can still detect
there is a SIGKILL but still perform some blocking cleanup.

Note that the chunk I'm now bypassing does:

sigdelset(&current->pending.signal, SIGKILL);

we look to be ok, because in the places we set SIGNAL_GROUP_EXIT/
group_exec_task we are already doing that on the threads in the
group.

Signed-off-by: Mike Christie 
---
 kernel/signal.c | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 8f6330f0e9ca..ae4972eea5db 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2705,9 +2705,18 @@ bool get_signal(struct ksignal *ksig)
struct k_sigaction *ka;
enum pid_type type;
 
-   /* Has this task already been marked for death? */
-   if ((signal->flags & SIGNAL_GROUP_EXIT) ||
-signal->group_exec_task) {
+   /*
+* Has this task already been marked for death?
+*
+* If this is a PF_USER_WORKER then the task may need to do
+* extra work that requires waiting on running work, so we want
+* to dequeue the signal below and tell the caller it's time to
+* start its exit procedure. When the work has completed then
+* the task will exit.
+*/
+   if (!(current->flags & PF_USER_WORKER) &&
+   ((signal->flags & SIGNAL_GROUP_EXIT) ||
+signal->group_exec_task)) {
clear_siginfo(&ksig->info);
ksig->info.si_signo = signr = SIGKILL;
sigdelset(&current->pending.signal, SIGKILL);
@@ -2861,11 +2870,11 @@ bool get_signal(struct ksignal *ksig)
}
 
/*
-* PF_IO_WORKER threads will catch and exit on fatal signals
+* PF_USER_WORKER threads will catch and exit on fatal signals
 * themselves. They have cleanup that must be performed, so
 * we cannot call do_exit() on their behalf.
 */
-   if (current->flags & PF_IO_WORKER)
+   if (current->flags & PF_USER_WORKER)
goto out;
 
/*
-- 
2.25.1



[RFC PATCH 2/8] vhost/vhost_task: Hook vhost layer into signal handler

2023-05-17 Thread Mike Christie
This patch has vhost use get_signal to handle freezing and sort of
handle signals. By the latter I mean that when we get SIGKILL, our
parent will exit and call our file_operations release function. That will
then stop new work from being queued and wait for the vhost_task to
handle completions for running IO. We then exit when those are done.

The next patches will then have us work more like io_uring, where
we handle the get_signal return value and key off that to clean up.

Signed-off-by: Mike Christie 
---
 drivers/vhost/vhost.c| 10 +-
 include/linux/sched/vhost_task.h |  1 +
 kernel/vhost_task.c  | 20 
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a92af08e7864..1ba9e068b2ab 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -349,8 +349,16 @@ static int vhost_worker(void *data)
}
 
node = llist_del_all(&worker->work_list);
-   if (!node)
+   if (!node) {
schedule();
+   /*
+* When we get a SIGKILL our release function will
+* be called. That will stop new IOs from being queued
+* and check for outstanding cmd responses. It will then
+* call vhost_task_stop to exit us.
+*/
+   vhost_task_get_signal();
+   }
 
node = llist_reverse_order(node);
/* make sure flag is seen after deletion */
diff --git a/include/linux/sched/vhost_task.h b/include/linux/sched/vhost_task.h
index 6123c10b99cf..54b68115eb3b 100644
--- a/include/linux/sched/vhost_task.h
+++ b/include/linux/sched/vhost_task.h
@@ -19,5 +19,6 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void 
*arg,
 void vhost_task_start(struct vhost_task *vtsk);
 void vhost_task_stop(struct vhost_task *vtsk);
 bool vhost_task_should_stop(struct vhost_task *vtsk);
+bool vhost_task_get_signal(void);
 
 #endif
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index b7cbd66f889e..a661cfa32ba3 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -61,6 +61,26 @@ bool vhost_task_should_stop(struct vhost_task *vtsk)
 }
 EXPORT_SYMBOL_GPL(vhost_task_should_stop);
 
+/**
+ * vhost_task_get_signal - Check if there are pending signals
+ *
+ * Return true if we got SIGKILL.
+ */
+bool vhost_task_get_signal(void)
+{
+   struct ksignal ksig;
+   bool rc;
+
+   if (!signal_pending(current))
+   return false;
+
+   __set_current_state(TASK_RUNNING);
+   rc = get_signal(&ksig);
+   set_current_state(TASK_INTERRUPTIBLE);
+   return rc;
+}
+EXPORT_SYMBOL_GPL(vhost_task_get_signal);
+
 /**
  * vhost_task_create - create a copy of a process to be used by the kernel
  * @fn: thread stack
-- 
2.25.1



[RFC PATCH 3/8] fork/vhost_task: Switch to CLONE_THREAD and CLONE_SIGHAND

2023-05-17 Thread Mike Christie
This is a modified version of Linus's patch which has vhost_task
use CLONE_THREAD and CLONE_SIGHAND and allow SIGKILL and SIGSTOP.

I renamed ignore_signals to block_signals based on Linus's comment,
since it aligns with what we are doing with the siginitsetinv()
p->blocked use now that we no longer call ignore_signals().

Signed-off-by: Mike Christie 
---
 include/linux/sched/task.h |  2 +-
 kernel/fork.c  | 12 +++-
 kernel/vhost_task.c|  5 +++--
 3 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 537cbf9a2ade..249a5ece9def 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -29,7 +29,7 @@ struct kernel_clone_args {
u32 io_thread:1;
u32 user_worker:1;
u32 no_files:1;
-   u32 ignore_signals:1;
+   u32 block_signals:1;
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
diff --git a/kernel/fork.c b/kernel/fork.c
index ed4e01daccaa..9e04ab5c3946 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2338,14 +2338,10 @@ __latent_entropy struct task_struct *copy_process(
p->flags |= PF_KTHREAD;
if (args->user_worker)
p->flags |= PF_USER_WORKER;
-   if (args->io_thread) {
-   /*
-* Mark us an IO worker, and block any signal that isn't
-* fatal or STOP
-*/
+   if (args->io_thread)
p->flags |= PF_IO_WORKER;
+   if (args->block_signals)
siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
-   }
 
if (args->name)
strscpy_pad(p->comm, args->name, sizeof(p->comm));
@@ -2517,9 +2513,6 @@ __latent_entropy struct task_struct *copy_process(
if (retval)
goto bad_fork_cleanup_io;
 
-   if (args->ignore_signals)
-   ignore_signals(p);
-
stackleak_task_init(p);
 
if (pid != &init_struct_pid) {
@@ -2861,6 +2854,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), 
void *arg, int node)
.fn_arg = arg,
.io_thread  = 1,
.user_worker= 1,
+   .block_signals  = 1,
};
 
return copy_process(NULL, 0, node, &args);
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index a661cfa32ba3..a11f036290cc 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -95,13 +95,14 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), 
void *arg,
 const char *name)
 {
struct kernel_clone_args args = {
-   .flags  = CLONE_FS | CLONE_UNTRACED | CLONE_VM,
+   .flags  = CLONE_FS | CLONE_UNTRACED | CLONE_VM |
+			  CLONE_THREAD | CLONE_SIGHAND,
.exit_signal= 0,
.fn = vhost_task_fn,
.name   = name,
.user_worker= 1,
.no_files   = 1,
-   .ignore_signals = 1,
+   .block_signals  = 1,
};
struct vhost_task *vtsk;
struct task_struct *tsk;
-- 
2.25.1



[RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND

2023-05-17 Thread Mike Christie
This patch allows the vhost and vhost_task code to use CLONE_THREAD,
CLONE_SIGHAND and CLONE_FILES. It's an RFC because I didn't do all the
normal testing, haven't converted vsock and vdpa, and I know you guys
will not like the first patch. However, I think it better shows what
we need from the signal code and how we can support signals in the
vhost_task layer.

Note that I took the super simple route and kicked off some work to
the system workqueue. We can do more invasive approaches:
1. Modify the vhost drivers so they can check for IO completions using
a non-blocking interface. We then don't need to run from the system
workqueue and can run from the vhost_task.

2. We could drop patch 1 and just say we are doing a polling type
of approach. We then modify the vhost layer similar to #1 where we
can check for completions using a non-blocking interface and use
the vhost_task task.





Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

2023-05-17 Thread Mike Christie
On 5/17/23 12:09 PM, Oleg Nesterov wrote:
> IIUC, Mike is going to send the next version? So I think we can delay
> the further discussions until then.

Yeah, I'm working on a version that supports signals so it will be easier
to discuss with the vhost devs and you, Christian and Eric.


Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

2023-05-17 Thread Oleg Nesterov
On 05/16, Eric W. Biederman wrote:
>
> Oleg Nesterov  writes:
>
> >> There is this bit in complete_signal when SIGKILL is delivered to any
> >> thread in the process.
> >>
> >>	t = p;
> >>	do {
> >>		task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
> >>		sigaddset(&t->pending.signal, SIGKILL);
> >>		signal_wake_up(t, 1);
> >>	} while_each_thread(p, t);
> >
> > That is why the latest version adds try_set_pending_sigkill(). No, no,
> > it is not that I think this is a good idea.
>
> I see that try_set_pending_sigkill in the patch now.
>
> That try_set_pending_sigkill just keeps the process from reporting
> that it has exited, and extends the process exit indefinitely.
>
> SIGNAL_GROUP_EXIT has already been set, so the KILL signal was
> already delivered and the process is exiting.

Agreed, that is why I said I don't think try_set_pending_sigkill() is
a good idea.

And again, the same is true for the threads created by
create_io_thread(). get_signal() from io_uring/ can dequeue a pending
SIGKILL and return, but that is all.

> >> For clarity that sigaddset(&t->pending.signal, SIGKILL);  Really isn't
> >> setting SIGKILL pending,
> >
> > Hmm. it does? Nevermind.
>
> The point is that what try_set_pending_sigkill in the patch is doing is
> keeping the "you are dead exit now" flag from being set.
>
> That flag is what fatal_signal_pending always tests, because we can only
> know if a fatal signal is pending if we have performed short circuit
> delivery on the signal.
>
> The result is that the effects of the change are mostly what people
> expect. The difference is that the semantics being changed aren't what
> people think they are.
>
> AKA process exit is being ignored for the thread, not that SIGKILL is
> being blocked.

Sorry, I don't understand. I just tried to say that
sigaddset(&t->pending.signal, SIGKILL) really sets SIGKILL pending.
Nevermind.
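
(For reference in this discussion: fatal_signal_pending() only reports
a SIGKILL that has been short-circuit delivered into this task's
per-task pending set. From include/linux/sched/signal.h:)

static inline int __fatal_signal_pending(struct task_struct *p)
{
	return unlikely(sigismember(&p->pending.signal, SIGKILL));
}

static inline int fatal_signal_pending(struct task_struct *p)
{
	return signal_pending(p) && __fatal_signal_pending(p);
}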

> > Although I never understood this logic.

I meant I never really liked how io-threads play with signals.

> > I can't even understand the usage
> > of lower_32_bits() in create_io_thread().
>
> As far as I can tell lower_32_bits(flags) is just defensive programming

Cough. But this is ugly. Or I missed something.
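
(For reference, the macro in question is just a u32 mask, defined in
the kernel's core headers, so in create_io_thread() it only guards
against stray high bits in the u64 clone flags:)

#define lower_32_bits(n) ((u32)((n) & 0xffffffff))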

> or have just populated .flags directly.

Exactly,

> Then .exit_signal
> could have been set to 0.

Exactly.

---
OK. It doesn't matter. I tried to read the whole thread and got lost.

IIUC, Mike is going to send the next version? So I think we can delay
the further discussions until then.

Oleg.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 0/7] riscv: Memory Hot(Un)Plug support

2023-05-17 Thread David Hildenbrand

On 12.05.23 16:57, Björn Töpel wrote:

From: Björn Töpel 

Memory Hot(Un)Plug support for the RISC-V port
==============================================

Introduction
------------

To quote "Documentation/admin-guide/mm/memory-hotplug.rst": "Memory
hot(un)plug allows for increasing and decreasing the size of physical
memory available to a machine at runtime."

This series attempts to add memory hot(un)plug support for the RISC-V
Linux port.

I'm sending the series as a v1, but it's borderline RFC. It definitely
needs more testing time, but some early input would be nice.

Implementation
--------------

From an arch perspective, a couple of callbacks need to be
implemented to support hot plugging:

arch_add_memory()
This callback is responsible for updating the linear/direct map, and
calls into the generic memory hot-plugging code via __add_pages().

arch_remove_memory()
In this callback the linear/direct map is torn down.

vmemmap_free()
The function tears down the vmemmap mappings (if
CONFIG_SPARSEMEM_VMEMMAP is in-use), and also deallocates the backing
vmemmap pages. Note that for persistent memory, an alternative
allocator for the backing pages can be used -- the vmem_altmap. This
means that when the backing pages are cleared, extra care is needed so
that the correct deallocation method is used. Note that RISC-V
populates the vmemmap using vmemmap_populate_basepages(), so currently
no hugepages are used for the backing store.
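
As a rough sketch of the shape described above (loosely modeled on
other architectures; create_linear_mapping_range() and
remove_linear_mapping() stand in for whatever helpers the series
actually introduces):

int arch_add_memory(int nid, u64 start, u64 size,
		    struct mhp_params *params)
{
	int ret;

	/* extend the linear/direct map first (hypothetical helper) */
	create_linear_mapping_range(start, start + size, params);

	/* then hand the range to the generic hotplug code */
	ret = __add_pages(nid, PFN_DOWN(start), size >> PAGE_SHIFT, params);
	if (ret)
		remove_linear_mapping(start, size); /* hypothetical */
	return ret;
}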

The page table unmap/teardown functions are heavily based (copied!)
from the x86 tree. The same remove_pgd_mapping() is used in both
vmemmap_free() and arch_remove_memory(), but in the latter function
the backing pages are not removed.

On RISC-V, the PGD-level kernel mappings need to be synchronized with
all page tables (e.g. via sync_kernel_mappings()). Synchronization
involves special care, like locking. Instead, this patch series takes
a different approach (introduced by Jörg Rödel in the x86 tree):
pre-allocate the PGD leaves (P4D, PUD, or PMD depending on the paging
setup) at mem_init(), for vmemmap and the direct map.

Pre-allocating the PGD leaves wastes some memory, but is only enabled
for CONFIG_MEMORY_HOTPLUG. The number of potentially unused pages is
~128 * 4K (i.e. 512 KiB).
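
The pre-allocation idea can be sketched as a one-time walk at
mem_init() that forces the level below the PGD to exist for the whole
range of interest; the function name here is illustrative:

static void __init preallocate_pgd_pages_range(unsigned long start,
					       unsigned long end)
{
	unsigned long addr;

	for (addr = start; addr < end; addr = ALIGN(addr + 1, PGDIR_SIZE)) {
		pgd_t *pgd = pgd_offset_k(addr);

		/* allocating the P4D leaf populates this PGD entry once
		 * and for all, so hotplug never modifies the PGD later */
		if (!p4d_alloc(&init_mm, pgd, addr))
			panic("Failed to pre-allocate page tables\n");
	}
}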

Patch 1: Preparation for hotplugging support, by pre-allocating the
  PGD leaves.

Patch 2: Changes the __init attribute to __meminit, so that the
  functions are not removed after init. __meminit keeps the
  functions after init if memory hotplugging is enabled for
  the build.
  
Patch 3: Refactor the direct map setup, so it can be used for hot add.


Patch 4: The actual add/remove code. Mostly a page-table-walk
  exercise.

Patch 5: Turn on the arch support in Kconfig

Patch 6: Now that memory hotplugging is enabled, make virtio-mem
  usable for RISC-V
  
Patch 7: Pre-allocate vmalloc PGD-leaves as well, which removes the
  need for vmalloc faulting.

RFC
---

  * TLB flushes. The current series uses Big Hammer flush-it-all.
  * Pre-allocation vs explicit syncs

Testing
-------

ACPI support is still in the making for RISC-V, so tests that involve
CXL and similar fanciness are currently not possible. Virtio-mem,
however, works without proper ACPI support. In order to try this out
in Qemu, some additional patches for Qemu are needed:

  * Enable virtio-mem for RISC-V
  * Add proper hotplug support for virtio-mem
  
The patch for Qemu is commit 5d90a7ef1bc0
("hw/riscv/virt: Support for virtio-mem-pci"), and can be found here:

   https://github.com/bjoto/qemu/tree/riscv-virtio-mem

I will try to upstream that work in parallel with this.

Thanks to David Hildenbrand for valuable input for the Qemu side of
things.

The series is based on the RISC-V fixes tree
   https://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux.git/log/?h=fixes



Cool stuff! I'm fairly busy right now, so some high-level questions upfront:

What is the memory section size (which implies the memory block size)?
This implies the minimum DIMM granularity and the high-level
granularity in which virtio-mem adds memory.


What is the pageblock size, implying the minimum granularity that 
virtio-mem can operate on?


On x86-64 and arm64 we currently use the ACPI SRAT to expose the maximum 
physical address where we can see memory getting hotplugged. [1] From 
that, we can derive the "max_possible_pfn" and prepare the kernel 
virtual memory layout (especially, the direct map).


Is something similar required on RISC-V? On s390x, I'm planning on 
adding a paravirtualized mechanism to detect where memory devices might 
be located. (I had a running RFC, but was distracted by all other kinds 
of stuff)



[1] https://virtio-mem.gitlab.io/developer-guide.html

--
Thanks,

David / dhildenb

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: Re: [PATCH v2 1/2] virtio: abstract virtqueue related methods

2023-05-17 Thread Michael S. Tsirkin
On Wed, May 17, 2023 at 04:35:55PM +0800, zhenwei pi wrote:
> 
> 
> On 5/17/23 15:46, Christoph Hellwig wrote:
> > On Wed, May 17, 2023 at 03:43:03PM +0800, zhenwei pi wrote:
> > > I have a plan to introduce 'Virtio Over Fabrics' (TCP&RDMA) as Virtio
> > > transport, as mentioned in cover letter of this series:
> > > 3 weeks ago, I posted a proposal 'Virtio Over Fabrics':
> > > https://lists.oasis-open.org/archives/virtio-comment/202304/msg00442.html
> > 
> > Just don't do it.  Please define your own protocols over RDMA or TCP
> > for exactly the operations you need (for many they will already exist)
> > instead of piggybacking on virtio and making everyone else pay the
> > price.
> > 
> 
> Hi
> 
> 1, `virtqueue_add_inbuf` in current version:
> static inline int virtqueue_add_inbuf(struct virtqueue *vq,
>   struct scatterlist *sg,
>   unsigned int num,
>   void *data,
>   gfp_t gfp)
> {
> if (likely(!vq->abstract))
> return vring_virtqueue_add_sgs(vq, &sg, num, 0, 1, data,
> NULL, gfp);
> 
> return vq->add_sgs(vq, &sg, num, 0, 1, data, NULL, gfp);
> }
> 
> And disassemble 'virtinput_queue_evtbuf':
> static void virtinput_queue_evtbuf(struct virtio_input *vi,
>struct virtio_input_event *evtbuf)
> {
> struct scatterlist sg[1];
> 
> sg_init_one(sg, evtbuf, sizeof(*evtbuf));
> virtqueue_add_inbuf(vi->evt, sg, 1, evtbuf, GFP_ATOMIC);
> }
> 
> I notice that two instructions are newly added for vring like:
>  24d:   80 78 35 00 cmpb   $0x0,0x35(%rax)
>  251:   75 3f   jne    292
> 
> Is it an expensive price...

Can we somehow only override the kick method?
Then take the ring and send it over ...
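
A sketch of that idea: keep the stock vring virtqueue and only pass a
transport-specific notify callback at creation time, so the add/get
fast paths stay untouched. vring_create_virtqueue() is the existing
API; the fabrics_* names are hypothetical:

static bool fabrics_notify(struct virtqueue *vq)
{
	/* snapshot the ring updates and push them over TCP/RDMA here */
	return true;
}

...
vq = vring_create_virtqueue(0 /* index */, 256 /* num */,
			    PAGE_SIZE /* vring_align */, vdev,
			    true /* weak_barriers */,
			    true /* may_reduce_num */,
			    false /* context */,
			    fabrics_notify, fabrics_callback, "fabrics-vq");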


> 2, Storage/FS-specific remote protocols are quite popular, but I'm not
> familiar with other device protocols. For example, I need a remote crypto
> device to accelerate HTTPS ... With Virtio Over Fabrics, I have a chance to
> attach a virtio-crypto device to do this work.
> 
> -- 
> zhenwei pi

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: Re: Re: Re: [PATCH v2 0/2] virtio: abstract virtqueue related methods

2023-05-17 Thread Michael S. Tsirkin
On Wed, May 17, 2023 at 02:21:09PM +0800, zhenwei pi wrote:
> On 5/17/23 14:10, Michael S. Tsirkin wrote:
> > On Wed, May 17, 2023 at 12:58:10PM +0800, zhenwei pi wrote:
> > > On 5/17/23 11:57, Michael S. Tsirkin wrote:
> > > > On Wed, May 17, 2023 at 11:51:03AM +0800, zhenwei pi wrote:
> > > > > 
> > > > > 
> > > > > On 5/17/23 11:46, Michael S. Tsirkin wrote:
> > > > > > On Wed, May 17, 2023 at 10:54:22AM +0800, zhenwei pi wrote:
> > > > > > > v1 -> v2:
> > > > > > > - Suggested by MST, use fast path for vring based performance
> > > > > > > sensitive API.
> > > > > > > - Reduce changes in tools/virtio.
> > > > > > > 
> > > > > > > Add test result (no obvious change):
> > > > > > > Before:
> > > > > > > time ./vringh_test --parallel
> > > > > > > Using CPUS 0 and 191
> > > > > > > Guest: notified 10036893, pinged 68278
> > > > > > > Host: notified 68278, pinged 3093532
> > > > > > > 
> > > > > > > real  0m14.463s
> > > > > > > user  0m6.437s
> > > > > > > sys   0m8.010s
> > > > > > > 
> > > > > > > After:
> > > > > > > time ./vringh_test --parallel
> > > > > > > Using CPUS 0 and 191
> > > > > > > Guest: notified 10036709, pinged 68347
> > > > > > > Host: notified 68347, pinged 3085292
> > > > > > > 
> > > > > > > real  0m14.196s
> > > > > > > user  0m6.289s
> > > > > > > sys   0m7.885s
> > > > > > > 
> > > > > > > v1:
> > > > > > > Hi,
> > > > > > > 
> > > > > > > 3 weeks ago, I posted a proposal 'Virtio Over Fabrics':
> > > > > > > https://lists.oasis-open.org/archives/virtio-comment/202304/msg00442.html
> > > > > > > 
> > > > > > > Jason and Stefan pointed out that a non-vring based virtqueue has
> > > > > > > a chance to override the virtqueue instead of using the vring
> > > > > > > virtqueue.
> > > > > > > 
> > > > > > > Then I try to abstract the virtqueue related methods in this
> > > > > > > series; the detailed changes are described in the comment of patch
> > > > > > > 'virtio: abstract virtqueue related methods'.
> > > > > > > 
> > > > > > > Something still remains:
> > > > > > > - __virtqueue_break/__virtqueue_unbreak are supposed to be used by
> > > > > > >  the internal virtio core; I'd like to rename them to
> > > > > > >  vring_virtqueue_break/vring_virtqueue_unbreak. Is this reasonable?
> > > > > > 
> > > > > > Why? These just set a flag?
> > > > > > 
> > > > > 
> > > > > Rename '__virtqueue_break' to 'vring_virtqueue_break', to make symbols
> > > > > exported from virtio_ring.ko have the unified prefix
> > > > > 'vring_virtqueue_xxx'.
> > > > 
> > > > I just do not see why you need these callbacks at all.
> > > > 
> > > 
> > > I use these callbacks for break/unbreak device like:
> > > static inline void virtio_break_device(struct virtio_device *dev)
> > > {
> > >   struct virtqueue *vq;
> > > 
> > >   spin_lock(&dev->vqs_list_lock);
> > >   list_for_each_entry(vq, &dev->vqs, list) {
> > >   vq->__break(vq);
> > >   }
> > >   spin_unlock(&dev->vqs_list_lock);
> > > }
> > 
> > why do this? backend knows they are broken.
> > 
> 
> I grep 'virtio_break_device' in the latest code:
> arch/um/drivers/virtio_uml.c:1147:virtio_break_device(&vu_dev->vdev);
> arch/um/drivers/virtio_uml.c:1285:virtio_break_device(&vu_dev->vdev);
> drivers/crypto/virtio/virtio_crypto_core.c:269:   
> virtio_break_device(vcrypto->vdev);
> drivers/s390/virtio/virtio_ccw.c:1251:
> virtio_break_device(&vcdev->vdev);
> drivers/s390/virtio/virtio_ccw.c:1268:
> virtio_break_device(&vcdev->vdev);
> drivers/firmware/arm_scmi/virtio.c:489:
> virtio_break_device(vioch->vqueue->vdev);
> drivers/char/virtio_console.c:1956:   virtio_break_device(vdev);
> 
> Some virtio drivers use 'virtio_break_device'...

You should read the code and understand what it does,
not just grep things and make assumptions.
What virtio_break_device does is stop Linux from sending
new requests.
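
For reference, the existing helper in drivers/virtio/virtio_ring.c is
roughly the following; it only marks each vring broken so that
subsequent add/kick calls fail fast:

void virtio_break_device(struct virtio_device *dev)
{
	struct virtqueue *_vq;

	spin_lock(&dev->vqs_list_lock);
	list_for_each_entry(_vq, &dev->vqs, list) {
		struct vring_virtqueue *vq = to_vvq(_vq);

		/* Pairs with READ_ONCE() in virtqueue_is_broken(). */
		WRITE_ONCE(vq->broken, true);
	}
	spin_unlock(&dev->vqs_list_lock);
}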


> > > > > > > - virtqueue_get_desc_addr/virtqueue_get_avail_addr/
> > > > > > >  virtqueue_get_used_addr/virtqueue_get_vring are vring specific;
> > > > > > >  I'd like to rename them like vring_virtqueue_get_desc_addr.
> > > > > > >  Is this reasonable?
> > > > > > > - there are still some functions in virtio_ring.c with the prefix
> > > > > > >  *virtqueue*, for example 'virtqueue_add_split'; just keep it or
> > > > > > >  rename it to 'vring_virtqueue_add_split'?
> > > > > > > zhenwei pi (2):
> > > > > > >  virtio: abstract virtqueue related methods
> > > > > > >  tools/virtio: implement virtqueue in test
> > > > > > > 
> > > > > > > drivers/virtio/virtio_ring.c | 285 +-
> > > > > > > include/linux/virtio.h   | 441 +++
> > > > > > > include/linux/virtio_ring.h  |  26 +++
> > > > > > > tools/virtio/linux/virtio.h  | 355 +---
> > > > > > > 4 files changed, 807 insertions(+), 300 deletions(-)

Re: Re: [PATCH v2 1/2] virtio: abstract virtqueue related methods

2023-05-17 Thread zhenwei pi via Virtualization




On 5/17/23 15:46, Christoph Hellwig wrote:

On Wed, May 17, 2023 at 03:43:03PM +0800, zhenwei pi wrote:

I have a plan to introduce 'Virtio Over Fabrics' (TCP&RDMA) as Virtio
transport, as mentioned in cover letter of this series:
3 weeks ago, I posted a proposal 'Virtio Over Fabrics':
https://lists.oasis-open.org/archives/virtio-comment/202304/msg00442.html


Just don't do it.  Please define your own protocols over RDMA or TCP
for exactly the operations you need (for many they will already exist)
instead of piggybacking on virtio and making everyone else pay the
price.



Hi

1, `virtqueue_add_inbuf` in current version:
static inline int virtqueue_add_inbuf(struct virtqueue *vq,
  struct scatterlist *sg,
  unsigned int num,
  void *data,
  gfp_t gfp)
{
	if (likely(!vq->abstract))
		return vring_virtqueue_add_sgs(vq, &sg, num, 0, 1,
					       data, NULL, gfp);

	return vq->add_sgs(vq, &sg, num, 0, 1, data, NULL, gfp);
}

And disassemble 'virtinput_queue_evtbuf':
static void virtinput_queue_evtbuf(struct virtio_input *vi,
   struct virtio_input_event *evtbuf)
{
struct scatterlist sg[1];

sg_init_one(sg, evtbuf, sizeof(*evtbuf));
virtqueue_add_inbuf(vi->evt, sg, 1, evtbuf, GFP_ATOMIC);
}

I notice that two instructions are newly added for vring like:
 24d:   80 78 35 00 cmpb   $0x0,0x35(%rax)
 251:   75 3f   jne    292

Is it an expensive price...

2, Storage/FS-specific remote protocols are quite popular, but I'm
not familiar with other device protocols. For example, I need a remote
crypto device to accelerate HTTPS ... With Virtio Over Fabrics, I have a 
chance to attach a virtio-crypto device to do this work.


--
zhenwei pi
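
One way to remove even that flag load from the fast path would be a
static key that is flipped only when a non-vring transport registers;
this is a design sketch, not something proposed in the series:

DEFINE_STATIC_KEY_FALSE(virtio_abstract_vq_key);

static inline int virtqueue_add_inbuf(struct virtqueue *vq,
				      struct scatterlist *sg,
				      unsigned int num,
				      void *data, gfp_t gfp)
{
	/* compiled to a straight-line NOP while no abstract vq exists */
	if (!static_branch_unlikely(&virtio_abstract_vq_key))
		return vring_virtqueue_add_sgs(vq, &sg, num, 0, 1, data,
					       NULL, gfp);

	return vq->add_sgs(vq, &sg, num, 0, 1, data, NULL, gfp);
}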
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 1/2] virtio: abstract virtqueue related methods

2023-05-17 Thread Christoph Hellwig
On Wed, May 17, 2023 at 03:43:03PM +0800, zhenwei pi wrote:
> I have a plan to introduce 'Virtio Over Fabrics' (TCP&RDMA) as Virtio
> transport, as mentioned in cover letter of this series:
> 3 weeks ago, I posted a proposal 'Virtio Over Fabrics':
> https://lists.oasis-open.org/archives/virtio-comment/202304/msg00442.html

Just don't do it.  Please define your own protocols over RDMA or TCP
for exactly the operations you need (for many they will already exist)
instead of piggybacking on virtio and making everyone else pay the
price.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: Re: [PATCH v2 1/2] virtio: abstract virtqueue related methods

2023-05-17 Thread zhenwei pi via Virtualization




On 5/17/23 15:39, Christoph Hellwig wrote:

On Wed, May 17, 2023 at 10:54:23AM +0800, zhenwei pi wrote:

All the vring based virtqueue methods could be abstracted in theory.
MST suggested that the add/get bufs and kick functions are quite
performance sensitive, so these functions are exported from
virtio_ring.ko and drivers still call them in a fast path.


Who is going to use this?  And why do you think every virtio user
would want to pay for indirect calls just for your shiny new feature?

This seems like an amazingly bad idea to me.


Hi,

I have a plan to introduce 'Virtio Over Fabrics' (TCP&RDMA) as Virtio
transport, as mentioned in cover letter of this series:

3 weeks ago, I posted a proposal 'Virtio Over Fabrics':
https://lists.oasis-open.org/archives/virtio-comment/202304/msg00442.html

Jason and Stefan pointed out that a non-vring based virtqueue has a
chance to override the virtqueue instead of using the vring virtqueue.

--
zhenwei pi
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 1/2] virtio: abstract virtqueue related methods

2023-05-17 Thread Christoph Hellwig
On Wed, May 17, 2023 at 10:54:23AM +0800, zhenwei pi wrote:
> All the vring based virtqueue methods could be abstracted in theory.
> MST suggested that the add/get bufs and kick functions are quite
> performance sensitive, so these functions are exported from
> virtio_ring.ko and drivers still call them in a fast path.

Who is going to use this?  And why do you think every virtio user
would want to pay for indirect calls just for your shiny new feature?

This seems like an amazingly bad idea to me.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 2/3] vhost: support PACKED when setting-getting vring_base

2023-05-17 Thread Stefano Garzarella
On Wed, May 17, 2023 at 7:26 AM Jason Wang  wrote:
>
> On Wed, May 17, 2023 at 2:26 AM Shannon Nelson  wrote:
> >
> > On 5/16/23 12:49 AM, Stefano Garzarella wrote:
> > > On Mon, May 15, 2023 at 01:41:12PM -0700, Shannon Nelson wrote:
> > >> On 5/9/23 1:46 AM, Stefano Garzarella wrote:
> > >>> On Mon, Apr 24, 2023 at 03:50:30PM -0700, Shannon Nelson via
> > >>> Virtualization wrote:
> >  Use the right structs for PACKED or split vqs when setting and
> >  getting the vring base.
> > 
> >  Signed-off-by: Shannon Nelson 
> >  ---
> >  drivers/vhost/vhost.c | 18 +-
> >  drivers/vhost/vhost.h |  8 ++--
> >  2 files changed, 19 insertions(+), 7 deletions(-)
> > 
> >  diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> >  index f11bdbe4c2c5..f64efda48f21 100644
> >  --- a/drivers/vhost/vhost.c
> >  +++ b/drivers/vhost/vhost.c
> >  @@ -1633,17 +1633,25 @@ long vhost_vring_ioctl(struct vhost_dev
> >  *d, unsigned int ioctl, void __user *arg
> >    r = -EFAULT;
> >    break;
> >    }
> >  -  if (s.num > 0xffff) {
> >  -  r = -EINVAL;
> >  -  break;
> >  +  if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED)) {
> >  +  vq->last_avail_idx = s.num & 0xffff;
> >  +  vq->last_used_idx = (s.num >> 16) & 0xffff;
> >  +  } else {
> >  +  if (s.num > 0xffff) {
> >  +  r = -EINVAL;
> >  +  break;
> >  +  }
> >  +  vq->last_avail_idx = s.num;
> >    }
> >  -  vq->last_avail_idx = s.num;
> >    /* Forget the cached index value. */
> >    vq->avail_idx = vq->last_avail_idx;
> >    break;
> >    case VHOST_GET_VRING_BASE:
> >    s.index = idx;
> >  -  s.num = vq->last_avail_idx;
> >  +  if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
> >  +  s.num = (u32)vq->last_avail_idx |
> >  ((u32)vq->last_used_idx << 16);
> >  +  else
> >  +  s.num = vq->last_avail_idx;
> > >>>
> > >>> The changes LGTM, but since we are changing the UAPI, should we
> > >>> update the documentation of VHOST_SET_VRING_BASE and
> > >>> VHOST_GET_VRING_BASE in include/uapi/linux/vhost.h?
> > >>
> > >> Correct me if I'm wrong, but I don't think we're changing anything in
> > >> the UAPI here, just fixing code to work correctly with what is already
> > >> happening.
> > >
> > > IIUC before this patch VHOST_GET_VRING_BASE and VHOST_SET_VRING_BASE
> > > never worked with packed virtqueue, since we were only handling
> > > last_avail_idx. Now we are supporting packed virtqueue, handling
> > > in vhost_vring_state.num both last_avail_idx and last_used_idx (with
> > > wrap counters).
> > >
> > > For example, for VHOST_GET_VRING_BASE, where is it documented that the
> > > first 15 bits are last_avail_idx, the 16th is the avail_wrap_counter,
> > > and the others are last_used_idx and used_wrap_counter?
> > >
> > > Maybe I missed something, but since this is UAPI, IMHO we should
> > > document the parameters of ioctls at least in
> > > include/uapi/linux/vhost.h.
> >
> > Perhaps Jason already has something written up that could be put in here
> > from when he first added the wrap_counter a couple of years ago?
>
> If you meant the virtio driver support for packed, I think it's
> different from the context which is vhost here.
>
> I agree with Stefano that we need to update the comments around
> GET_VRING_BASE and SET_VRING_BASE, then we are fine.

I'm wondering if we should also add a new VHOST_BACKEND_F_RING_PACKED
feature (or something similar) to inform user space that we are now
able to handle packed virtqueues through the vhost IOCTLs; otherwise,
how can userspace know whether it is supported?

Thanks,
Stefano
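
If the encoding described above is adopted, a userspace decode of
vhost_vring_state.num for a packed ring would look something like this
sketch (field layout per the diff earlier in the thread):

#include <stdbool.h>
#include <stdint.h>

struct packed_vring_base {
	uint16_t last_avail_idx;	/* bits 0..14 */
	bool avail_wrap_counter;	/* bit 15 */
	uint16_t last_used_idx;		/* bits 16..30 */
	bool used_wrap_counter;		/* bit 31 */
};

static struct packed_vring_base decode_vring_base(uint32_t num)
{
	return (struct packed_vring_base){
		.last_avail_idx     = num & 0x7fff,
		.avail_wrap_counter = (num >> 15) & 1,
		.last_used_idx      = (num >> 16) & 0x7fff,
		.used_wrap_counter  = (num >> 31) & 1,
	};
}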

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization