Re: [PATCH vhost v9 05/12] virtio_ring: split: virtqueue_add_split() support premapped
On Wed, May 17, 2023 at 10:23 AM Xuan Zhuo wrote: > > virtqueue_add_split() only supports virtual addresses, dma is completed > in virtqueue_add_split(). > > In some scenarios (such as the AF_XDP scenario), the memory is allocated > and DMA is completed in advance, so it is necessary for us to support > passing the DMA address to virtqueue_add_split(). > > Record this information in desc_state, we can skip unmap based on this > when executing dma unmap. I would also suggest documenting why a per descriptor metadata is needed instead of a per virtqueue one. > > Signed-off-by: Xuan Zhuo > --- > drivers/virtio/virtio_ring.c | 38 +++- > 1 file changed, 29 insertions(+), 9 deletions(-) > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > index e2fc50c05bec..bd5e84afab37 100644 > --- a/drivers/virtio/virtio_ring.c > +++ b/drivers/virtio/virtio_ring.c > @@ -70,6 +70,7 @@ > struct vring_desc_state_split { > void *data; /* Data for callback. */ > struct vring_desc *indir_desc; /* Indirect descriptor, if any. */ > + bool premapped; /* DMA mapping is done by driver. */ Going back to the original discussion around where this should be placed. I wonder if we can find a common place to store this since it has nothing related to virtqueue layout. Maybe desc_extra? And it would be even better if we can avoid stressing the cache like above. > }; > > struct vring_desc_state_packed { > @@ -356,8 +357,14 @@ static struct device *vring_dma_dev(const struct > vring_virtqueue *vq) > > /* Map one sg entry. */ > static int vring_map_one_sg(const struct vring_virtqueue *vq, struct > scatterlist *sg, > - enum dma_data_direction direction, static > dma_addr_t *addr) > + enum dma_data_direction direction, > + bool premapped, dma_addr_t *addr) having things like: int func(bool do) { if (!do) return; } is a hint that the check needs to be done by the caller? And this change should work for both packed and split. I think we need to squash the packed changes here. 
Looking at how packed virtqueue uses this in this patch, I don't think this patch can even be built. I will wait for a new version and continue the review from there. Thanks > { > + if (premapped) { > + *addr = sg_dma_address(sg); > + return 0; > + } > + > if (!vq->use_dma_api) { > /* > * If DMA is not used, KMSAN doesn't know that the scatterlist > @@ -445,7 +452,7 @@ static void vring_unmap_one_split_indirect(const struct > vring_virtqueue *vq, > } > > static unsigned int vring_unmap_one_split(const struct vring_virtqueue *vq, > - unsigned int i) > + unsigned int i, bool premapped) > { > struct vring_desc_extra *extra = vq->split.desc_extra; > u16 flags; > @@ -462,6 +469,9 @@ static unsigned int vring_unmap_one_split(const struct > vring_virtqueue *vq, > (flags & VRING_DESC_F_WRITE) ? > DMA_FROM_DEVICE : DMA_TO_DEVICE); > } else { > + if (premapped) > + goto out; > + > dma_unmap_page(vring_dma_dev(vq), >extra[i].addr, >extra[i].len, > @@ -532,6 +542,7 @@ static inline int virtqueue_add_split(struct virtqueue > *_vq, > unsigned int in_sgs, > void *data, > void *ctx, > + bool premapped, > gfp_t gfp) > { > struct vring_virtqueue *vq = to_vvq(_vq); > @@ -595,7 +606,7 @@ static inline int virtqueue_add_split(struct virtqueue > *_vq, > for (sg = sgs[n]; sg; sg = sg_next(sg)) { > dma_addr_t addr; > > - if (vring_map_one_sg(vq, sg, DMA_TO_DEVICE, &addr)) > + if (vring_map_one_sg(vq, sg, DMA_TO_DEVICE, > premapped, &addr)) > goto unmap_release; > > prev = i; > @@ -611,7 +622,7 @@ static inline int virtqueue_add_split(struct virtqueue > *_vq, > for (sg = sgs[n]; sg; sg = sg_next(sg)) { > dma_addr_t addr; > > - if (vring_map_one_sg(vq, sg, DMA_FROM_DEVICE, &addr)) > + if (vring_map_one_sg(vq, sg, DMA_FROM_DEVICE, > premapped, &addr)) > goto unmap_release; > > prev = i; > @@ -657,6 +668,7 @@ static inline int virtqueue_add_split(struct virtqueue > *_vq, > > /* Store token and indirect buffer state. */ > vq->split.
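Setting aside Jason's question about where to store the flag, the mechanics of the premapped path are easy to model in plain userspace C. Below is a minimal sketch — stub types and hypothetical names, not the kernel code — of how vring_map_one_sg() short-circuits when the driver has already done the DMA mapping:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t dma_addr_t;

/* Minimal stand-in for a scatterlist entry: it either carries a
 * driver-provided DMA address (premapped) or a physical address
 * that the core would otherwise use. */
struct sg_stub {
    dma_addr_t dma_address;  /* filled by the driver when premapped */
    dma_addr_t phys;         /* what the core-side path would yield */
};

/* Sketch of the vring_map_one_sg() change: when the buffer is
 * premapped, skip the mapping step and trust the driver's address. */
static int map_one_sg(const struct sg_stub *sg, bool premapped,
                      dma_addr_t *addr)
{
    if (premapped) {
        *addr = sg->dma_address;   /* sg_dma_address(sg) in the patch */
        return 0;
    }
    *addr = sg->phys;              /* models the !use_dma_api fallback */
    return 0;
}
```

The matching unmap side is symmetric: it consults the per-descriptor premapped flag and returns early before calling dma_unmap_page().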
Re: [PATCH vhost v9 04/12] virtio_ring: virtqueue_add() support premapped
On Wed, May 17, 2023 at 10:23 AM Xuan Zhuo wrote: > > virtuque_add() adds parameter premapped. I wonder if this patch is over simplified. Maybe it can be squashed with the patch that implements the premapped logic. Thanks > > Signed-off-by: Xuan Zhuo > --- > drivers/virtio/virtio_ring.c | 9 + > 1 file changed, 5 insertions(+), 4 deletions(-) > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > index 1ffab1eb40c0..e2fc50c05bec 100644 > --- a/drivers/virtio/virtio_ring.c > +++ b/drivers/virtio/virtio_ring.c > @@ -2135,6 +2135,7 @@ static inline int virtqueue_add(struct virtqueue *_vq, > unsigned int in_sgs, > void *data, > void *ctx, > + bool premapped, > gfp_t gfp) > { > struct vring_virtqueue *vq = to_vvq(_vq); > @@ -2176,7 +2177,7 @@ int virtqueue_add_sgs(struct virtqueue *_vq, > total_sg++; > } > return virtqueue_add(_vq, sgs, total_sg, out_sgs, in_sgs, > -data, NULL, gfp); > +data, NULL, false, gfp); > } > EXPORT_SYMBOL_GPL(virtqueue_add_sgs); > > @@ -2198,7 +2199,7 @@ int virtqueue_add_outbuf(struct virtqueue *vq, > void *data, > gfp_t gfp) > { > - return virtqueue_add(vq, &sg, num, 1, 0, data, NULL, gfp); > + return virtqueue_add(vq, &sg, num, 1, 0, data, NULL, false, gfp); > } > EXPORT_SYMBOL_GPL(virtqueue_add_outbuf); > > @@ -2220,7 +2221,7 @@ int virtqueue_add_inbuf(struct virtqueue *vq, > void *data, > gfp_t gfp) > { > - return virtqueue_add(vq, &sg, num, 0, 1, data, NULL, gfp); > + return virtqueue_add(vq, &sg, num, 0, 1, data, NULL, false, gfp); > } > EXPORT_SYMBOL_GPL(virtqueue_add_inbuf); > > @@ -2244,7 +2245,7 @@ int virtqueue_add_inbuf_ctx(struct virtqueue *vq, > void *ctx, > gfp_t gfp) > { > - return virtqueue_add(vq, &sg, num, 0, 1, data, ctx, gfp); > + return virtqueue_add(vq, &sg, num, 0, 1, data, ctx, false, gfp); > } > EXPORT_SYMBOL_GPL(virtqueue_add_inbuf_ctx); > > -- > 2.32.0.3.g01195cf9f > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org 
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH vhost v9 03/12] virtio_ring: check use_dma_api before unmap desc for indirect
On Wed, May 17, 2023 at 10:23 AM Xuan Zhuo wrote: > > Inside detach_buf_split(), if use_dma_api is false, > vring_unmap_one_split_indirect will be called many times, but actually > nothing is done. So this patch check use_dma_api firstly. > > Signed-off-by: Xuan Zhuo Acked-by: Jason Wang Thanks > --- > drivers/virtio/virtio_ring.c | 6 -- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > index 479203346c36..1ffab1eb40c0 100644 > --- a/drivers/virtio/virtio_ring.c > +++ b/drivers/virtio/virtio_ring.c > @@ -783,8 +783,10 @@ static void detach_buf_split(struct vring_virtqueue *vq, > unsigned int head, > VRING_DESC_F_INDIRECT)); > BUG_ON(len == 0 || len % sizeof(struct vring_desc)); > > - for (j = 0; j < len / sizeof(struct vring_desc); j++) > - vring_unmap_one_split_indirect(vq, &indir_desc[j]); > + if (vq->use_dma_api) { > + for (j = 0; j < len / sizeof(struct vring_desc); j++) > + vring_unmap_one_split_indirect(vq, > &indir_desc[j]); > + } > > kfree(indir_desc); > state->indir_desc = NULL; > -- > 2.32.0.3.g01195cf9f > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
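The optimization here is simply hoisting a loop-invariant check out of the loop. A toy model (hypothetical names, not the kernel code) of the before/after behavior:

```c
#include <assert.h>
#include <stdbool.h>

static int unmap_calls;

/* Stand-in for vring_unmap_one_split_indirect(): in the !use_dma_api
 * case the real helper does nothing, yet was still called per entry. */
static void unmap_one_indirect(void)
{
    unmap_calls++;
}

/* After the patch: a single check on the queue-wide flag skips the
 * whole per-descriptor loop instead of entering it n times. */
static void detach_indirect(bool use_dma_api, int n_descs)
{
    if (use_dma_api) {
        for (int j = 0; j < n_descs; j++)
            unmap_one_indirect();
    }
}
```

With use_dma_api false, the loop body is never reached, which is exactly the saving the patch describes for detach_buf_split().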
Re: [PATCH vhost v9 02/12] virtio_ring: simplify the reference of desc state inside detach_buf_split()
On Wed, May 17, 2023 at 10:23 AM Xuan Zhuo wrote: > > The purpose of this is to simplify the reference to state. It is > convenient for subsequent commit. It's better to be verbose, e.g how it can simplify the following patches. Thanks > > Signed-off-by: Xuan Zhuo > --- > drivers/virtio/virtio_ring.c | 12 +++- > 1 file changed, 7 insertions(+), 5 deletions(-) > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > index c563215be6b9..479203346c36 100644 > --- a/drivers/virtio/virtio_ring.c > +++ b/drivers/virtio/virtio_ring.c > @@ -744,11 +744,14 @@ static bool virtqueue_kick_prepare_split(struct > virtqueue *_vq) > static void detach_buf_split(struct vring_virtqueue *vq, unsigned int head, > void **ctx) > { > + struct vring_desc_state_split *state; > unsigned int i, j; > __virtio16 nextflag = cpu_to_virtio16(vq->vq.vdev, VRING_DESC_F_NEXT); > > + state = &vq->split.desc_state[head]; > + > /* Clear data ptr. */ > - vq->split.desc_state[head].data = NULL; > + state->data = NULL; > > /* Put back on free list: unmap first-level descriptors and find end > */ > i = head; > @@ -767,8 +770,7 @@ static void detach_buf_split(struct vring_virtqueue *vq, > unsigned int head, > vq->vq.num_free++; > > if (vq->indirect) { > - struct vring_desc *indir_desc = > - vq->split.desc_state[head].indir_desc; > + struct vring_desc *indir_desc = state->indir_desc; > u32 len; > > /* Free the indirect table, if any, now that it's unmapped. */ > @@ -785,9 +787,9 @@ static void detach_buf_split(struct vring_virtqueue *vq, > unsigned int head, > vring_unmap_one_split_indirect(vq, &indir_desc[j]); > > kfree(indir_desc); > - vq->split.desc_state[head].indir_desc = NULL; > + state->indir_desc = NULL; > } else if (ctx) { > - *ctx = vq->split.desc_state[head].indir_desc; > + *ctx = state->indir_desc; > } > } > > -- > 2.32.0.3.g01195cf9f > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH vhost v9 01/12] virtio_ring: put mapping error check in vring_map_one_sg
On Wed, May 17, 2023 at 10:23 AM Xuan Zhuo wrote: > > This patch put the dma addr error check in vring_map_one_sg(). > > The benefits of doing this: > > 1. make vring_map_one_sg more simple, without calling >vring_mapping_error to check the return value. > 2. reduce one judgment of vq->use_dma_api. Code looks fine but it's better to explain how it relates or simply anything with this series. Thanks > > Signed-off-by: Xuan Zhuo > --- > drivers/virtio/virtio_ring.c | 37 +--- > 1 file changed, 22 insertions(+), 15 deletions(-) > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c > index c5310eaf8b46..c563215be6b9 100644 > --- a/drivers/virtio/virtio_ring.c > +++ b/drivers/virtio/virtio_ring.c > @@ -355,9 +355,8 @@ static struct device *vring_dma_dev(const struct > vring_virtqueue *vq) > } > > /* Map one sg entry. */ > -static dma_addr_t vring_map_one_sg(const struct vring_virtqueue *vq, > - struct scatterlist *sg, > - enum dma_data_direction direction) > +static int vring_map_one_sg(const struct vring_virtqueue *vq, struct > scatterlist *sg, > + enum dma_data_direction direction, static > dma_addr_t *addr) > { > if (!vq->use_dma_api) { > /* > @@ -366,7 +365,8 @@ static dma_addr_t vring_map_one_sg(const struct > vring_virtqueue *vq, > * depending on the direction. > */ > kmsan_handle_dma(sg_page(sg), sg->offset, sg->length, > direction); > - return (dma_addr_t)sg_phys(sg); > + *addr = (dma_addr_t)sg_phys(sg); > + return 0; > } > > /* > @@ -374,9 +374,14 @@ static dma_addr_t vring_map_one_sg(const struct > vring_virtqueue *vq, > * the way it expects (we don't guarantee that the scatterlist > * will exist for the lifetime of the mapping). 
> */ > - return dma_map_page(vring_dma_dev(vq), > + *addr = dma_map_page(vring_dma_dev(vq), > sg_page(sg), sg->offset, sg->length, > direction); > + > + if (dma_mapping_error(vring_dma_dev(vq), *addr)) > + return -ENOMEM; > + > + return 0; > } > > static dma_addr_t vring_map_single(const struct vring_virtqueue *vq, > @@ -588,8 +593,9 @@ static inline int virtqueue_add_split(struct virtqueue > *_vq, > > for (n = 0; n < out_sgs; n++) { > for (sg = sgs[n]; sg; sg = sg_next(sg)) { > - dma_addr_t addr = vring_map_one_sg(vq, sg, > DMA_TO_DEVICE); > - if (vring_mapping_error(vq, addr)) > + dma_addr_t addr; > + > + if (vring_map_one_sg(vq, sg, DMA_TO_DEVICE, &addr)) > goto unmap_release; > > prev = i; > @@ -603,8 +609,9 @@ static inline int virtqueue_add_split(struct virtqueue > *_vq, > } > for (; n < (out_sgs + in_sgs); n++) { > for (sg = sgs[n]; sg; sg = sg_next(sg)) { > - dma_addr_t addr = vring_map_one_sg(vq, sg, > DMA_FROM_DEVICE); > - if (vring_mapping_error(vq, addr)) > + dma_addr_t addr; > + > + if (vring_map_one_sg(vq, sg, DMA_FROM_DEVICE, &addr)) > goto unmap_release; > > prev = i; > @@ -1279,9 +1286,8 @@ static int virtqueue_add_indirect_packed(struct > vring_virtqueue *vq, > > for (n = 0; n < out_sgs + in_sgs; n++) { > for (sg = sgs[n]; sg; sg = sg_next(sg)) { > - addr = vring_map_one_sg(vq, sg, n < out_sgs ? > - DMA_TO_DEVICE : DMA_FROM_DEVICE); > - if (vring_mapping_error(vq, addr)) > + if (vring_map_one_sg(vq, sg, n < out_sgs ? > +DMA_TO_DEVICE : DMA_FROM_DEVICE, > &addr)) > goto unmap_release; > > desc[i].flags = cpu_to_le16(n < out_sgs ? > @@ -1426,9 +1432,10 @@ static inline int virtqueue_add_packed(struct > virtqueue *_vq, > c = 0; > for (n = 0; n < out_sgs + in_sgs; n++) { > for (sg = sgs[n]; sg; sg = sg_next(sg)) { > - dma_addr_t addr = vring_map_one_sg(vq, sg, n < > out_sgs ? > - DMA_TO_DEVICE : DMA_FROM_DEVICE); > - if (vring_mapping_error(vq, addr)) > + dma_addr_t addr; > + > + if (vring_map_one_sg(vq, sg, n < out_sgs ? 
> +DMA_TO_DEVICE : DMA_FROM_DEVICE, > &addr)) > goto unmap_release; > > flags = cpu_to_le16(vq->packed.av
Re: [PATCH v2 2/3] vhost: support PACKED when setting-getting vring_base
On Wed, May 17, 2023 at 3:00 PM Stefano Garzarella wrote: > > On Wed, May 17, 2023 at 7:26 AM Jason Wang wrote: > > > > On Wed, May 17, 2023 at 2:26 AM Shannon Nelson > > wrote: > > > > > > On 5/16/23 12:49 AM, Stefano Garzarella wrote: > > > > On Mon, May 15, 2023 at 01:41:12PM -0700, Shannon Nelson wrote: > > > >> On 5/9/23 1:46 AM, Stefano Garzarella wrote: > > > >>> On Mon, Apr 24, 2023 at 03:50:30PM -0700, Shannon Nelson via > > > >>> Virtualization wrote: > > > Use the right structs for PACKED or split vqs when setting and > > > getting the vring base. > > > > > > Signed-off-by: Shannon Nelson > > > --- > > > drivers/vhost/vhost.c | 18 +- > > > drivers/vhost/vhost.h | 8 ++-- > > > 2 files changed, 19 insertions(+), 7 deletions(-) > > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > > > index f11bdbe4c2c5..f64efda48f21 100644 > > > --- a/drivers/vhost/vhost.c > > > +++ b/drivers/vhost/vhost.c > > > @@ -1633,17 +1633,25 @@ long vhost_vring_ioctl(struct vhost_dev > > > *d, unsigned int ioctl, void __user *arg > > > r = -EFAULT; > > > break; > > > } > > > - if (s.num > 0xffff) { > > > - r = -EINVAL; > > > - break; > > > + if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED)) { > > > + vq->last_avail_idx = s.num & 0xffff; > > > + vq->last_used_idx = (s.num >> 16) & 0xffff; > > > + } else { > > > + if (s.num > 0xffff) { > > > + r = -EINVAL; > > > + break; > > > + } > > > + vq->last_avail_idx = s.num; > > > } > > > - vq->last_avail_idx = s.num; > > > /* Forget the cached index value. 
*/ > > > vq->avail_idx = vq->last_avail_idx; > > > break; > > > case VHOST_GET_VRING_BASE: > > > s.index = idx; > > > - s.num = vq->last_avail_idx; > > > + if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED)) > > > + s.num = (u32)vq->last_avail_idx | > > > ((u32)vq->last_used_idx << 16); > > > + else > > > + s.num = vq->last_avail_idx; > > > >>> > > > >>> The changes LGTM, but since we are changing the UAPI, should we > > > >>> update the documentation of VHOST_SET_VRING_BASE and > > > >>> VHOST_GET_VRING_BASE in include/uapi/linux/vhost.h? > > > >> > > > >> Correct me if I'm wrong, but I don't think we're changing anything in > > > >> the UAPI here, just fixing code to work correctly with what is already > > > >> happening. > > > > > > > > IIUC before this patch VHOST_GET_VRING_BASE and VHOST_SET_VRING_BASE > > > > never worked with packed virtqueue, since we were only handling > > > > last_avail_idx. Now we are supporting packed virtqueue, handling > > > > in vhost_vring_state.num both last_avail_idx and last_used_idx (with > > > > wrap counters). > > > > > > > > For example for VHOST_GET_VRING_BASE where is documented that the first > > > > 15 bits are last_avail_idx, the 16th the avail_wrap_counter, and the > > > > others are last_used_idx and used_wrap_counter? > > > > > > > > Maybe I missed something, but since this is UAPI, IMHO we should > > > > document the parameters of ioctls at least in > > > > include/uapi/linux/vhost.h. > > > > > > Perhaps Jason already has something written up that could be put in here > > > from when he first added the wrap_counter a couple of years ago? > > > > If you meant the virtio driver support for packed, I think it's > > different from the context which is vhost here. > > > > I agree with Stefano that we need to update the comments around > > GET_VRING_BASE and SET_VRING_BASE, then we are fine. 
> > I'm thinking if we should also add a new VHOST_BACKEND_F_RING_PACKED > feature (or something similar) to inform the user space that now we > are able to handle packed virtqueue through vhost IOCTLs, otherwise > how can the userspace know if it is supported or not? I probably understand this but I think it should be done via VHOST_GET_FEATURES. It would be a burden if we maintain duplicated features. Thanks > > Thanks, > Stefano
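For reference, the num encoding under discussion for packed rings can be sketched as plain bit arithmetic: the low 16 bits carry last_avail_idx (with the avail wrap counter in bit 15) and the high 16 bits carry last_used_idx (with the used wrap counter in bit 15). A userspace sketch, not the vhost code:

```c
#include <assert.h>
#include <stdint.h>

/* Pack the packed-ring vring base the way the patch uses
 * vhost_vring_state.num for VHOST_GET_VRING_BASE. */
static uint32_t pack_vring_base(uint16_t last_avail, uint16_t last_used)
{
    return (uint32_t)last_avail | ((uint32_t)last_used << 16);
}

/* Inverse operation, as done in VHOST_SET_VRING_BASE: mask out each
 * 16-bit half; bit 15 of each half is the respective wrap counter. */
static void unpack_vring_base(uint32_t num,
                              uint16_t *last_avail, uint16_t *last_used)
{
    *last_avail = num & 0xffff;
    *last_used  = (num >> 16) & 0xffff;
}
```

This is the format Stefano is asking to have documented next to the ioctl definitions in include/uapi/linux/vhost.h.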
Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
Long story short. In the patch below the first hunk is a noop. The code you are bypassing was added to ensure that process termination (aka SIGKILL) is processed before any other signals. Other than signal processing order there are not any substantive differences in the two code paths. With all signals except SIGSTOP == 19 and SIGKILL == 9 blocked, SIGKILL should always be processed before SIGSTOP. Can you try the patch with just the last hunk that does s/PF_IO_WORKER/PF_USER_WORKER/ and see if that is enough? I have no objections to the final hunk. Mike Christie writes: > This has us dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is > set when we are dealing with PF_USER_WORKER tasks. > > When a vhost_task gets a SIGKILL, we could have outstanding IO in flight. > We can easily stop new work/IO from being queued to the vhost_task, but > for IO that's already been sent to something like the block layer we > need to wait for the response then process it. These types of IO > completions use the vhost_task to process the completion so we can't > exit immediately. > > We need to wait for and then handle those completions from the > vhost_task, but when we have a SIGKILL pending, functions like > schedule() return immediately so we can't wait like normal. Functions > like vhost_worker() degrade to just a while(1); loop. > > This patch has get_signal drop down to the normal code path when > SIGNAL_GROUP_EXIT/group_exec_task is set so the caller can still detect > there is a SIGKILL but still perform some blocking cleanup. > > Note that in that chunk I'm now bypassing that does: > > sigdelset(&current->pending.signal, SIGKILL); > > we look to be ok, because in the places we set SIGNAL_GROUP_EXIT/ > group_exec_task we are already doing that on the threads in the > group. 
> > Signed-off-by: Mike Christie > --- > kernel/signal.c | 19 ++- > 1 file changed, 14 insertions(+), 5 deletions(-) > > diff --git a/kernel/signal.c b/kernel/signal.c > index 8f6330f0e9ca..ae4972eea5db 100644 > --- a/kernel/signal.c > +++ b/kernel/signal.c > @@ -2705,9 +2705,18 @@ bool get_signal(struct ksignal *ksig) > struct k_sigaction *ka; > enum pid_type type; > > - /* Has this task already been marked for death? */ > - if ((signal->flags & SIGNAL_GROUP_EXIT) || > - signal->group_exec_task) { > + /* > + * Has this task already been marked for death? > + * > + * If this is a PF_USER_WORKER then the task may need to do > + * extra work that requires waiting on running work, so we want > + * to dequeue the signal below and tell the caller it's time to > + * start its exit procedure. When the work has completed then > + * the task will exit. > + */ > + if (!(current->flags & PF_USER_WORKER) && > + ((signal->flags & SIGNAL_GROUP_EXIT) || > + signal->group_exec_task)) { > clear_siginfo(&ksig->info); > ksig->info.si_signo = signr = SIGKILL; > sigdelset(&current->pending.signal, SIGKILL); This hunk is a confusing no-op. > @@ -2861,11 +2870,11 @@ bool get_signal(struct ksignal *ksig) > } > > /* > - * PF_IO_WORKER threads will catch and exit on fatal signals > + * PF_USER_WORKER threads will catch and exit on fatal signals >* themselves. They have cleanup that must be performed, so >* we cannot call do_exit() on their behalf. >*/ > - if (current->flags & PF_IO_WORKER) > goto out; > > /* This hunk is good and makes sense. Eric
Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
Mike Christie writes: > This has us dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is > set when we are dealing with PF_USER_WORKER tasks. > When a vhost_task gets a SIGKILL, we could have outstanding IO in flight. > We can easily stop new work/IO from being queued to the vhost_task, but > for IO that's already been sent to something like the block layer we > need to wait for the response then process it. These types of IO > completions use the vhost_task to process the completion so we can't > exit immediately. I understand the concern. > We need to wait for and then handle those completions from the > vhost_task, but when we have a SIGKILL pending, functions like > schedule() return immediately so we can't wait like normal. Functions > like vhost_worker() degrade to just a while(1); loop. > > This patch has get_signal drop down to the normal code path when > SIGNAL_GROUP_EXIT/group_exec_task is set so the caller can still detect > there is a SIGKILL but still perform some blocking cleanup. > > Note that in that chunk I'm now bypassing that does: > > sigdelset(&current->pending.signal, SIGKILL); > > we look to be ok, because in the places we set SIGNAL_GROUP_EXIT/ > group_exec_task we are already doing that on the threads in the > group. What you are doing does not make any sense to me. First there is the semantic nonsense of queuing something that is not a signal. The per-task SIGKILL bit is used as a flag with essentially the same meaning as SIGNAL_GROUP_EXIT, reporting that the task has been scheduled for exit. More so is what happens afterwards. 
As I read your patch it is roughly equivalent to doing: if ((current->flags & PF_USER_WORKER) && fatal_signal_pending(current)) { sigdelset(&current->pending.signal, SIGKILL); clear_siginfo(&ksig->info); ksig->info.si_signo = SIGKILL; ksig->info.si_code = SI_USER; recalc_sigpending(); trace_signal_deliver(SIGKILL, &ksig->info, &sighand->action[SIGKILL - 1]); goto fatal; } Before the "(SIGNAL_GROUP_EXIT || signal->group_exec_task)" test. To get that code I stripped the active statements out of the dequeue_signal path the code executes after your change below. I don't get why you are making it though because the code you are opting out of does: /* Has this task already been marked for death? */ if ((signal->flags & SIGNAL_GROUP_EXIT) || signal->group_exec_task) { clear_siginfo(&ksig->info); ksig->info.si_signo = signr = SIGKILL; sigdelset(&current->pending.signal, SIGKILL); trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO, &sighand->action[SIGKILL - 1]); recalc_sigpending(); goto fatal; } I don't see what in practice changes, other than the fact that by going through the ordinary dequeue_signal path that other signals can be processed after a SIGKILL has arrived. Of course those signals should all be blocked. The trailing bit that expands the PF_IO_WORKER test to be PF_USER_WORKER appears reasonable, and possibly needed. Eric
> + * > + * If this is a PF_USER_WORKER then the task may need to do > + * extra work that requires waiting on running work, so we want > + * to dequeue the signal below and tell the caller it's time to > + * start its exit procedure. When the work has completed then > + * the task will exit. > + */ > + if (!(current->flags & PF_USER_WORKER) && > + ((signal->flags & SIGNAL_GROUP_EXIT) || > + signal->group_exec_task)) { > clear_siginfo(&ksig->info); > ksig->info.si_signo = signr = SIGKILL; > sigdelset(&current->pending.signal, SIGKILL); > @@ -2861,11 +2870,11 @@ bool get_signal(struct ksignal *ksig) > } > > /* > - * PF_IO_WORKER threads will catch and exit on fatal signals > + * PF_USER_WORKER threads will catch and
Re: [RFC PATCH 8/8] fork/vhost_task: remove no_files
On 5/17/23 7:09 PM, Mike Christie wrote: > + CLONE_THREAD | CLONE_FILES, CLONE_SIGHAND, Sorry. I tried to throw this one in at the last second so we could see that we can now use CLONE_FILES like io_uring. It will of course not compile.
Re: [RFC PATCH 2/8] vhost/vhost_task: Hook vhost layer into signal handler
On 5/17/23 7:16 PM, Linus Torvalds wrote: > On Wed, May 17, 2023 at 5:09 PM Mike Christie > wrote: >> >> + __set_current_state(TASK_RUNNING); >> + rc = get_signal(&ksig); >> + set_current_state(TASK_INTERRUPTIBLE); >> + return rc; > > The games with current_state seem nonsensical. > > What are they all about? get_signal() shouldn't care, and no other > caller does this thing. This just seems completely random. Sorry. It's a leftover. I was originally calling this from vhost_task_should_stop where before calling that function we do a: set_current_state(TASK_INTERRUPTIBLE); So, I was hitting get_signal->try_to_freeze->might_sleep->__might_sleep and was getting the "do not call blocking ops when !TASK_RUNNING" warnings.
Re: Re: Re: [PATCH v2 1/2] virtio: abstract virtqueue related methods
On 5/17/23 18:39, Michael S. Tsirkin wrote: On Wed, May 17, 2023 at 04:35:55PM +0800, zhenwei pi wrote: On 5/17/23 15:46, Christoph Hellwig wrote: On Wed, May 17, 2023 at 03:43:03PM +0800, zhenwei pi wrote: I have a plan to introduce 'Virtio Over Fabrics'(TCP&RDMA) as Virtio transport, as mentioned in cover letter of this series: 3 weeks ago, I posted a proposal 'Virtio Over Fabrics': https://lists.oasis-open.org/archives/virtio-comment/202304/msg00442.html Just don't do it. Please define your own protocols over RDMA or TCP for exactly the operations you need (for many they will already exist) instead of piggyg backing on virtio and making everyone else pay the price. Hi 1, `virtqueue_add_inbuf` in current version: static inline int virtqueue_add_inbuf(struct virtqueue *vq, struct scatterlist *sg, unsigned int num, void *data, gfp_t gfp) { if (likely(!vq->abstract)) return vring_virtqueue_add_sgs(vq, &sg, num, 0, 1, data, NULL, gfp); return vq->add_sgs(vq, &sg, num, 0, 1, data, NULL, gfp); } And disassemble 'virtinput_queue_evtbuf': static void virtinput_queue_evtbuf(struct virtio_input *vi, struct virtio_input_event *evtbuf) { struct scatterlist sg[1]; sg_init_one(sg, evtbuf, sizeof(*evtbuf)); virtqueue_add_inbuf(vi->evt, sg, 1, evtbuf, GFP_ATOMIC); } I notice that two instructions are newly added for vring like: 24d: 80 78 35 00 cmpb $0x0,0x35(%rax) 251: 75 3f jne292 Is it an expensive price... Can we somehow only override the kick method? Then take the ring and send it over ... Could you please take a look at this code? https://github.com/pizhenwei/linux/blob/virtio-of-github/drivers/virtio/virtio_fabrics.c#LL861C13-L861C23 2, Storage/FS specific remote protocol is quite popular, otherwise I'm not familiar with other device protocols. For example, I need a remote crypto device to accelerate HTTPS ... With Virtio Over Fabrics, I have a chance to attach a virtio-crypto device to do this work. 
-- zhenwei pi
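The dispatch pattern being measured above — one branch on an `abstract` flag guarding an indirect call, so the common vring case stays a direct call — can be modeled in userspace C (stub types and hypothetical names, not the proposed kernel code):

```c
#include <assert.h>
#include <stdbool.h>

struct vq_stub;
typedef int (*add_fn)(struct vq_stub *vq, int n);

struct vq_stub {
    bool abstract;   /* set only for fabrics-backed queues */
    add_fn add_sgs;  /* consulted only when abstract is true */
    int added;
};

/* Stand-in for the real vring fast path. */
static int vring_add(struct vq_stub *vq, int n)
{
    vq->added += n;
    return 0;
}

/* Stand-in for a fabrics transport; scaled so the test can tell
 * the two paths apart. */
static int fabrics_add(struct vq_stub *vq, int n)
{
    vq->added += 10 * n;
    return 0;
}

/* The wrapper shape Jason quoted: the likely() branch keeps the
 * vring case direct; only abstract transports pay the indirect call. */
static int vq_add(struct vq_stub *vq, int n)
{
    if (!vq->abstract)
        return vring_add(vq, n);
    return vq->add_sgs(vq, n);
}
```

The extra cmpb/jne zhenwei disassembled corresponds to the single `!vq->abstract` test here; Michael's counter-suggestion is to avoid even that by overriding only the kick method.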
Re: [RFC PATCH 2/8] vhost/vhost_task: Hook vhost layer into signal handler
On Wed, May 17, 2023 at 5:09 PM Mike Christie wrote: > > + __set_current_state(TASK_RUNNING); > + rc = get_signal(&ksig); > + set_current_state(TASK_INTERRUPTIBLE); > + return rc; The games with current_state seem nonsensical. What are they all about? get_signal() shouldn't care, and no other caller does this thing. This just seems completely random. Linus
[RFC PATCH 8/8] fork/vhost_task: remove no_files
The vhost_task can now support the worker being freed from under the device when we get a SIGKILL or the process exits without closing devices. We no longer need no_files so this removes it. Signed-off-by: Mike Christie --- include/linux/sched/task.h | 1 - kernel/fork.c | 10 ++ kernel/vhost_task.c| 3 +-- 3 files changed, 3 insertions(+), 11 deletions(-) diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index 249a5ece9def..342fe297ffd4 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -28,7 +28,6 @@ struct kernel_clone_args { u32 kthread:1; u32 io_thread:1; u32 user_worker:1; - u32 no_files:1; u32 block_signals:1; unsigned long stack; unsigned long stack_size; diff --git a/kernel/fork.c b/kernel/fork.c index 9e04ab5c3946..f2c081c15efb 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1769,8 +1769,7 @@ static int copy_fs(unsigned long clone_flags, struct task_struct *tsk) return 0; } -static int copy_files(unsigned long clone_flags, struct task_struct *tsk, - int no_files) +static int copy_files(unsigned long clone_flags, struct task_struct *tsk) { struct files_struct *oldf, *newf; int error = 0; @@ -1782,11 +1781,6 @@ static int copy_files(unsigned long clone_flags, struct task_struct *tsk, if (!oldf) goto out; - if (no_files) { - tsk->files = NULL; - goto out; - } - if (clone_flags & CLONE_FILES) { atomic_inc(&oldf->count); goto out; @@ -2488,7 +2482,7 @@ __latent_entropy struct task_struct *copy_process( retval = copy_semundo(clone_flags, p); if (retval) goto bad_fork_cleanup_security; - retval = copy_files(clone_flags, p, args->no_files); + retval = copy_files(clone_flags, p); if (retval) goto bad_fork_cleanup_semundo; retval = copy_fs(clone_flags, p); diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c index a11f036290cc..642047765190 100644 --- a/kernel/vhost_task.c +++ b/kernel/vhost_task.c @@ -96,12 +96,11 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg, { struct kernel_clone_args args = { 
.flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM | - CLONE_THREAD | CLONE_SIGHAND, + CLONE_THREAD | CLONE_FILES, CLONE_SIGHAND, .exit_signal= 0, .fn = vhost_task_fn, .name = name, .user_worker= 1, - .no_files = 1, .block_signals = 1, }; struct vhost_task *vtsk; -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[RFC PATCH 6/8] vhost-scsi: Add callback to stop and wait on works
This moves the scsi code we use to stop new works from being queued and wait on running works to a helper which is used by the vhost layer when the vhost_task is being killed by a SIGKILL. Signed-off-by: Mike Christie --- drivers/vhost/scsi.c | 23 +++ 1 file changed, 15 insertions(+), 8 deletions(-) diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c index 40f9135e1a62..a0f2588270f2 100644 --- a/drivers/vhost/scsi.c +++ b/drivers/vhost/scsi.c @@ -1768,6 +1768,19 @@ static int vhost_scsi_set_features(struct vhost_scsi *vs, u64 features) return 0; } +static void vhost_scsi_stop_dev_work(struct vhost_dev *dev) +{ + struct vhost_scsi *vs = container_of(dev, struct vhost_scsi, dev); + struct vhost_scsi_target t; + + mutex_lock(&vs->dev.mutex); + memcpy(t.vhost_wwpn, vs->vs_vhost_wwpn, sizeof(t.vhost_wwpn)); + mutex_unlock(&vs->dev.mutex); + vhost_scsi_clear_endpoint(vs, &t); + vhost_dev_stop(&vs->dev); + vhost_dev_cleanup(&vs->dev); +} + static int vhost_scsi_open(struct inode *inode, struct file *f) { struct vhost_scsi *vs; @@ -1821,7 +1834,7 @@ static int vhost_scsi_open(struct inode *inode, struct file *f) vs->vqs[i].vq.handle_kick = vhost_scsi_handle_kick; } vhost_dev_init(&vs->dev, vqs, nvqs, UIO_MAXIOV, VHOST_SCSI_WEIGHT, 0, - true, NULL, NULL); + true, NULL, vhost_scsi_stop_dev_work); vhost_scsi_init_inflight(vs, NULL); @@ -1843,14 +1856,8 @@ static int vhost_scsi_open(struct inode *inode, struct file *f) static int vhost_scsi_release(struct inode *inode, struct file *f) { struct vhost_scsi *vs = f->private_data; - struct vhost_scsi_target t; - mutex_lock(&vs->dev.mutex); - memcpy(t.vhost_wwpn, vs->vs_vhost_wwpn, sizeof(t.vhost_wwpn)); - mutex_unlock(&vs->dev.mutex); - vhost_scsi_clear_endpoint(vs, &t); - vhost_dev_stop(&vs->dev); - vhost_dev_cleanup(&vs->dev); + vhost_dev_stop_work(&vs->dev); kfree(vs->dev.vqs); kfree(vs->vqs); kfree(vs->old_inflight); -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org 
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[RFC PATCH 7/8] vhost-net: Add callback to stop and wait on works
This moves the net code we use to stop new works from being queued and wait on running works to a helper which is used by the vhost layer when the vhost_task is being killed by a SIGKILL. Signed-off-by: Mike Christie --- drivers/vhost/net.c | 13 ++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 90c25127b3f8..f8a5527b15ba 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -1325,9 +1325,9 @@ static void vhost_net_flush(struct vhost_net *n) } } -static int vhost_net_release(struct inode *inode, struct file *f) +static void vhost_net_stop_dev_work(struct vhost_dev *dev) { - struct vhost_net *n = f->private_data; + struct vhost_net *n = container_of(dev, struct vhost_net, dev); struct socket *tx_sock; struct socket *rx_sock; @@ -1345,6 +1345,13 @@ static int vhost_net_release(struct inode *inode, struct file *f) /* We do an extra flush before freeing memory, * since jobs can re-queue themselves. */ vhost_net_flush(n); +} + +static int vhost_net_release(struct inode *inode, struct file *f) +{ + struct vhost_net *n = f->private_data; + + vhost_dev_stop_work(&n->dev); kfree(n->vqs[VHOST_NET_VQ_RX].rxq.queue); kfree(n->vqs[VHOST_NET_VQ_TX].xdp); kfree(n->dev.vqs); @@ -1409,7 +1416,7 @@ static int vhost_net_open(struct inode *inode, struct file *f) vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX, UIO_MAXIOV + VHOST_NET_BATCH, VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true, - NULL, NULL); + NULL, vhost_net_stop_dev_work); vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev); vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev); -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[RFC PATCH 5/8] vhost: Add callback that stops new work and waits on running ones
When the vhost_task gets a SIGKILL we want to stop new work from being queued and also wait for and handle completions for running work. For the latter, we still need to use the vhost_task to handle the completing work so we can't just exit right away. But, this has us kick off the stopping and flushing/stopping of the device/vhost_task/worker to the system work queue while the vhost_task handles completions. When all completions are done we will then do vhost_task_stop and we will exit. Signed-off-by: Mike Christie --- drivers/vhost/net.c | 2 +- drivers/vhost/scsi.c | 4 ++-- drivers/vhost/test.c | 3 ++- drivers/vhost/vdpa.c | 2 +- drivers/vhost/vhost.c | 48 --- drivers/vhost/vhost.h | 10 - drivers/vhost/vsock.c | 4 ++-- 7 files changed, 58 insertions(+), 15 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 8557072ff05e..90c25127b3f8 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -1409,7 +1409,7 @@ static int vhost_net_open(struct inode *inode, struct file *f) vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX, UIO_MAXIOV + VHOST_NET_BATCH, VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true, - NULL); + NULL, NULL); vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev); vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev); diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c index bb10fa4bb4f6..40f9135e1a62 100644 --- a/drivers/vhost/scsi.c +++ b/drivers/vhost/scsi.c @@ -1820,8 +1820,8 @@ static int vhost_scsi_open(struct inode *inode, struct file *f) vqs[i] = &vs->vqs[i].vq; vs->vqs[i].vq.handle_kick = vhost_scsi_handle_kick; } - vhost_dev_init(&vs->dev, vqs, nvqs, UIO_MAXIOV, - VHOST_SCSI_WEIGHT, 0, true, NULL); + vhost_dev_init(&vs->dev, vqs, nvqs, UIO_MAXIOV, VHOST_SCSI_WEIGHT, 0, + true, NULL, NULL); vhost_scsi_init_inflight(vs, NULL); diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c index 42c955a5b211..11a2823d7532 100644 --- a/drivers/vhost/test.c +++ b/drivers/vhost/test.c @@ -120,7 
+120,8 @@ static int vhost_test_open(struct inode *inode, struct file *f) vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ]; n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick; vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV, - VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT, true, NULL); + VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT, true, NULL, + NULL); f->private_data = n; diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 8c1aefc865f0..de9a83ecb70d 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -1279,7 +1279,7 @@ static int vhost_vdpa_open(struct inode *inode, struct file *filep) vqs[i]->handle_kick = handle_vq_kick; } vhost_dev_init(dev, vqs, nvqs, 0, 0, 0, false, - vhost_vdpa_process_iotlb_msg); + vhost_vdpa_process_iotlb_msg, NULL); r = vhost_vdpa_alloc_domain(v); if (r) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 1ba9e068b2ab..4163c86db50c 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -336,6 +336,7 @@ static void vhost_vq_reset(struct vhost_dev *dev, static int vhost_worker(void *data) { struct vhost_worker *worker = data; + struct vhost_dev *dev = worker->dev; struct vhost_work *work, *work_next; struct llist_node *node; @@ -352,12 +353,13 @@ static int vhost_worker(void *data) if (!node) { schedule(); /* -* When we get a SIGKILL our release function will -* be called. That will stop new IOs from being queued -* and check for outstanding cmd responses. It will then -* call vhost_task_stop to exit us. +* When we get a SIGKILL we kick off a work to +* run the driver's helper to stop new work and +* handle completions. When they are done they will +* call vhost_task_stop to tell us to exit. 
*/ - vhost_task_get_signal(); + if (vhost_task_get_signal()) + schedule_work(&dev->destroy_worker); } node = llist_reverse_order(node); @@ -376,6 +378,33 @@ static int vhost_worker(void *data) return 0; } +static void __vhost_dev_stop_work(struct vhost_dev *dev) +{ + mutex_lock(&dev->stop_work_mutex); + if (dev->work_stopped) + goto done; + + if (dev->stop_dev_wor
[RFC PATCH 4/8] vhost-net: Move vhost_net_open
This moves vhost_net_open so in the next patches we can pass vhost_dev_init a new helper which will use the stop/flush functions. There is no functionality changes in this patch. Signed-off-by: Mike Christie --- drivers/vhost/net.c | 134 ++-- 1 file changed, 67 insertions(+), 67 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 07181cd8d52e..8557072ff05e 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -1285,73 +1285,6 @@ static void handle_rx_net(struct vhost_work *work) handle_rx(net); } -static int vhost_net_open(struct inode *inode, struct file *f) -{ - struct vhost_net *n; - struct vhost_dev *dev; - struct vhost_virtqueue **vqs; - void **queue; - struct xdp_buff *xdp; - int i; - - n = kvmalloc(sizeof *n, GFP_KERNEL | __GFP_RETRY_MAYFAIL); - if (!n) - return -ENOMEM; - vqs = kmalloc_array(VHOST_NET_VQ_MAX, sizeof(*vqs), GFP_KERNEL); - if (!vqs) { - kvfree(n); - return -ENOMEM; - } - - queue = kmalloc_array(VHOST_NET_BATCH, sizeof(void *), - GFP_KERNEL); - if (!queue) { - kfree(vqs); - kvfree(n); - return -ENOMEM; - } - n->vqs[VHOST_NET_VQ_RX].rxq.queue = queue; - - xdp = kmalloc_array(VHOST_NET_BATCH, sizeof(*xdp), GFP_KERNEL); - if (!xdp) { - kfree(vqs); - kvfree(n); - kfree(queue); - return -ENOMEM; - } - n->vqs[VHOST_NET_VQ_TX].xdp = xdp; - - dev = &n->dev; - vqs[VHOST_NET_VQ_TX] = &n->vqs[VHOST_NET_VQ_TX].vq; - vqs[VHOST_NET_VQ_RX] = &n->vqs[VHOST_NET_VQ_RX].vq; - n->vqs[VHOST_NET_VQ_TX].vq.handle_kick = handle_tx_kick; - n->vqs[VHOST_NET_VQ_RX].vq.handle_kick = handle_rx_kick; - for (i = 0; i < VHOST_NET_VQ_MAX; i++) { - n->vqs[i].ubufs = NULL; - n->vqs[i].ubuf_info = NULL; - n->vqs[i].upend_idx = 0; - n->vqs[i].done_idx = 0; - n->vqs[i].batched_xdp = 0; - n->vqs[i].vhost_hlen = 0; - n->vqs[i].sock_hlen = 0; - n->vqs[i].rx_ring = NULL; - vhost_net_buf_init(&n->vqs[i].rxq); - } - vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX, - UIO_MAXIOV + VHOST_NET_BATCH, - VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true, - NULL); - - 
vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev); - vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev); - - f->private_data = n; - n->page_frag.page = NULL; - n->refcnt_bias = 0; - - return 0; -} - static struct socket *vhost_net_stop_vq(struct vhost_net *n, struct vhost_virtqueue *vq) { @@ -1421,6 +1354,73 @@ static int vhost_net_release(struct inode *inode, struct file *f) return 0; } +static int vhost_net_open(struct inode *inode, struct file *f) +{ + struct vhost_net *n; + struct vhost_dev *dev; + struct vhost_virtqueue **vqs; + void **queue; + struct xdp_buff *xdp; + int i; + + n = kvmalloc(sizeof *n, GFP_KERNEL | __GFP_RETRY_MAYFAIL); + if (!n) + return -ENOMEM; + vqs = kmalloc_array(VHOST_NET_VQ_MAX, sizeof(*vqs), GFP_KERNEL); + if (!vqs) { + kvfree(n); + return -ENOMEM; + } + + queue = kmalloc_array(VHOST_NET_BATCH, sizeof(void *), + GFP_KERNEL); + if (!queue) { + kfree(vqs); + kvfree(n); + return -ENOMEM; + } + n->vqs[VHOST_NET_VQ_RX].rxq.queue = queue; + + xdp = kmalloc_array(VHOST_NET_BATCH, sizeof(*xdp), GFP_KERNEL); + if (!xdp) { + kfree(vqs); + kvfree(n); + kfree(queue); + return -ENOMEM; + } + n->vqs[VHOST_NET_VQ_TX].xdp = xdp; + + dev = &n->dev; + vqs[VHOST_NET_VQ_TX] = &n->vqs[VHOST_NET_VQ_TX].vq; + vqs[VHOST_NET_VQ_RX] = &n->vqs[VHOST_NET_VQ_RX].vq; + n->vqs[VHOST_NET_VQ_TX].vq.handle_kick = handle_tx_kick; + n->vqs[VHOST_NET_VQ_RX].vq.handle_kick = handle_rx_kick; + for (i = 0; i < VHOST_NET_VQ_MAX; i++) { + n->vqs[i].ubufs = NULL; + n->vqs[i].ubuf_info = NULL; + n->vqs[i].upend_idx = 0; + n->vqs[i].done_idx = 0; + n->vqs[i].batched_xdp = 0; + n->vqs[i].vhost_hlen = 0; + n->vqs[i].sock_hlen = 0; + n->vqs[i].rx_ring = NULL; + vhost_net_buf_init(&n->vqs[i].rxq); + } + vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX, + UIO_MAXIOV + VHOST_NET_BATCH, + VHOST_N
[RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
This has us dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set when we are dealing with PF_USER_WORKER tasks. When a vhost_task gets a SIGKILL, we could have outstanding IO in flight. We can easily stop new work/IO from being queued to the vhost_task, but for IO that's already been sent to something like the block layer we need to wait for the response then process it. These types of IO completions use the vhost_task to process the completion so we can't exit immediately. We need to wait for and then handle those completions from the vhost_task, but when we have a SIGKILL pending, functions like schedule() return immediately so we can't wait like normal. Functions like vhost_worker() degrade to just a while(1); loop. This patch has get_signal drop down to the normal code path when SIGNAL_GROUP_EXIT/group_exec_task is set so the caller can still detect there is a SIGKILL but still perform some blocking cleanup. Note that the chunk I'm now bypassing does: sigdelset(&current->pending.signal, SIGKILL); we look to be ok, because in the places we set SIGNAL_GROUP_EXIT/ group_exec_task we are already doing that on the threads in the group. Signed-off-by: Mike Christie --- kernel/signal.c | 19 ++- 1 file changed, 14 insertions(+), 5 deletions(-) diff --git a/kernel/signal.c b/kernel/signal.c index 8f6330f0e9ca..ae4972eea5db 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2705,9 +2705,18 @@ bool get_signal(struct ksignal *ksig) struct k_sigaction *ka; enum pid_type type; - /* Has this task already been marked for death? */ - if ((signal->flags & SIGNAL_GROUP_EXIT) || -signal->group_exec_task) { + /* +* Has this task already been marked for death? +* +* If this is a PF_USER_WORKER then the task may need to do +* extra work that requires waiting on running work, so we want +* to dequeue the signal below and tell the caller its time to +* start its exit procedure. When the work has completed then +* the task will exit.
+*/ + if (!(current->flags & PF_USER_WORKER) && + ((signal->flags & SIGNAL_GROUP_EXIT) || +signal->group_exec_task)) { clear_siginfo(&ksig->info); ksig->info.si_signo = signr = SIGKILL; sigdelset(&current->pending.signal, SIGKILL); @@ -2861,11 +2870,11 @@ bool get_signal(struct ksignal *ksig) } /* -* PF_IO_WORKER threads will catch and exit on fatal signals +* PF_USER_WORKER threads will catch and exit on fatal signals * themselves. They have cleanup that must be performed, so * we cannot call do_exit() on their behalf. */ - if (current->flags & PF_IO_WORKER) + if (current->flags & PF_USER_WORKER) goto out; /* -- 2.25.1
[RFC PATCH 2/8] vhost/vhost_task: Hook vhost layer into signal handler
This patch has vhost use get_signal to handle freezing and sort of handle signals. By the latter I mean that when we get SIGKILL, our parent will exit and call our file_operations release function. That will then stop new work from being queued and wait for the vhost_task to handle completions for running IO. We then exit when those are done. The next patches will then have us work more like io_uring where we handle the get_signal return value and key off that to clean up. Signed-off-by: Mike Christie --- drivers/vhost/vhost.c | 10 +- include/linux/sched/vhost_task.h | 1 + kernel/vhost_task.c | 20 3 files changed, 30 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index a92af08e7864..1ba9e068b2ab 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -349,8 +349,16 @@ static int vhost_worker(void *data) } node = llist_del_all(&worker->work_list); - if (!node) + if (!node) { schedule(); + /* +* When we get a SIGKILL our release function will +* be called. That will stop new IOs from being queued +* and check for outstanding cmd responses. It will then +* call vhost_task_stop to exit us.
+*/ + vhost_task_get_signal(); + } node = llist_reverse_order(node); /* make sure flag is seen after deletion */ diff --git a/include/linux/sched/vhost_task.h b/include/linux/sched/vhost_task.h index 6123c10b99cf..54b68115eb3b 100644 --- a/include/linux/sched/vhost_task.h +++ b/include/linux/sched/vhost_task.h @@ -19,5 +19,6 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg, void vhost_task_start(struct vhost_task *vtsk); void vhost_task_stop(struct vhost_task *vtsk); bool vhost_task_should_stop(struct vhost_task *vtsk); +bool vhost_task_get_signal(void); #endif diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c index b7cbd66f889e..a661cfa32ba3 100644 --- a/kernel/vhost_task.c +++ b/kernel/vhost_task.c @@ -61,6 +61,26 @@ bool vhost_task_should_stop(struct vhost_task *vtsk) } EXPORT_SYMBOL_GPL(vhost_task_should_stop); +/** + * vhost_task_get_signal - Check if there are pending signals + * + * Return true if we got SIGKILL. + */ +bool vhost_task_get_signal(void) +{ + struct ksignal ksig; + bool rc; + + if (!signal_pending(current)) + return false; + + __set_current_state(TASK_RUNNING); + rc = get_signal(&ksig); + set_current_state(TASK_INTERRUPTIBLE); + return rc; +} +EXPORT_SYMBOL_GPL(vhost_task_get_signal); + /** * vhost_task_create - create a copy of a process to be used by the kernel * @fn: thread stack -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[RFC PATCH 3/8] fork/vhost_task: Switch to CLONE_THREAD and CLONE_SIGHAND
This is a modified version of Linus's patch which has vhost_task use CLONE_THREAD and CLONE_SIGHAND and allow SIGKILL and SIGSTOP. I renamed the ignore_signals to block_signals based on Linus's comment where it aligns with what we are doing with the siginitsetinv p->blocked use and no longer calling ignore_signals. Signed-off-by: Mike Christie --- include/linux/sched/task.h | 2 +- kernel/fork.c | 12 +++- kernel/vhost_task.c| 5 +++-- 3 files changed, 7 insertions(+), 12 deletions(-) diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index 537cbf9a2ade..249a5ece9def 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -29,7 +29,7 @@ struct kernel_clone_args { u32 io_thread:1; u32 user_worker:1; u32 no_files:1; - u32 ignore_signals:1; + u32 block_signals:1; unsigned long stack; unsigned long stack_size; unsigned long tls; diff --git a/kernel/fork.c b/kernel/fork.c index ed4e01daccaa..9e04ab5c3946 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2338,14 +2338,10 @@ __latent_entropy struct task_struct *copy_process( p->flags |= PF_KTHREAD; if (args->user_worker) p->flags |= PF_USER_WORKER; - if (args->io_thread) { - /* -* Mark us an IO worker, and block any signal that isn't -* fatal or STOP -*/ + if (args->io_thread) p->flags |= PF_IO_WORKER; + if (args->block_signals) siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP)); - } if (args->name) strscpy_pad(p->comm, args->name, sizeof(p->comm)); @@ -2517,9 +2513,6 @@ __latent_entropy struct task_struct *copy_process( if (retval) goto bad_fork_cleanup_io; - if (args->ignore_signals) - ignore_signals(p); - stackleak_task_init(p); if (pid != &init_struct_pid) { @@ -2861,6 +2854,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node) .fn_arg = arg, .io_thread = 1, .user_worker= 1, + .block_signals = 1, }; return copy_process(NULL, 0, node, &args); diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c index a661cfa32ba3..a11f036290cc 100644 --- 
a/kernel/vhost_task.c +++ b/kernel/vhost_task.c @@ -95,13 +95,14 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg, const char *name) { struct kernel_clone_args args = { - .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM, + .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM | + CLONE_THREAD | CLONE_SIGHAND, .exit_signal= 0, .fn = vhost_task_fn, .name = name, .user_worker= 1, .no_files = 1, - .ignore_signals = 1, + .block_signals = 1, }; struct vhost_task *vtsk; struct task_struct *tsk; -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND
This patch allows the vhost and vhost_task code to use CLONE_THREAD, CLONE_SIGHAND and CLONE_FILES. It's an RFC because I didn't do all the normal testing, haven't converted vsock and vdpa, and I know you guys will not like the first patch. However, I think it better shows what we need from the signal code and how we can support signals in the vhost_task layer. Note that I took the super simple route and kicked off some work to the system workqueue. We can do more invasive approaches: 1. Modify the vhost drivers so they can check for IO completions using a non-blocking interface. We then don't need to run from the system workqueue and can run from the vhost_task. 2. We could drop patch 1 and just say we are doing a polling type of approach. We then modify the vhost layer similar to #1 where we can check for completions using a non-blocking interface and use the vhost_task.
Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On 5/17/23 12:09 PM, Oleg Nesterov wrote: > IIUC, Mike is going to send the next version? So I think we can delay > the further discussions until then. Yeah, I'm working on a version that supports signals so it will be easier to discuss with the vhost devs and you, Christian and Eric.
Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On 05/16, Eric W. Biederman wrote: > > Oleg Nesterov writes: > > >> There is this bit in complete_signal when SIGKILL is delivered to any > >> thread in the process. > >> > >>t = p; > >>do { > >>task_clear_jobctl_pending(t, > >> JOBCTL_PENDING_MASK); > >>sigaddset(&t->pending.signal, SIGKILL); > >>signal_wake_up(t, 1); > >>} while_each_thread(p, t); > > > > That is why the latest version adds try_set_pending_sigkill(). No, no, > > it is not that I think this is a good idea. > > I see that try_set_pending_sigkill in the patch now. > > That try_set_pending_sigkill just keeps the process from reporting > that it has exited, and extend the process exit indefinitely. > > SIGNAL_GROUP_EXIT has already been set, so the KILL signal was > already delivered and the process is exiting. Agreed, that is why I said I don't think try_set_pending_sigkill() is a good idea. And again, the same is true for the threads created by create_io_thread(). get_signal() from io_uring/ can dequeue a pending SIGKILL and return, but that is all. > >> For clarity that sigaddset(&t->pending.signal, SIGKILL); Really isn't > >> setting SIGKILL pending, > > > > Hmm. it does? Nevermind. > > The point is that what try_set_pending_sigkill in the patch is doing is > keeping the "you are dead exit now" flag, from being set. > > That flag is what fatal_signal_pending always tests, because we can only > know if a fatal signal is pending if we have performed short circuit > delivery on the signal. > > The result is the effects of the change are mostly what people expect. > The difference the semantics being changed aren't what people think they > are. > > AKA process exit is being ignored for the thread, not that SIGKILL is > being blocked. Sorry, I don't understand. I just tried to say that sigaddset(&t->pending.signal, SIGKILL) really sets SIGKILL pending. Nevermind. > > Although I never understood this logic. 
I meant I never really liked how io-threads play with signals, > I can't even understand the usage > > of lower_32_bits() in create_io_thread(). > > As far as I can tell lower_32_bits(flags) is just defensive programming Cough. but this is ugly. Or I missed something. > or have just populated .flags directly. Exactly, > Then .exit_signal > could have been set to 0. Exactly. --- OK. It doesn't matter. I tried to read the whole thread and got lost. IIUC, Mike is going to send the next version? So I think we can delay the further discussions until then. Oleg. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 0/7] riscv: Memory Hot(Un)Plug support
On 12.05.23 16:57, Björn Töpel wrote: From: Björn Töpel Memory Hot(Un)Plug support for the RISC-V port == Introduction To quote "Documentation/admin-guide/mm/memory-hotplug.rst": "Memory hot(un)plug allows for increasing and decreasing the size of physical memory available to a machine at runtime." This series attempts to add memory hot(un)plug support for the RISC-V Linux port. I'm sending the series as a v1, but it's borderline RFC. It definitely needs more testing time, but it would be nice with some early input. Implementation -- From an arch perspective, a couple of callbacks need to be implemented to support hot plugging: arch_add_memory() This callback is responsible for updating the linear/direct map, and for calling into the memory hot plugging generic code via __add_pages(). arch_remove_memory() In this callback the linear/direct map is torn down. vmemmap_free() The function tears down the vmemmap mappings (if CONFIG_SPARSEMEM_VMEMMAP is in-use), and also deallocates the backing vmemmap pages. Note that for persistent memory, an alternative allocator for the backing pages can be used -- the vmem_altmap. This means that when the backing pages are cleared, extra care is needed so that the correct deallocation method is used. Note that RISC-V populates the vmemmap using vmemmap_populate_basepages(), so currently no hugepages are used for the backing store. The page table unmap/teardown functions are heavily based (copied!) from the x86 tree. The same remove_pgd_mapping() is used in both vmemmap_free() and arch_remove_memory(), but in the latter function the backing pages are not removed. On RISC-V, the PGD level kernel mappings need to be synchronized with all page-tables (e.g. via sync_kernel_mappings()). Synchronization involves special care, like locking.
Instead, this patch series takes a different approach (introduced by Jörg Rödel in the x86-tree); Pre-allocate the PGD-leaves (P4D, PUD, or PMD depending on the paging setup) at mem_init(), for vmemmap and the direct map. Pre-allocating the PGD-leaves wastes some memory, but is only enabled for CONFIG_MEMORY_HOTPLUG. The number of pages, potentially unused, is ~128 * 4K. Patch 1: Preparation for hotplugging support, by pre-allocating the PGD leaves. Patch 2: Changes the __init attribute to __meminit, so that the functions are not removed after init. __meminit keeps the functions after init, if memory hotplugging is enabled for the build. Patch 3: Refactor the direct map setup, so it can be used for hot add. Patch 4: The actual add/remove code. Mostly a page-table-walk exercise. Patch 5: Turn on the arch support in Kconfig Patch 6: Now that memory hotplugging is enabled, make virtio-mem usable for RISC-V Patch 7: Pre-allocate vmalloc PGD-leaves as well, which removes the need for vmalloc faulting. RFC --- * TLB flushes. The current series uses Big Hammer flush-it-all. * Pre-allocation vs explicit syncs Testing --- ACPI support is still in the making for RISC-V, so tests that involve CXL and similar fanciness are currently not possible. Virtio-mem, however, works without proper ACPI support. In order to try this out in Qemu, some additional patches for Qemu are needed: * Enable virtio-mem for RISC-V * Add proper hotplug support for virtio-mem The patch for Qemu is commit 5d90a7ef1bc0 ("hw/riscv/virt: Support for virtio-mem-pci"), and can be found here https://github.com/bjoto/qemu/tree/riscv-virtio-mem I will try to upstream that work in parallel with this. Thanks to David Hildenbrand for valuable input for the Qemu side of things. The series is based on the RISC-V fixes tree https://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux.git/log/?h=fixes Cool stuff!
I'm fairly busy right now, so some high-level questions upfront: What is the memory section size (which implies the memory block size)? This implies the minimum DIMM granularity and the high-level granularity in which virtio-mem adds memory. What is the pageblock size, implying the minimum granularity that virtio-mem can operate on? On x86-64 and arm64 we currently use the ACPI SRAT to expose the maximum physical address where we can see memory getting hotplugged. [1] From that, we can derive the "max_possible_pfn" and prepare the kernel virtual memory layout (especially, direct map). Is something similar required on RISC-V? On s390x, I'm planning on adding a paravirtualized mechanism to detect where memory devices might be located. (I had a running RFC, but was distracted by all other kinds of stuff) [1] https://virtio-mem.gitlab.io/developer-guide.html -- Thanks, David / dhildenb
Re: Re: [PATCH v2 1/2] virtio: abstract virtqueue related methods
On Wed, May 17, 2023 at 04:35:55PM +0800, zhenwei pi wrote: > > > On 5/17/23 15:46, Christoph Hellwig wrote: > > On Wed, May 17, 2023 at 03:43:03PM +0800, zhenwei pi wrote: > > > I have a plan to introduce 'Virtio Over Fabrics'(TCP&RDMA) as Virtio > > > transport, as mentioned in cover letter of this series: > > > 3 weeks ago, I posted a proposal 'Virtio Over Fabrics': > > > https://lists.oasis-open.org/archives/virtio-comment/202304/msg00442.html > > > > Just don't do it. Please define your own protocols over RDMA or TCP > > for exactly the operations you need (for many they will already exist) > > instead of piggyg backing on virtio and making everyone else pay the > > price. > > > > Hi > > 1, `virtqueue_add_inbuf` in current version: > static inline int virtqueue_add_inbuf(struct virtqueue *vq, > struct scatterlist *sg, > unsigned int num, > void *data, > gfp_t gfp) > { > if (likely(!vq->abstract)) > return vring_virtqueue_add_sgs(vq, &sg, num, 0, 1, data, > NULL, gfp); > > return vq->add_sgs(vq, &sg, num, 0, 1, data, NULL, gfp); > } > > And disassemble 'virtinput_queue_evtbuf': > static void virtinput_queue_evtbuf(struct virtio_input *vi, >struct virtio_input_event *evtbuf) > { > struct scatterlist sg[1]; > > sg_init_one(sg, evtbuf, sizeof(*evtbuf)); > virtqueue_add_inbuf(vi->evt, sg, 1, evtbuf, GFP_ATOMIC); > } > > I notice that two instructions are newly added for vring like: > 24d: 80 78 35 00 cmpb $0x0,0x35(%rax) > 251: 75 3f jne292 > > Is it an expensive price... Can we somehow only override the kick method? Then take the ring and send it over ... > 2, Storage/FS specific remote protocol is quite popular, otherwise I'm not > familiar with other device protocols. For example, I need a remote crypto > device to accelerate HTTPS ... With Virtio Over Fabrics, I have a chance to > attach a virtio-crypto device to do this work. 
> > -- > zhenwei pi
Re: Re: Re: Re: [PATCH v2 0/2] virtio: abstract virtqueue related methods
On Wed, May 17, 2023 at 02:21:09PM +0800, zhenwei pi wrote: > On 5/17/23 14:10, Michael S. Tsirkin wrote: > > On Wed, May 17, 2023 at 12:58:10PM +0800, zhenwei pi wrote: > > > On 5/17/23 11:57, Michael S. Tsirkin wrote: > > > > On Wed, May 17, 2023 at 11:51:03AM +0800, zhenwei pi wrote: > > > > > > > > > > > > > > > On 5/17/23 11:46, Michael S. Tsirkin wrote: > > > > > > On Wed, May 17, 2023 at 10:54:22AM +0800, zhenwei pi wrote: > > > > > > > v1 -> v2: > > > > > > > - Suggested by MST, use fast path for vring based performance > > > > > > > sensitive API. > > > > > > > - Reduce changes in tools/virtio. > > > > > > > > > > > > > > Add test result(no obvious change): > > > > > > > Before: > > > > > > > time ./vringh_test --parallel > > > > > > > Using CPUS 0 and 191 > > > > > > > Guest: notified 10036893, pinged 68278 > > > > > > > Host: notified 68278, pinged 3093532 > > > > > > > > > > > > > > real 0m14.463s > > > > > > > user 0m6.437s > > > > > > > sys 0m8.010s > > > > > > > > > > > > > > After: > > > > > > > time ./vringh_test --parallel > > > > > > > Using CPUS 0 and 191 > > > > > > > Guest: notified 10036709, pinged 68347 > > > > > > > Host: notified 68347, pinged 3085292 > > > > > > > > > > > > > > real 0m14.196s > > > > > > > user 0m6.289s > > > > > > > sys 0m7.885s > > > > > > > > > > > > > > v1: > > > > > > > Hi, > > > > > > > > > > > > > > 3 weeks ago, I posted a proposal 'Virtio Over Fabrics': > > > > > > > https://lists.oasis-open.org/archives/virtio-comment/202304/msg00442.html > > > > > > > > > > > > > > Jason and Stefan pointed out that a non-vring based virtqueue has > > > > > > > a > > > > > > > chance to overwrite virtqueue instead of using vring virtqueue. > > > > > > > > > > > > > > Then I try to abstract virtqueue related methods in this series, > > > > > > > the > > > > > > > details changes see the comment of patch 'virtio: abstract > > > > > > > virtqueue related methods'. 
> > > > > > > > > > > > > > Something is still remained: > > > > > > > - __virtqueue_break/__virtqueue_unbreak is supposed to use by > > > > > > > internal > > > > > > > virtio core, I'd like to rename them to vring_virtqueue_break > > > > > > > /vring_virtqueue_unbreak. Is this reasonable? > > > > > > > > > > > > Why? These just set a flag? > > > > > > > > > > > > > > > > Rename '__virtqueue_break' to 'vring_virtqueue_break', to make symbols > > > > > exported from virtio_ring.ko have unified prefix > > > > > 'vring_virtqueue_xxx'. > > > > > > > > I just do not see why you need these callbacks at all. > > > > > > > > > > I use these callbacks for break/unbreak device like: > > > static inline void virtio_break_device(struct virtio_device *dev) > > > { > > > struct virtqueue *vq; > > > > > > spin_lock(&dev->vqs_list_lock); > > > list_for_each_entry(vq, &dev->vqs, list) { > > > vq->__break(vq); > > > } > > > spin_unlock(&dev->vqs_list_lock); > > > } > > > > why do this? backend knows they are broken. > > > > I grep 'virtio_break_device' in the latest code: > arch/um/drivers/virtio_uml.c:1147:virtio_break_device(&vu_dev->vdev); > arch/um/drivers/virtio_uml.c:1285:virtio_break_device(&vu_dev->vdev); > drivers/crypto/virtio/virtio_crypto_core.c:269: > virtio_break_device(vcrypto->vdev); > drivers/s390/virtio/virtio_ccw.c:1251: > virtio_break_device(&vcdev->vdev); > drivers/s390/virtio/virtio_ccw.c:1268: > virtio_break_device(&vcdev->vdev); > drivers/firmware/arm_scmi/virtio.c:489: > virtio_break_device(vioch->vqueue->vdev); > drivers/char/virtio_console.c:1956: virtio_break_device(vdev); > > Some virtio drivers use 'virtio_break_device'... You should read the code and understand what it does, not just grep things and make assumptions. What virtio_break_device does is stop linux from sending new requests. 
> > > > > > > - virtqueue_get_desc_addr/virtqueue_get_avail_addr/virtqueue_get_used_addr
> > > > > > >   /virtqueue_get_vring is vring specific, I'd like to rename them like
> > > > > > >   vring_virtqueue_get_desc_addr. Is this reasonable?
> > > > > > > - there are still some functions in virtio_ring.c with prefix *virtqueue*,
> > > > > > >   for example 'virtqueue_add_split', just keep it or rename it to
> > > > > > >   'vring_virtqueue_add_split'?
> > > > > > >
> > > > > > > zhenwei pi (2):
> > > > > > >   virtio: abstract virtqueue related methods
> > > > > > >   tools/virtio: implement virtqueue in test
> > > > > > >
> > > > > > >  drivers/virtio/virtio_ring.c | 285 +-
> > > > > > >  include/linux/virtio.h       | 441 +++
> > > > > > >  include/linux/virtio_ring.h  |  26 +++
> > > > > > >  tools/virtio/linux/virtio.h  | 355 +---
> > > > > > >  4 files changed, 807 insertions(+), 300 deletions(-)
Re: Re: [PATCH v2 1/2] virtio: abstract virtqueue related methods
On 5/17/23 15:46, Christoph Hellwig wrote:
> On Wed, May 17, 2023 at 03:43:03PM +0800, zhenwei pi wrote:
> > I have a plan to introduce 'Virtio Over Fabrics'(TCP&RDMA) as Virtio
> > transport, as mentioned in cover letter of this series:
> > 3 weeks ago, I posted a proposal 'Virtio Over Fabrics':
> > https://lists.oasis-open.org/archives/virtio-comment/202304/msg00442.html
>
> Just don't do it.  Please define your own protocols over RDMA or TCP for
> exactly the operations you need (for many they will already exist)
> instead of piggybacking on virtio and making everyone else pay the price.

Hi

1, `virtqueue_add_inbuf` in current version:

static inline int virtqueue_add_inbuf(struct virtqueue *vq,
                                      struct scatterlist *sg, unsigned int num,
                                      void *data, gfp_t gfp)
{
        if (likely(!vq->abstract))
                return vring_virtqueue_add_sgs(vq, &sg, num, 0, 1, data, NULL, gfp);

        return vq->add_sgs(vq, &sg, num, 0, 1, data, NULL, gfp);
}

And disassemble 'virtinput_queue_evtbuf':

static void virtinput_queue_evtbuf(struct virtio_input *vi,
                                   struct virtio_input_event *evtbuf)
{
        struct scatterlist sg[1];

        sg_init_one(sg, evtbuf, sizeof(*evtbuf));
        virtqueue_add_inbuf(vi->evt, sg, 1, evtbuf, GFP_ATOMIC);
}

I notice that two instructions are newly added for vring like:
  24d:  80 78 35 00   cmpb $0x0,0x35(%rax)
  251:  75 3f         jne  292

Is it an expensive price...

2, Storage/FS specific remote protocol is quite popular, otherwise I'm not
familiar with other device protocols. For example, I need a remote crypto
device to accelerate HTTPS... With Virtio Over Fabrics, I have a chance to
attach a virtio-crypto device to do this work.

--
zhenwei pi
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 1/2] virtio: abstract virtqueue related methods
On Wed, May 17, 2023 at 03:43:03PM +0800, zhenwei pi wrote:
> I have a plan to introduce 'Virtio Over Fabrics'(TCP&RDMA) as Virtio
> transport, as mentioned in cover letter of this series:
> 3 weeks ago, I posted a proposal 'Virtio Over Fabrics':
> https://lists.oasis-open.org/archives/virtio-comment/202304/msg00442.html

Just don't do it.  Please define your own protocols over RDMA or TCP for
exactly the operations you need (for many they will already exist)
instead of piggybacking on virtio and making everyone else pay the price.
Re: Re: [PATCH v2 1/2] virtio: abstract virtqueue related methods
On 5/17/23 15:39, Christoph Hellwig wrote:
> On Wed, May 17, 2023 at 10:54:23AM +0800, zhenwei pi wrote:
> > All the vring based virtqueue methods could be abstracted in theory,
> > MST suggested that add/get bufs and kick functions are quite performance
> > sensitive, so export these functions from virtio_ring.ko, drivers
> > still call them in a fast path.
>
> Who is going to use this?  And why do you think every virtio user would
> want to pay for indirect calls just for your shiny new feature?
>
> This seems like an amazingly bad idea to me.

Hi, I have a plan to introduce 'Virtio Over Fabrics'(TCP&RDMA) as Virtio
transport, as mentioned in cover letter of this series:

3 weeks ago, I posted a proposal 'Virtio Over Fabrics':
https://lists.oasis-open.org/archives/virtio-comment/202304/msg00442.html

Jason and Stefan pointed out that a non-vring based virtqueue has a
chance to overwrite virtqueue instead of using vring virtqueue.

--
zhenwei pi
Re: [PATCH v2 1/2] virtio: abstract virtqueue related methods
On Wed, May 17, 2023 at 10:54:23AM +0800, zhenwei pi wrote:
> All the vring based virtqueue methods could be abstracted in theory,
> MST suggested that add/get bufs and kick functions are quite performance
> sensitive, so export these functions from virtio_ring.ko, drivers
> still call them in a fast path.

Who is going to use this?  And why do you think every virtio user would
want to pay for indirect calls just for your shiny new feature?

This seems like an amazingly bad idea to me.
Re: [PATCH v2 2/3] vhost: support PACKED when setting-getting vring_base
On Wed, May 17, 2023 at 7:26 AM Jason Wang wrote:
>
> On Wed, May 17, 2023 at 2:26 AM Shannon Nelson wrote:
> >
> > On 5/16/23 12:49 AM, Stefano Garzarella wrote:
> > > On Mon, May 15, 2023 at 01:41:12PM -0700, Shannon Nelson wrote:
> > >> On 5/9/23 1:46 AM, Stefano Garzarella wrote:
> > >>> On Mon, Apr 24, 2023 at 03:50:30PM -0700, Shannon Nelson via
> > >>> Virtualization wrote:
> > Use the right structs for PACKED or split vqs when setting and
> > getting the vring base.
> >
> > Signed-off-by: Shannon Nelson
> > ---
> >  drivers/vhost/vhost.c | 18 +-
> >  drivers/vhost/vhost.h |  8 ++--
> >  2 files changed, 19 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index f11bdbe4c2c5..f64efda48f21 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -1633,17 +1633,25 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
> >                         r = -EFAULT;
> >                         break;
> >                 }
> > -               if (s.num > 0xffff) {
> > -                       r = -EINVAL;
> > -                       break;
> > +               if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED)) {
> > +                       vq->last_avail_idx = s.num & 0xffff;
> > +                       vq->last_used_idx = (s.num >> 16) & 0xffff;
> > +               } else {
> > +                       if (s.num > 0xffff) {
> > +                               r = -EINVAL;
> > +                               break;
> > +                       }
> > +                       vq->last_avail_idx = s.num;
> >                 }
> > -               vq->last_avail_idx = s.num;
> >                 /* Forget the cached index value. */
> >                 vq->avail_idx = vq->last_avail_idx;
> >                 break;
> >         case VHOST_GET_VRING_BASE:
> >                 s.index = idx;
> > -               s.num = vq->last_avail_idx;
> > +               if (vhost_has_feature(vq, VIRTIO_F_RING_PACKED))
> > +                       s.num = (u32)vq->last_avail_idx |
> > +                               ((u32)vq->last_used_idx << 16);
> > +               else
> > +                       s.num = vq->last_avail_idx;
> > >>>
> > >>> The changes LGTM, but since we are changing the UAPI, should we
> > >>> update the documentation of VHOST_SET_VRING_BASE and
> > >>> VHOST_GET_VRING_BASE in include/uapi/linux/vhost.h?
> > >>
> > >> Correct me if I'm wrong, but I don't think we're changing anything in
> > >> the UAPI here, just fixing code to work correctly with what is already
> > >> happening.
> > >
> > > IIUC before this patch VHOST_GET_VRING_BASE and VHOST_SET_VRING_BASE
> > > never worked with packed virtqueue, since we were only handling
> > > last_avail_idx. Now we are supporting packed virtqueue, handling
> > > in vhost_vring_state.num both last_avail_idx and last_used_idx (with
> > > wrap counters).
> > >
> > > For example for VHOST_GET_VRING_BASE where is documented that the first
> > > 15 bits are last_avail_idx, the 16th the avail_wrap_counter, and the
> > > others are last_used_idx and used_wrap_counter?
> > >
> > > Maybe I missed something, but since this is UAPI, IMHO we should
> > > document the parameters of ioctls at least in
> > > include/uapi/linux/vhost.h.
> >
> > Perhaps Jason already has something written up that could be put in here
> > from when he first added the wrap_counter a couple of years ago?
>
> If you meant the virtio driver support for packed, I think it's
> different from the context which is vhost here.
>
> I agree with Stefano that we need to update the comments around
> GET_VRING_BASE and SET_VRING_BASE, then we are fine.

I'm thinking if we should also add a new VHOST_BACKEND_F_RING_PACKED
feature (or something similar) to inform the user space that now we are
able to handle packed virtqueue through vhost IOCTLs, otherwise how can
the userspace know if it is supported or not?

Thanks,
Stefano