Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads
On 8/10/23 1:57 PM, Michael S. Tsirkin wrote: > On Sat, Jul 22, 2023 at 11:03:29PM -0500, michael.chris...@oracle.com wrote: >> On 7/20/23 8:06 AM, Michael S. Tsirkin wrote: >>> On Thu, Feb 02, 2023 at 05:25:17PM -0600, Mike Christie wrote: >>>> For vhost workers we use the kthread API which inherit's its values from >>>> and checks against the kthreadd thread. This results in the wrong RLIMITs >>>> being checked, so while tools like libvirt try to control the number of >>>> threads based on the nproc rlimit setting we can end up creating more >>>> threads than the user wanted. >>>> >>>> This patch has us use the vhost_task helpers which will inherit its >>>> values/checks from the thread that owns the device similar to if we did >>>> a clone in userspace. The vhost threads will now be counted in the nproc >>>> rlimits. And we get features like cgroups and mm sharing automatically, >>>> so we can remove those calls. >>>> >>>> Signed-off-by: Mike Christie >>>> Acked-by: Michael S. Tsirkin >>> >>> >>> Hi Mike, >>> So this seems to have caused a measureable regression in networking >>> performance (about 30%). Take a look here, and there's a zip file >>> with detailed measuraments attached: >>> >>> https://bugzilla.redhat.com/show_bug.cgi?id=603 >>> >>> >>> Could you take a look please? >>> You can also ask reporter questions there assuming you >>> have or can create a (free) account. >>> >> >> Sorry for the late reply. I just got home from vacation. >> >> The account creation link seems to be down. I keep getting a >> "unable to establish SMTP connection to bz-exim-prod port 25 " error. >> >> Can you give me Quan's email? >> >> I think I can replicate the problem. I just need some extra info from Quan: >> >> 1. Just double check that they are using RHEL 9 on the host running the VMs. >> 2. The kernel config >> 3. Any tuning that was done. Is tuned running in guest and/or host running >> the >> VMs and what profile is being used in each. >> 4. Number of vCPUs and virtqueues being used. >> 5. Can they dump the contents of: >> >> /sys/kernel/debug/sched >> >> and >> >> sysctl -a >> >> on the host running the VMs. >> >> 6. With the 6.4 kernel, can they also run a quick test and tell me if they >> set >> the scheduler to batch: >> >> ps -T -o comm,pid,tid $QEMU_THREAD >> >> then for each vhost thread do: >> >> chrt -b -p 0 $VHOST_THREAD >> >> Does that end up increasing perf? When I do this I see throughput go up by >> around 50% vs 6.3 when sessions was 16 or more (16 was the number of vCPUs >> and virtqueues per net device in the VM). Note that I'm not saying that is a >> fix. >> It's just a difference I noticed when running some other tests. > > > Mike I'm unsure what to do at this point. Regressions are not nice > but if the kernel is released with the new userspace api we won't > be able to revert. So what's the plan? > I'm sort of stumped. I still can't replicate the problem out of the box. 6.3 and 6.4 perform the same for me. I've tried your setup and settings and with different combos of using things like tuned and irqbalance. I can sort of force the issue. In 6.4, the vhost thread inherits it's settings from the parent thread. In 6.3, the vhost thread inherits from kthreadd and we would then reset the sched settings. So in 6.4 if I just tune the parent differently I can cause different performance. If we want the 6.3 behavior we can do the patch below. However, I don't think you guys are hitting this because you are just running qemu from the normal shell and were not doing anything fancy with the sched settings. diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c index da35e5b7f047..f2c2638d1106 100644 --- a/kernel/vhost_task.c +++ b/kernel/vhost_task.c @@ -2,6 +2,7 @@ /* * Copyright (C) 2021 Oracle Corporation */ +#include #include #include #include @@ -22,9 +23,16 @@ struct vhost_task { static int vhost_task_fn(void *data) { + static const struct sched_param param = { .sched_priority = 0 }; struct vhost_task *vtsk = data; bool dead = false; + /* +* Don't inherit the parent's sched info, so we maintain compat from +* when we used kthreads and it reset this info. +*/ + sched_setscheduler_nocheck(current, SCHED_NORMAL, ¶m); + for (;;) { bool did_work; ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 1/1] MAINTAINERS: add vhost-scsi entry and myself as a co-maintainer
I've been doing a lot of the development on vhost-scsi the last couple of years, so per Michael T's suggestion this adds me as co-maintainer. Signed-off-by: Mike Christie --- MAINTAINERS | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index 3be1bdfe8ecc..2c4a8a860ae0 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -22458,7 +22458,6 @@ L: virtualization@lists.linux-foundation.org S: Maintained F: drivers/block/virtio_blk.c F: drivers/scsi/virtio_scsi.c -F: drivers/vhost/scsi.c F: include/uapi/linux/virtio_blk.h F: include/uapi/linux/virtio_scsi.h @@ -22557,6 +22556,16 @@ F: include/linux/vhost_iotlb.h F: include/uapi/linux/vhost.h F: kernel/vhost_task.c +VIRTIO HOST (VHOST-SCSI) +M: "Michael S. Tsirkin" +M: Jason Wang +M: Mike Christie +R: Paolo Bonzini +R: Stefan Hajnoczi +L: virtualization@lists.linux-foundation.org +S: Maintained +F: drivers/vhost/scsi.c + VIRTIO I2C DRIVER M: Conghui Chen M: Viresh Kumar -- 2.34.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 0/2] vhost-scsi: Fix IO hangs when using windows
On 7/12/23 9:26 AM, Stefan Hajnoczi wrote: > On Tue, Jul 11, 2023 at 04:01:22PM -0500, Mike Christie wrote: >> On 7/11/23 1:34 PM, Stefan Hajnoczi wrote: >>> On Sun, Jul 09, 2023 at 03:28:57PM -0500, Mike Christie wrote: >>>> The following patches were made over Linus's tree and fix an issue >>>> where windows guests will send iovecs with offset/lengths that result >>>> in IOs that are not aligned to 512. The LIO layer will then send them >>>> to Linux's FS/block layer but it requires 512 byte alignment, so >>>> depending on the FS/block driver being used we will get IO errors or >>>> hung IO. >>>> >>>> The following patches have vhost-scsi detect when windows sends these >>>> IOs and copy them to a bounce buffer. It then does some cleanup in >>>> the related code. >>> >>> Hang on, virtio-scsi is a SCSI HBA and READs/WRITEs submitted must >>> follow the usual constraints on SCSI block limits. Would Windows send >>> mis-aligned I/O to a non-virtio-scsi SCSI HBA? >> >> It's like linux where you can config settings like that. >> >>>> Are you sure this is not a bug in the Windows guest driver where block >>> limits are being misconfigured? >> >> From what our windows dev told us the guest drivers like here: >> >> https://github.com/virtio-win >> >> don't set the windows AlignmentMask to 512. They tried that and it >> resulted in windows crash dump crashing because it doesn't like the >> hard alignment requirement. >> >> We thought other apps would have trouble as well, so we tried to add >> bounce buffer support to the windows driver, but I think people thought >> it was going to be uglier than this patch and in the normal alignment >> case might also affect performance. There was some windows driver/layering >> and buffer/cmd details that I don't fully understand and took their word >> for because I don't know a lot about windows. >> >> In the end we still have to add checks to vhost-scsi to protect against >> bad drivers, so we thought we might as well just add bounce buffer support >> to vhost-scsi. > > CCing virtio-win developers so they can confirm how the vioscsi driver > is supposed to handle request alignment. > > My expectation is that the virtio-scsi device will fail mis-aligned I/O > requests. I don't think you can just change the driver's behavior to fail now, because apps send mis-aligned IO and its working as long as they have less than 256 bio vecs. We see mis-aligned IOs during boot and also from random non window's apps. If we just start to fail then it would be a regression when the app no longer works or the OS fails to start up. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 0/2] vhost-scsi: Fix IO hangs when using windows
On 7/11/23 1:34 PM, Stefan Hajnoczi wrote: > On Sun, Jul 09, 2023 at 03:28:57PM -0500, Mike Christie wrote: >> The following patches were made over Linus's tree and fix an issue >> where windows guests will send iovecs with offset/lengths that result >> in IOs that are not aligned to 512. The LIO layer will then send them >> to Linux's FS/block layer but it requires 512 byte alignment, so >> depending on the FS/block driver being used we will get IO errors or >> hung IO. >> >> The following patches have vhost-scsi detect when windows sends these >> IOs and copy them to a bounce buffer. It then does some cleanup in >> the related code. > > Hang on, virtio-scsi is a SCSI HBA and READs/WRITEs submitted must > follow the usual constraints on SCSI block limits. Would Windows send > mis-aligned I/O to a non-virtio-scsi SCSI HBA? It's like linux where you can config settings like that. > > Are you sure this is not a bug in the Windows guest driver where block > limits are being misconfigured? >From what our windows dev told us the guest drivers like here: https://github.com/virtio-win don't set the windows AlignmentMask to 512. They tried that and it resulted in windows crash dump crashing because it doesn't like the hard alignment requirement. We thought other apps would have trouble as well, so we tried to add bounce buffer support to the windows driver, but I think people thought it was going to be uglier than this patch and in the normal alignment case might also affect performance. There was some windows driver/layering and buffer/cmd details that I don't fully understand and took their word for because I don't know a lot about windows. In the end we still have to add checks to vhost-scsi to protect against bad drivers, so we thought we might as well just add bounce buffer support to vhost-scsi. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] Revert "virtio-scsi: Send "REPORTED LUNS CHANGED" sense data upon disk hotplug events"
What was the issue you are seeing? Was it something like you get the UA. We retry then on one of the retries the sense is not setup correctly, so the scsi error handler runs? That fails and the device goes offline? If you turn on scsi debugging you would see: [ 335.445922] sd 0:0:0:0: [sda] tag#15 Add. Sense: Reported luns data has changed [ 335.445922] sd 0:0:0:0: [sda] tag#16 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 335.445925] sd 0:0:0:0: [sda] tag#16 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 335.445929] sd 0:0:0:0: [sda] tag#17 Done: FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s [ 335.445932] sd 0:0:0:0: [sda] tag#17 CDB: Write(10) 2a 00 00 db 4f c0 00 00 20 00 [ 335.445934] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 335.445936] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 335.445938] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 335.445940] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 335.445942] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 335.445945] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 335.451447] scsi host0: scsi_eh_0: waking up 0/2/2 [ 335.451453] scsi host0: Total of 2 commands on 1 devices require eh work [ 335.451457] sd 0:0:0:0: [sda] tag#16 scsi_eh_0: requesting sense I don't know the qemu scsi code well, but I scanned the code for my co-worker and my guess was commit 8cc5583abe6419e7faaebc9fbd109f34f4c850f2 had a race in it. How is locking done? when it is a bus level UA but there are multiple devices on the bus? Is it possible, devA is clearing the sense on devB. For example, thread1 for devA is doing scsi_clear_unit_attention but then thread2 for devB has seen that bus->unit_attention so it set req ops to reqops_unit_attention. But when we run reqops_unit_attention.send_command scsi_unit_attention does not see req->bus->unit_attention set anymore so we get a CC with no sense. If the linux kernel scsi layer sees a CC with no sense then we fire the SCSI error handler like seen above in the logs. On 7/11/23 12:06 PM, Stefano Garzarella wrote: > CCing `./scripts/get_maintainer.pl -f drivers/scsi/virtio_scsi.c`, > since I found a few things in the virtio-scsi driver... > > FYI we have seen that Linux has problems with a QEMU patch for the > virtio-scsi device (details at the bottom of this email in the revert > commit message and BZ). > > > This is what I found when I looked at the Linux code: > > In scsi_report_sense() in linux/drivers/scsi/scsi_error.c linux calls > scsi_report_lun_change() that set `sdev_target->expecting_lun_change = > 1` when we receive a UNIT ATTENTION with REPORT LUNS CHANGED > (sshdr->asc == 0x3f && sshdr->ascq == 0x0e). > > When `sdev_target->expecting_lun_change = 1` is set and we call > scsi_check_sense(), for example to check the next UNIT ATTENTION, it > will return NEEDS_RETRY, that I think will cause the issues we are > seeing. > > `sdev_target->expecting_lun_change` is reset only in > scsi_decide_disposition() when `REPORT_LUNS` command returns with > SAM_STAT_GOOD. > That command is issued in scsi_report_lun_scan() called by > __scsi_scan_target(), called for example by scsi_scan_target(), > scsi_scan_host(), etc. > > So, checking QEMU, we send VIRTIO_SCSI_EVT_RESET_RESCAN during hotplug > and VIRTIO_SCSI_EVT_RESET_REMOVED during hotunplug. In both cases now we > send also the UNIT ATTENTION. > > In the virtio-scsi driver, when we receive VIRTIO_SCSI_EVT_RESET_RESCAN > (hotplug) we call scsi_scan_target() or scsi_add_device(). Both of them > will call __scsi_scan_target() at some points, sending `REPORT_LUNS` > command to the device. This does not happen for > VIRTIO_SCSI_EVT_RESET_REMOVED (hotunplug). Indeed if I remove the > UNIT ATTENTION from the hotunplug in QEMU, everything works well. > > So, I tried to add a scan also for VIRTIO_SCSI_EVT_RESET_REMOVED: > > diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c > index bd5633667d01..c57658a63097 100644 > --- a/drivers/scsi/virtio_scsi.c > +++ b/drivers/scsi/virtio_scsi.c > @@ -291,6 +291,7 @@ static void virtscsi_handle_transport_reset(struct > virtio_scsi *vscsi, > } > break; > case VIRTIO_SCSI_EVT_RESET_REMOVED: > + scsi_scan_host(shost); > sdev = scsi_device_lookup(shost, 0, target, lun); > if (sdev) { > scsi_remove_device(sdev); > > This somehow helps, now linux only breaks if the plug/unplug frequency > is really high. If I put a 5 second sleep between plug/unplug events, it > doesn't break (at least for the duration of my test which has been > running for about 30 minutes, before it used to break after about a > minute). > > Another thing I noticed is that in QEMU maybe we should set the UNIT > ATTENTION fi
[PATCH v2 1/2] vhost-scsi: Fix alignment handling with windows
The linux block layer requires bios/requests to have lengths with a 512 byte alignment. Some drivers/layers like dm-crypt and the directi IO code will test for it and just fail. Other drivers like SCSI just assume the requirement is met and will end up in infinte retry loops. The problem for drivers like SCSI is that it uses functions like blk_rq_cur_sectors and blk_rq_sectors which divide the request's length by 512. If there's lefovers then it just gets dropped. But other code in the block/scsi layer may use blk_rq_bytes/blk_rq_cur_bytes and end up thinking there is still data left and try to retry the cmd. We can then end up getting stuck in retry loops where part of the block/scsi thinks there is data left, but other parts think we want to do IOs of zero length. Linux will always check for alignment, but windows will not. When vhost-scsi then translates the iovec it gets from a windows guest to a scatterlist, we can end up with sg items where the sg->length is not divisible by 512 due to the misaligned offset: sg[0].offset = 255; sg[0].length = 3841; sg... sg[N].offset = 0; sg[N].length = 255; When the lio backends then convert the SG to bios or other iovecs, we end up sending them with the same misaligned values and can hit the issues above. This just has us drop down to allocating a temp page and copying the data when we detect a misaligned buffer and the IO is large enough that it will get split into multiple bad IOs. Signed-off-by: Mike Christie --- drivers/vhost/scsi.c | 186 +-- 1 file changed, 161 insertions(+), 25 deletions(-) diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c index c83f7f043470..324e4b3846fa 100644 --- a/drivers/vhost/scsi.c +++ b/drivers/vhost/scsi.c @@ -25,6 +25,8 @@ #include #include #include +#include +#include #include #include #include @@ -75,6 +77,9 @@ struct vhost_scsi_cmd { u32 tvc_prot_sgl_count; /* Saved unpacked SCSI LUN for vhost_scsi_target_queue_cmd() */ u32 tvc_lun; + u32 copied_iov:1; + const void *saved_iter_addr; + struct iov_iter saved_iter; /* Pointer to the SGL formatted memory from virtio-scsi */ struct scatterlist *tvc_sgl; struct scatterlist *tvc_prot_sgl; @@ -328,8 +333,13 @@ static void vhost_scsi_release_cmd_res(struct se_cmd *se_cmd) int i; if (tv_cmd->tvc_sgl_count) { - for (i = 0; i < tv_cmd->tvc_sgl_count; i++) - put_page(sg_page(&tv_cmd->tvc_sgl[i])); + for (i = 0; i < tv_cmd->tvc_sgl_count; i++) { + if (tv_cmd->copied_iov) + __free_page(sg_page(&tv_cmd->tvc_sgl[i])); + else + put_page(sg_page(&tv_cmd->tvc_sgl[i])); + } + kfree(tv_cmd->saved_iter_addr); } if (tv_cmd->tvc_prot_sgl_count) { for (i = 0; i < tv_cmd->tvc_prot_sgl_count; i++) @@ -504,6 +514,28 @@ static void vhost_scsi_evt_work(struct vhost_work *work) mutex_unlock(&vq->mutex); } +static int vhost_scsi_copy_sgl_to_iov(struct vhost_scsi_cmd *cmd) +{ + struct iov_iter *iter = &cmd->saved_iter; + struct scatterlist *sg = cmd->tvc_sgl; + struct page *page; + size_t len; + int i; + + for (i = 0; i < cmd->tvc_sgl_count; i++) { + page = sg_page(&sg[i]); + len = sg[i].length; + + if (copy_page_to_iter(page, 0, len, iter) != len) { + pr_err("Could not copy data while handling misaligned cmd. Error %zu\n", + len); + return -1; + } + } + + return 0; +} + /* Fill in status and signal that we are done processing this command * * This is scheduled in the vhost work queue so we are called with the owner @@ -527,15 +559,20 @@ static void vhost_scsi_complete_cmd_work(struct vhost_work *work) pr_debug("%s tv_cmd %p resid %u status %#02x\n", __func__, cmd, se_cmd->residual_count, se_cmd->scsi_status); - memset(&v_rsp, 0, sizeof(v_rsp)); - v_rsp.resid = cpu_to_vhost32(cmd->tvc_vq, se_cmd->residual_count); - /* TODO is status_qualifier field needed? */ - v_rsp.status = se_cmd->scsi_status; - v_rsp.sense_len = cpu_to_vhost32(cmd->tvc_vq, -se_cmd->scsi_sense_length); - memcpy(v_rsp.sense, cmd->tvc_sense_buf, - se_cmd->scsi_sense_length); + + if (cmd->saved_iter_addr && vhost_scsi_copy_sgl_to_iov(cmd)) { + v_rsp.response = VIRTIO_SCSI_S_BAD_TARGET; +
[PATCH v2 2/2] vhost-scsi: Rename vhost_scsi_iov_to_sgl
Rename vhost_scsi_iov_to_sgl to vhost_scsi_map_iov_to_sgl so it matches matches the naming style used for vhost_scsi_copy_iov_to_sgl. Signed-off-by: Mike Christie --- drivers/vhost/scsi.c | 13 +++-- 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c index 324e4b3846fa..abef0619c790 100644 --- a/drivers/vhost/scsi.c +++ b/drivers/vhost/scsi.c @@ -780,8 +780,8 @@ vhost_scsi_copy_iov_to_sgl(struct vhost_scsi_cmd *cmd, struct iov_iter *iter, } static int -vhost_scsi_iov_to_sgl(struct vhost_scsi_cmd *cmd, struct iov_iter *iter, - struct scatterlist *sg, int sg_count, bool is_prot) +vhost_scsi_map_iov_to_sgl(struct vhost_scsi_cmd *cmd, struct iov_iter *iter, + struct scatterlist *sg, int sg_count, bool is_prot) { struct scatterlist *p = sg; size_t revert_bytes; @@ -829,8 +829,9 @@ vhost_scsi_mapal(struct vhost_scsi_cmd *cmd, pr_debug("%s prot_sg %p prot_sgl_count %u\n", __func__, cmd->tvc_prot_sgl, cmd->tvc_prot_sgl_count); - ret = vhost_scsi_iov_to_sgl(cmd, prot_iter, cmd->tvc_prot_sgl, - cmd->tvc_prot_sgl_count, true); + ret = vhost_scsi_map_iov_to_sgl(cmd, prot_iter, + cmd->tvc_prot_sgl, + cmd->tvc_prot_sgl_count, true); if (ret < 0) { cmd->tvc_prot_sgl_count = 0; return ret; @@ -846,8 +847,8 @@ vhost_scsi_mapal(struct vhost_scsi_cmd *cmd, pr_debug("%s data_sg %p data_sgl_count %u\n", __func__, cmd->tvc_sgl, cmd->tvc_sgl_count); - ret = vhost_scsi_iov_to_sgl(cmd, data_iter, cmd->tvc_sgl, - cmd->tvc_sgl_count, false); + ret = vhost_scsi_map_iov_to_sgl(cmd, data_iter, cmd->tvc_sgl, + cmd->tvc_sgl_count, false); if (ret == -EINVAL) { sg_init_table(cmd->tvc_sgl, cmd->tvc_sgl_count); ret = vhost_scsi_copy_iov_to_sgl(cmd, data_iter, cmd->tvc_sgl, -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v2 0/2] vhost-scsi: Fix IO hangs when using windows
The following patches were made over Linus's tree and fix an issue where windows guests will send iovecs with offset/lengths that result in IOs that are not aligned to 512. The LIO layer will then send them to Linux's FS/block layer but it requires 512 byte alignment, so depending on the FS/block driver being used we will get IO errors or hung IO. The following patches have vhost-scsi detect when windows sends these IOs and copy them to a bounce buffer. It then does some cleanup in the related code. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v9 16/17] vhost_scsi: add support for worker ioctls
This has vhost-scsi support the worker ioctls by calling the vhost_worker_ioctl helper. With a single worker, the single thread becomes a bottlneck when trying to use 3 or more virtqueues like: fio --filename=/dev/sdb --direct=1 --rw=randrw --bs=4k \ --ioengine=libaio --iodepth=128 --numjobs=3 With the patches and doing a worker per vq, we can scale to at least 16 vCPUs/vqs (that's my system limit) with the same command fio command above with numjobs=16: fio --filename=/dev/sdb --direct=1 --rw=randrw --bs=4k \ --ioengine=libaio --iodepth=64 --numjobs=16 which gives around 2002K IOPs. Note that for testing I dropped depth to 64 above because the vhost/virt layer supports only 1024 total commands per device. And the only tuning I did was set LIO's emulate_pr to 0 to avoid LIO's PR lock in the main IO path which becomes an issue at around 12 jobs/virtqueues. Signed-off-by: Mike Christie --- drivers/vhost/scsi.c | 8 1 file changed, 8 insertions(+) diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c index 2c3cf487cc71..c83f7f043470 100644 --- a/drivers/vhost/scsi.c +++ b/drivers/vhost/scsi.c @@ -1927,6 +1927,14 @@ vhost_scsi_ioctl(struct file *f, if (copy_from_user(&features, featurep, sizeof features)) return -EFAULT; return vhost_scsi_set_features(vs, features); + case VHOST_NEW_WORKER: + case VHOST_FREE_WORKER: + case VHOST_ATTACH_VRING_WORKER: + case VHOST_GET_VRING_WORKER: + mutex_lock(&vs->dev.mutex); + r = vhost_worker_ioctl(&vs->dev, ioctl, argp); + mutex_unlock(&vs->dev.mutex); + return r; default: mutex_lock(&vs->dev.mutex); r = vhost_dev_ioctl(&vs->dev, ioctl, argp); -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v9 13/17] vhost: add helper to parse userspace vring state/file
The next patches add new vhost worker ioctls which will need to get a vhost_virtqueue from a userspace struct which specifies the vq's index. This moves the vhost_vring_ioctl code to do this to a helper so it can be shared. Signed-off-by: Mike Christie --- drivers/vhost/vhost.c | 29 ++--- 1 file changed, 22 insertions(+), 7 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 6cadbc6e6d11..12203d3893c5 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -599,6 +599,27 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev) return NULL; } +static int vhost_get_vq_from_user(struct vhost_dev *dev, void __user *argp, + struct vhost_virtqueue **vq, u32 *id) +{ + u32 __user *idxp = argp; + u32 idx; + long r; + + r = get_user(idx, idxp); + if (r < 0) + return r; + + if (idx >= dev->nvqs) + return -ENOBUFS; + + idx = array_index_nospec(idx, dev->nvqs); + + *vq = dev->vqs[idx]; + *id = idx; + return 0; +} + /* Caller should have device mutex */ long vhost_dev_set_owner(struct vhost_dev *dev) { @@ -1618,21 +1639,15 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg struct file *eventfp, *filep = NULL; bool pollstart = false, pollstop = false; struct eventfd_ctx *ctx = NULL; - u32 __user *idxp = argp; struct vhost_virtqueue *vq; struct vhost_vring_state s; struct vhost_vring_file f; u32 idx; long r; - r = get_user(idx, idxp); + r = vhost_get_vq_from_user(d, argp, &vq, &idx); if (r < 0) return r; - if (idx >= d->nvqs) - return -ENOBUFS; - - idx = array_index_nospec(idx, d->nvqs); - vq = d->vqs[idx]; if (ioctl == VHOST_SET_VRING_NUM || ioctl == VHOST_SET_VRING_ADDR) { -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v9 05/17] vhost: take worker or vq instead of dev for queueing
This patch has the core work queueing function take a worker for when we support multiple workers. It also adds a helper that takes a vq during queueing so modules can control which vq/worker to queue work on. This temp leaves vhost_work_queue. It will be removed when the drivers are converted in the next patches. Signed-off-by: Mike Christie --- drivers/vhost/vhost.c | 44 +++ drivers/vhost/vhost.h | 1 + 2 files changed, 29 insertions(+), 16 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index aafb23e12477..611e495eeb3c 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -231,21 +231,10 @@ void vhost_poll_stop(struct vhost_poll *poll) } EXPORT_SYMBOL_GPL(vhost_poll_stop); -void vhost_dev_flush(struct vhost_dev *dev) +static bool vhost_worker_queue(struct vhost_worker *worker, + struct vhost_work *work) { - struct vhost_flush_struct flush; - - init_completion(&flush.wait_event); - vhost_work_init(&flush.work, vhost_flush_work); - - if (vhost_work_queue(dev, &flush.work)) - wait_for_completion(&flush.wait_event); -} -EXPORT_SYMBOL_GPL(vhost_dev_flush); - -bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work) -{ - if (!dev->worker) + if (!worker) return false; /* * vsock can queue while we do a VHOST_SET_OWNER, so we have a smp_wmb @@ -257,14 +246,37 @@ bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work) * sure it was not in the list. * test_and_set_bit() implies a memory barrier. */ - llist_add(&work->node, &dev->worker->work_list); - vhost_task_wake(dev->worker->vtsk); + llist_add(&work->node, &worker->work_list); + vhost_task_wake(worker->vtsk); } return true; } + +bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work) +{ + return vhost_worker_queue(dev->worker, work); +} EXPORT_SYMBOL_GPL(vhost_work_queue); +bool vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work) +{ + return vhost_worker_queue(vq->worker, work); +} +EXPORT_SYMBOL_GPL(vhost_vq_work_queue); + +void vhost_dev_flush(struct vhost_dev *dev) +{ + struct vhost_flush_struct flush; + + init_completion(&flush.wait_event); + vhost_work_init(&flush.work, vhost_flush_work); + + if (vhost_work_queue(dev, &flush.work)) + wait_for_completion(&flush.wait_event); +} +EXPORT_SYMBOL_GPL(vhost_dev_flush); + /* A lockless hint for busy polling code to exit the loop */ bool vhost_vq_has_work(struct vhost_virtqueue *vq) { diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index d5728f986744..e2dd4558a1f9 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -198,6 +198,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *, struct vhost_log *log, unsigned int *log_num); void vhost_discard_vq_desc(struct vhost_virtqueue *, int n); +bool vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work); bool vhost_vq_has_work(struct vhost_virtqueue *vq); bool vhost_vq_is_setup(struct vhost_virtqueue *vq); int vhost_vq_init_access(struct vhost_virtqueue *); -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v9 10/17] vhost_scsi: convert to vhost_vq_work_queue
Convert from vhost_work_queue to vhost_vq_work_queue so we can remove vhost_work_queue. Signed-off-by: Mike Christie --- drivers/vhost/scsi.c | 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c index a77c53bb035a..1668009bd489 100644 --- a/drivers/vhost/scsi.c +++ b/drivers/vhost/scsi.c @@ -353,8 +353,9 @@ static void vhost_scsi_release_cmd(struct se_cmd *se_cmd) if (se_cmd->se_cmd_flags & SCF_SCSI_TMR_CDB) { struct vhost_scsi_tmf *tmf = container_of(se_cmd, struct vhost_scsi_tmf, se_cmd); + struct vhost_virtqueue *vq = &tmf->svq->vq; - vhost_work_queue(&tmf->vhost->dev, &tmf->vwork); + vhost_vq_work_queue(vq, &tmf->vwork); } else { struct vhost_scsi_cmd *cmd = container_of(se_cmd, struct vhost_scsi_cmd, tvc_se_cmd); @@ -1332,11 +1333,9 @@ static void vhost_scsi_ctl_handle_kick(struct vhost_work *work) } static void -vhost_scsi_send_evt(struct vhost_scsi *vs, - struct vhost_scsi_tpg *tpg, - struct se_lun *lun, - u32 event, - u32 reason) +vhost_scsi_send_evt(struct vhost_scsi *vs, struct vhost_virtqueue *vq, + struct vhost_scsi_tpg *tpg, struct se_lun *lun, + u32 event, u32 reason) { struct vhost_scsi_evt *evt; @@ -1358,7 +1357,7 @@ vhost_scsi_send_evt(struct vhost_scsi *vs, } llist_add(&evt->list, &vs->vs_event_list); - vhost_work_queue(&vs->dev, &vs->vs_event_work); + vhost_vq_work_queue(vq, &vs->vs_event_work); } static void vhost_scsi_evt_handle_kick(struct vhost_work *work) @@ -1372,7 +1371,8 @@ static void vhost_scsi_evt_handle_kick(struct vhost_work *work) goto out; if (vs->vs_events_missed) - vhost_scsi_send_evt(vs, NULL, NULL, VIRTIO_SCSI_T_NO_EVENT, 0); + vhost_scsi_send_evt(vs, vq, NULL, NULL, VIRTIO_SCSI_T_NO_EVENT, + 0); out: mutex_unlock(&vq->mutex); } @@ -1991,7 +1991,7 @@ vhost_scsi_do_plug(struct vhost_scsi_tpg *tpg, goto unlock; if (vhost_has_feature(vq, VIRTIO_SCSI_F_HOTPLUG)) - vhost_scsi_send_evt(vs, tpg, lun, + vhost_scsi_send_evt(vs, vq, tpg, lun, VIRTIO_SCSI_T_TRANSPORT_RESET, reason); unlock: mutex_unlock(&vq->mutex); -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v9 12/17] vhost: remove vhost_work_queue
vhost_work_queue is no longer used. Each driver is using the poll or vq based queueing, so remove vhost_work_queue. Signed-off-by: Mike Christie --- drivers/vhost/vhost.c | 6 -- drivers/vhost/vhost.h | 5 ++--- 2 files changed, 2 insertions(+), 9 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index b6149ed8acb4..6cadbc6e6d11 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -255,12 +255,6 @@ static bool vhost_worker_queue(struct vhost_worker *worker, return true; } -bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work) -{ - return vhost_worker_queue(dev->worker, work); -} -EXPORT_SYMBOL_GPL(vhost_work_queue); - bool vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work) { return vhost_worker_queue(vq->worker, work); diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index 3882266fbbf3..83f78b6f563b 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -44,15 +44,14 @@ struct vhost_poll { struct vhost_virtqueue *vq; }; -void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn); -bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work); - void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn, __poll_t mask, struct vhost_dev *dev, struct vhost_virtqueue *vq); int vhost_poll_start(struct vhost_poll *poll, struct file *file); void vhost_poll_stop(struct vhost_poll *poll); void vhost_poll_queue(struct vhost_poll *poll); + +void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn); void vhost_dev_flush(struct vhost_dev *dev); struct vhost_log { -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v9 11/17] vhost_scsi: flush IO vqs then send TMF rsp
With one worker we will always send the scsi cmd responses then send the TMF rsp, because LIO will always complete the scsi cmds first then call into us to send the TMF response. With multiple workers, the IO vq workers could be running while the TMF/ctl vq worker is running so this has us do a flush before completing the TMF to make sure cmds are completed when it's work is later queued and run. Signed-off-by: Mike Christie --- drivers/vhost/scsi.c | 21 ++--- 1 file changed, 18 insertions(+), 3 deletions(-) diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c index 1668009bd489..2c3cf487cc71 100644 --- a/drivers/vhost/scsi.c +++ b/drivers/vhost/scsi.c @@ -1133,12 +1133,27 @@ static void vhost_scsi_tmf_resp_work(struct vhost_work *work) { struct vhost_scsi_tmf *tmf = container_of(work, struct vhost_scsi_tmf, vwork); - int resp_code; + struct vhost_virtqueue *ctl_vq, *vq; + int resp_code, i; + + if (tmf->scsi_resp == TMR_FUNCTION_COMPLETE) { + /* +* Flush IO vqs that don't share a worker with the ctl to make +* sure they have sent their responses before us. +*/ + ctl_vq = &tmf->vhost->vqs[VHOST_SCSI_VQ_CTL].vq; + for (i = VHOST_SCSI_VQ_IO; i < tmf->vhost->dev.nvqs; i++) { + vq = &tmf->vhost->vqs[i].vq; + + if (vhost_vq_is_setup(vq) && + vq->worker != ctl_vq->worker) + vhost_vq_flush(vq); + } - if (tmf->scsi_resp == TMR_FUNCTION_COMPLETE) resp_code = VIRTIO_SCSI_S_FUNCTION_SUCCEEDED; - else + } else { resp_code = VIRTIO_SCSI_S_FUNCTION_REJECTED; + } vhost_scsi_send_tmf_resp(tmf->vhost, &tmf->svq->vq, tmf->in_iovs, tmf->vq_desc, &tmf->resp_iov, resp_code); -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v9 08/17] vhost_sock: convert to vhost_vq_work_queue
Convert from vhost_work_queue to vhost_vq_work_queue, so we can drop vhost_work_queue. Signed-off-by: Mike Christie --- drivers/vhost/vsock.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c index 6578db78f0ae..817d377a3f36 100644 --- a/drivers/vhost/vsock.c +++ b/drivers/vhost/vsock.c @@ -285,7 +285,7 @@ vhost_transport_send_pkt(struct sk_buff *skb) atomic_inc(&vsock->queued_replies); virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb); - vhost_work_queue(&vsock->dev, &vsock->send_pkt_work); + vhost_vq_work_queue(&vsock->vqs[VSOCK_VQ_RX], &vsock->send_pkt_work); rcu_read_unlock(); return len; @@ -583,7 +583,7 @@ static int vhost_vsock_start(struct vhost_vsock *vsock) /* Some packets may have been queued before the device was started, * let's kick the send worker to send them. */ - vhost_work_queue(&vsock->dev, &vsock->send_pkt_work); + vhost_vq_work_queue(&vsock->vqs[VSOCK_VQ_RX], &vsock->send_pkt_work); mutex_unlock(&vsock->dev.mutex); return 0; -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v9 17/17] vhost: Allow worker switching while work is queueing
This patch drops the requirement that we can only switch workers if work has not been queued by using RCU for the vq based queueing paths and a mutex for the device wide flush. We can also use this to support SIGKILL properly in the future where we should exit almost immediately after getting that signal. With this patch, when get_signal returns true, we can set the vq->worker to NULL and do a synchronize_rcu to prevent new work from being queued to the vhost_task that has been killed. Signed-off-by: Mike Christie --- drivers/vhost/vhost.c | 153 +++-- drivers/vhost/vhost.h | 4 +- include/uapi/linux/vhost.h | 4 +- 3 files changed, 115 insertions(+), 46 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index bfba91ecbd0a..c71d573f1c94 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -233,16 +233,9 @@ void vhost_poll_stop(struct vhost_poll *poll) } EXPORT_SYMBOL_GPL(vhost_poll_stop); -static bool vhost_worker_queue(struct vhost_worker *worker, +static void vhost_worker_queue(struct vhost_worker *worker, struct vhost_work *work) { - if (!worker) - return false; - /* -* vsock can queue while we do a VHOST_SET_OWNER, so we have a smp_wmb -* when setting up the worker. We don't have a smp_rmb here because -* test_and_set_bit gives us a mb already. -*/ if (!test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) { /* We can only add the work to the list after we're * sure it was not in the list. @@ -251,47 +244,85 @@ static bool vhost_worker_queue(struct vhost_worker *worker, llist_add(&work->node, &worker->work_list); vhost_task_wake(worker->vtsk); } - - return true; } bool vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work) { - return vhost_worker_queue(vq->worker, work); + struct vhost_worker *worker; + bool queued = false; + + rcu_read_lock(); + worker = rcu_dereference(vq->worker); + if (worker) { + queued = true; + vhost_worker_queue(worker, work); + } + rcu_read_unlock(); + + return queued; } EXPORT_SYMBOL_GPL(vhost_vq_work_queue); -static void vhost_worker_flush(struct vhost_worker *worker) +void vhost_vq_flush(struct vhost_virtqueue *vq) { struct vhost_flush_struct flush; init_completion(&flush.wait_event); vhost_work_init(&flush.work, vhost_flush_work); - if (vhost_worker_queue(worker, &flush.work)) + if (vhost_vq_work_queue(vq, &flush.work)) wait_for_completion(&flush.wait_event); } +EXPORT_SYMBOL_GPL(vhost_vq_flush); -void vhost_vq_flush(struct vhost_virtqueue *vq) +/** + * vhost_worker_flush - flush a worker + * @worker: worker to flush + * + * This does not use RCU to protect the worker, so the device or worker + * mutex must be held. + */ +static void vhost_worker_flush(struct vhost_worker *worker) { - vhost_worker_flush(vq->worker); + struct vhost_flush_struct flush; + + init_completion(&flush.wait_event); + vhost_work_init(&flush.work, vhost_flush_work); + + vhost_worker_queue(worker, &flush.work); + wait_for_completion(&flush.wait_event); } -EXPORT_SYMBOL_GPL(vhost_vq_flush); void vhost_dev_flush(struct vhost_dev *dev) { struct vhost_worker *worker; unsigned long i; - xa_for_each(&dev->worker_xa, i, worker) + xa_for_each(&dev->worker_xa, i, worker) { + mutex_lock(&worker->mutex); + if (!worker->attachment_cnt) { + mutex_unlock(&worker->mutex); + continue; + } vhost_worker_flush(worker); + mutex_unlock(&worker->mutex); + } } EXPORT_SYMBOL_GPL(vhost_dev_flush); /* A lockless hint for busy polling code to exit the loop */ bool vhost_vq_has_work(struct vhost_virtqueue *vq) { - return !llist_empty(&vq->worker->work_list); + struct vhost_worker *worker; + bool has_work = false; + + rcu_read_lock(); + worker = rcu_dereference(vq->worker); + if (worker && !llist_empty(&worker->work_list)) + has_work = true; + rcu_read_unlock(); + + return has_work; } EXPORT_SYMBOL_GPL(vhost_vq_has_work); @@ -356,7 +387,7 @@ static void vhost_vq_reset(struct vhost_dev *dev, vq->busyloop_timeout = 0; vq->umem = NULL; vq->iotlb = NULL; - vq->worker = NULL; + rcu_assign_pointer(vq->worker, NULL); vhost_vring_call_reset(&vq->call_ctx); __vhost_vq_meta_reset(vq); } @@ -578,7 +6
[PATCH v9 14/17] vhost: replace single worker pointer with xarray
The next patch allows userspace to create multiple workers per device, so this patch replaces the vhost_worker pointer with an xarray so we can store mupltiple workers and look them up. Signed-off-by: Mike Christie --- drivers/vhost/vhost.c | 64 --- drivers/vhost/vhost.h | 3 +- 2 files changed, 50 insertions(+), 17 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 12203d3893c5..ffbaf7d32e2c 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -280,7 +280,11 @@ EXPORT_SYMBOL_GPL(vhost_vq_flush); void vhost_dev_flush(struct vhost_dev *dev) { - vhost_worker_flush(dev->worker); + struct vhost_worker *worker; + unsigned long i; + + xa_for_each(&dev->worker_xa, i, worker) + vhost_worker_flush(worker); } EXPORT_SYMBOL_GPL(vhost_dev_flush); @@ -482,7 +486,6 @@ void vhost_dev_init(struct vhost_dev *dev, dev->umem = NULL; dev->iotlb = NULL; dev->mm = NULL; - dev->worker = NULL; dev->iov_limit = iov_limit; dev->weight = weight; dev->byte_weight = byte_weight; @@ -492,7 +495,7 @@ void vhost_dev_init(struct vhost_dev *dev, INIT_LIST_HEAD(&dev->read_list); INIT_LIST_HEAD(&dev->pending_list); spin_lock_init(&dev->iotlb_lock); - + xa_init_flags(&dev->worker_xa, XA_FLAGS_ALLOC); for (i = 0; i < dev->nvqs; ++i) { vq = dev->vqs[i]; @@ -554,15 +557,35 @@ static void vhost_detach_mm(struct vhost_dev *dev) dev->mm = NULL; } -static void vhost_worker_free(struct vhost_dev *dev) +static void vhost_worker_destroy(struct vhost_dev *dev, +struct vhost_worker *worker) +{ + if (!worker) + return; + + WARN_ON(!llist_empty(&worker->work_list)); + xa_erase(&dev->worker_xa, worker->id); + vhost_task_stop(worker->vtsk); + kfree(worker); +} + +static void vhost_workers_free(struct vhost_dev *dev) { - if (!dev->worker) + struct vhost_worker *worker; + unsigned long i; + + if (!dev->use_worker) return; - WARN_ON(!llist_empty(&dev->worker->work_list)); - vhost_task_stop(dev->worker->vtsk); - kfree(dev->worker); - dev->worker = NULL; + for (i = 0; i < dev->nvqs; i++) + dev->vqs[i]->worker = NULL; + /* +* Free the default worker we created and cleanup workers userspace +* created but couldn't clean up (it forgot or crashed). +*/ + xa_for_each(&dev->worker_xa, i, worker) + vhost_worker_destroy(dev, worker); + xa_destroy(&dev->worker_xa); } static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev) @@ -570,6 +593,8 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev) struct vhost_worker *worker; struct vhost_task *vtsk; char name[TASK_COMM_LEN]; + int ret; + u32 id; worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT); if (!worker) @@ -584,16 +609,18 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev) init_llist_head(&worker->work_list); worker->kcov_handle = kcov_common_handle(); worker->vtsk = vtsk; - /* -* vsock can already try to queue so make sure llist and vtsk are both -* set before vhost_work_queue sees dev->worker is set. -*/ - smp_wmb(); - dev->worker = worker; vhost_task_start(vtsk); + + ret = xa_alloc(&dev->worker_xa, &id, worker, xa_limit_32b, GFP_KERNEL); + if (ret < 0) + goto stop_worker; + worker->id = id; + return worker; +stop_worker: + vhost_task_stop(vtsk); free_worker: kfree(worker); return NULL; @@ -650,6 +677,11 @@ long vhost_dev_set_owner(struct vhost_dev *dev) err = -ENOMEM; goto err_worker; } + /* +* vsock can already try to queue so make sure the worker +* is setup before vhost_vq_work_queue sees vq->worker is set. +*/ + smp_wmb(); for (i = 0; i < dev->nvqs; i++) dev->vqs[i]->worker = worker; @@ -751,7 +783,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev) dev->iotlb = NULL; vhost_clear_msg(dev); wake_up_interruptible_poll(&dev->wait, EPOLLIN | EPOLLRDNORM); - vhost_worker_free(dev); + vhost_workers_free(dev); vhost_detach_mm(dev); } EXPORT_SYMBOL_GPL(vhost_dev_cleanup); diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index 83f78b6f563b
[PATCH v9 15/17] vhost: allow userspace to create workers
For vhost-scsi with 3 vqs or more and a workload that tries to use them in parallel like: fio --filename=/dev/sdb --direct=1 --rw=randrw --bs=4k \ --ioengine=libaio --iodepth=128 --numjobs=3 the single vhost worker thread will become a bottlneck and we are stuck at around 500K IOPs no matter how many jobs, virtqueues, and CPUs are used. To better utilize virtqueues and available CPUs, this patch allows userspace to create workers and bind them to vqs. You can have N workers per dev and also share N workers with M vqs on that dev. This patch adds the interface related code and the next patch will hook vhost-scsi into it. The patches do not try to hook net and vsock into the interface because: 1. multiple workers don't seem to help vsock. The problem is that with only 2 virtqueues we never fully use the existing worker when doing bidirectional tests. This seems to match vhost-scsi where we don't see the worker as a bottleneck until 3 virtqueues are used. 2. net already has a way to use multiple workers. Signed-off-by: Mike Christie --- drivers/vhost/vhost.c| 142 ++- drivers/vhost/vhost.h| 3 + include/uapi/linux/vhost.h | 33 +++ include/uapi/linux/vhost_types.h | 16 4 files changed, 193 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index ffbaf7d32e2c..bfba91ecbd0a 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -626,6 +626,76 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev) return NULL; } +/* Caller must have device mutex */ +static void __vhost_vq_attach_worker(struct vhost_virtqueue *vq, +struct vhost_worker *worker) +{ + if (vq->worker) + vq->worker->attachment_cnt--; + worker->attachment_cnt++; + vq->worker = worker; +} + + /* Caller must have device and virtqueue mutex */ +static int vhost_vq_attach_worker(struct vhost_virtqueue *vq, + struct vhost_vring_worker *info) +{ + unsigned long index = info->worker_id; + struct vhost_dev *dev = vq->dev; + struct vhost_worker *worker; + + if (!dev->use_worker) + return -EINVAL; + + /* +* We only allow userspace to set a virtqueue's worker if it's not +* active and polling is not enabled. We also assume drivers +* supporting this will not be internally queueing works directly or +* via calls like vhost_dev_flush at this time. +*/ + if (vhost_vq_get_backend(vq) || vq->kick) + return -EBUSY; + + worker = xa_find(&dev->worker_xa, &index, UINT_MAX, XA_PRESENT); + if (!worker || worker->id != info->worker_id) + return -ENODEV; + + __vhost_vq_attach_worker(vq, worker); + return 0; +} + +/* Caller must have device mutex */ +static int vhost_new_worker(struct vhost_dev *dev, + struct vhost_worker_state *info) +{ + struct vhost_worker *worker; + + worker = vhost_worker_create(dev); + if (!worker) + return -ENOMEM; + + info->worker_id = worker->id; + return 0; +} + +/* Caller must have device mutex */ +static int vhost_free_worker(struct vhost_dev *dev, +struct vhost_worker_state *info) +{ + unsigned long index = info->worker_id; + struct vhost_worker *worker; + + worker = xa_find(&dev->worker_xa, &index, UINT_MAX, XA_PRESENT); + if (!worker || worker->id != info->worker_id) + return -ENODEV; + + if (worker->attachment_cnt) + return -EBUSY; + + vhost_worker_destroy(dev, worker); + return 0; +} + static int vhost_get_vq_from_user(struct vhost_dev *dev, void __user *argp, struct vhost_virtqueue **vq, u32 *id) { @@ -647,6 +717,76 @@ static int vhost_get_vq_from_user(struct vhost_dev *dev, void __user *argp, return 0; } +/* Caller must have device mutex */ +long vhost_worker_ioctl(struct vhost_dev *dev, unsigned int ioctl, + void __user *argp) +{ + struct vhost_vring_worker ring_worker; + struct vhost_worker_state state; + struct vhost_virtqueue *vq; + long ret; + u32 idx; + + if (!dev->use_worker) + return -EINVAL; + + if (!vhost_dev_has_owner(dev)) + return -EINVAL; + + ret = vhost_dev_check_owner(dev); + if (ret) + return ret; + + switch (ioctl) { + /* dev worker ioctls */ + case VHOST_NEW_WORKER: + ret = vhost_new_worker(dev, &state); + if (!ret && copy_to_user(argp, &state, sizeof(state))) + ret = -EFAULT; + return ret; + case VH
[PATCH v9 03/17] vhost: add vhost_worker pointer to vhost_virtqueue
This patchset allows userspace to map vqs to different workers. This patch adds a worker pointer to the vq so in later patches in this set we can queue/flush specific vqs and their workers. Signed-off-by: Mike Christie --- drivers/vhost/vhost.c | 21 ++--- drivers/vhost/vhost.h | 1 + 2 files changed, 15 insertions(+), 7 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index dfd96cf6a152..db88464c1691 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -333,6 +333,7 @@ static void vhost_vq_reset(struct vhost_dev *dev, vq->busyloop_timeout = 0; vq->umem = NULL; vq->iotlb = NULL; + vq->worker = NULL; vhost_vring_call_reset(&vq->call_ctx); __vhost_vq_meta_reset(vq); } @@ -545,7 +546,7 @@ static void vhost_worker_free(struct vhost_dev *dev) dev->worker = NULL; } -static int vhost_worker_create(struct vhost_dev *dev) +static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev) { struct vhost_worker *worker; struct vhost_task *vtsk; @@ -553,7 +554,7 @@ static int vhost_worker_create(struct vhost_dev *dev) worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT); if (!worker) - return -ENOMEM; + return NULL; snprintf(name, sizeof(name), "vhost-%d", current->pid); @@ -572,17 +573,18 @@ static int vhost_worker_create(struct vhost_dev *dev) dev->worker = worker; vhost_task_start(vtsk); - return 0; + return worker; free_worker: kfree(worker); - return -ENOMEM; + return NULL; } /* Caller should have device mutex */ long vhost_dev_set_owner(struct vhost_dev *dev) { - int err; + struct vhost_worker *worker; + int err, i; /* Is there an owner already? */ if (vhost_dev_has_owner(dev)) { @@ -603,9 +605,14 @@ long vhost_dev_set_owner(struct vhost_dev *dev) * below since we don't have to worry about vsock queueing * while we free the worker. */ - err = vhost_worker_create(dev); - if (err) + worker = vhost_worker_create(dev); + if (!worker) { + err = -ENOMEM; goto err_worker; + } + + for (i = 0; i < dev->nvqs; i++) + dev->vqs[i]->worker = worker; } return 0; diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index 6cd72d0b5245..0baacc245934 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -74,6 +74,7 @@ struct vhost_vring_call { /* The virtqueue structure describes a queue attached to a device. */ struct vhost_virtqueue { struct vhost_dev *dev; + struct vhost_worker *worker; /* The actual ring of buffers. */ struct mutex mutex; -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v9 02/17] vhost: dynamically allocate vhost_worker
This patchset allows us to allocate multiple workers, so this has us move from the vhost_worker that's embedded in the vhost_dev to dynamically allocating it. Signed-off-by: Mike Christie --- drivers/vhost/vhost.c | 66 --- drivers/vhost/vhost.h | 4 +-- 2 files changed, 45 insertions(+), 25 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 82966ffb4a5c..dfd96cf6a152 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -235,36 +235,40 @@ void vhost_dev_flush(struct vhost_dev *dev) { struct vhost_flush_struct flush; - if (dev->worker.vtsk) { - init_completion(&flush.wait_event); - vhost_work_init(&flush.work, vhost_flush_work); + init_completion(&flush.wait_event); + vhost_work_init(&flush.work, vhost_flush_work); - vhost_work_queue(dev, &flush.work); + if (vhost_work_queue(dev, &flush.work)) wait_for_completion(&flush.wait_event); - } } EXPORT_SYMBOL_GPL(vhost_dev_flush); -void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work) +bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work) { - if (!dev->worker.vtsk) - return; - + if (!dev->worker) + return false; + /* +* vsock can queue while we do a VHOST_SET_OWNER, so we have a smp_wmb +* when setting up the worker. We don't have a smp_rmb here because +* test_and_set_bit gives us a mb already. +*/ if (!test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) { /* We can only add the work to the list after we're * sure it was not in the list. * test_and_set_bit() implies a memory barrier. */ - llist_add(&work->node, &dev->worker.work_list); - vhost_task_wake(dev->worker.vtsk); + llist_add(&work->node, &dev->worker->work_list); + vhost_task_wake(dev->worker->vtsk); } + + return true; } EXPORT_SYMBOL_GPL(vhost_work_queue); /* A lockless hint for busy polling code to exit the loop */ bool vhost_has_work(struct vhost_dev *dev) { - return !llist_empty(&dev->worker.work_list); + return !llist_empty(&dev->worker->work_list); } EXPORT_SYMBOL_GPL(vhost_has_work); @@ -458,8 +462,7 @@ void vhost_dev_init(struct vhost_dev *dev, dev->umem = NULL; dev->iotlb = NULL; dev->mm = NULL; - memset(&dev->worker, 0, sizeof(dev->worker)); - init_llist_head(&dev->worker.work_list); + dev->worker = NULL; dev->iov_limit = iov_limit; dev->weight = weight; dev->byte_weight = byte_weight; @@ -533,30 +536,47 @@ static void vhost_detach_mm(struct vhost_dev *dev) static void vhost_worker_free(struct vhost_dev *dev) { - if (!dev->worker.vtsk) + if (!dev->worker) return; - WARN_ON(!llist_empty(&dev->worker.work_list)); - vhost_task_stop(dev->worker.vtsk); - dev->worker.kcov_handle = 0; - dev->worker.vtsk = NULL; + WARN_ON(!llist_empty(&dev->worker->work_list)); + vhost_task_stop(dev->worker->vtsk); + kfree(dev->worker); + dev->worker = NULL; } static int vhost_worker_create(struct vhost_dev *dev) { + struct vhost_worker *worker; struct vhost_task *vtsk; char name[TASK_COMM_LEN]; + worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT); + if (!worker) + return -ENOMEM; + snprintf(name, sizeof(name), "vhost-%d", current->pid); - vtsk = vhost_task_create(vhost_worker, &dev->worker, name); + vtsk = vhost_task_create(vhost_worker, worker, name); if (!vtsk) - return -ENOMEM; + goto free_worker; + + init_llist_head(&worker->work_list); + worker->kcov_handle = kcov_common_handle(); + worker->vtsk = vtsk; + /* +* vsock can already try to queue so make sure llist and vtsk are both +* set before vhost_work_queue sees dev->worker is set. +*/ + smp_wmb(); + dev->worker = worker; - dev->worker.kcov_handle = kcov_common_handle(); - dev->worker.vtsk = vtsk; vhost_task_start(vtsk); return 0; + +free_worker: + kfree(worker); + return -ENOMEM; } /* Caller should have device mutex */ diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index 37ce869f8a5c..6cd72d0b5245 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -44,7 +44,7 @@ struct vhost_poll { }; void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn); -void vhost_w
[PATCH v9 09/17] vhost_scsi: make SCSI cmd completion per vq
This patch separates the scsi cmd completion code paths so we can complete cmds based on their vq instead of having all cmds complete on the same worker/CPU. This will be useful with the next patches that allow us to create mulitple worker threads and bind them to different vqs, and we can have completions running on different threads/CPUs. Signed-off-by: Mike Christie Reviewed-by: Stefan Hajnoczi --- drivers/vhost/scsi.c | 56 1 file changed, 26 insertions(+), 30 deletions(-) diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c index bb10fa4bb4f6..a77c53bb035a 100644 --- a/drivers/vhost/scsi.c +++ b/drivers/vhost/scsi.c @@ -167,6 +167,7 @@ MODULE_PARM_DESC(max_io_vqs, "Set the max number of IO virtqueues a vhost scsi d struct vhost_scsi_virtqueue { struct vhost_virtqueue vq; + struct vhost_scsi *vs; /* * Reference counting for inflight reqs, used for flush operation. At * each time, one reference tracks new commands submitted, while we @@ -181,6 +182,9 @@ struct vhost_scsi_virtqueue { struct vhost_scsi_cmd *scsi_cmds; struct sbitmap scsi_tags; int max_cmds; + + struct vhost_work completion_work; + struct llist_head completion_list; }; struct vhost_scsi { @@ -190,12 +194,8 @@ struct vhost_scsi { struct vhost_dev dev; struct vhost_scsi_virtqueue *vqs; - unsigned long *compl_bitmap; struct vhost_scsi_inflight **old_inflight; - struct vhost_work vs_completion_work; /* cmd completion work item */ - struct llist_head vs_completion_list; /* cmd completion queue */ - struct vhost_work vs_event_work; /* evt injection work item */ struct llist_head vs_event_list; /* evt injection queue */ @@ -358,10 +358,11 @@ static void vhost_scsi_release_cmd(struct se_cmd *se_cmd) } else { struct vhost_scsi_cmd *cmd = container_of(se_cmd, struct vhost_scsi_cmd, tvc_se_cmd); - struct vhost_scsi *vs = cmd->tvc_vhost; + struct vhost_scsi_virtqueue *svq = container_of(cmd->tvc_vq, + struct vhost_scsi_virtqueue, vq); - llist_add(&cmd->tvc_completion_list, &vs->vs_completion_list); - vhost_work_queue(&vs->dev, &vs->vs_completion_work); + llist_add(&cmd->tvc_completion_list, &svq->completion_list); + vhost_vq_work_queue(&svq->vq, &svq->completion_work); } } @@ -509,17 +510,17 @@ static void vhost_scsi_evt_work(struct vhost_work *work) */ static void vhost_scsi_complete_cmd_work(struct vhost_work *work) { - struct vhost_scsi *vs = container_of(work, struct vhost_scsi, - vs_completion_work); + struct vhost_scsi_virtqueue *svq = container_of(work, + struct vhost_scsi_virtqueue, completion_work); struct virtio_scsi_cmd_resp v_rsp; struct vhost_scsi_cmd *cmd, *t; struct llist_node *llnode; struct se_cmd *se_cmd; struct iov_iter iov_iter; - int ret, vq; + bool signal = false; + int ret; - bitmap_zero(vs->compl_bitmap, vs->dev.nvqs); - llnode = llist_del_all(&vs->vs_completion_list); + llnode = llist_del_all(&svq->completion_list); llist_for_each_entry_safe(cmd, t, llnode, tvc_completion_list) { se_cmd = &cmd->tvc_se_cmd; @@ -539,21 +540,17 @@ static void vhost_scsi_complete_cmd_work(struct vhost_work *work) cmd->tvc_in_iovs, sizeof(v_rsp)); ret = copy_to_iter(&v_rsp, sizeof(v_rsp), &iov_iter); if (likely(ret == sizeof(v_rsp))) { - struct vhost_scsi_virtqueue *q; + signal = true; + vhost_add_used(cmd->tvc_vq, cmd->tvc_vq_desc, 0); - q = container_of(cmd->tvc_vq, struct vhost_scsi_virtqueue, vq); - vq = q - vs->vqs; - __set_bit(vq, vs->compl_bitmap); } else pr_err("Faulted on virtio_scsi_cmd_resp\n"); vhost_scsi_release_cmd_res(se_cmd); } - vq = -1; - while ((vq = find_next_bit(vs->compl_bitmap, vs->dev.nvqs, vq + 1)) - < vs->dev.nvqs) - vhost_signal(&vs->dev, &vs->vqs[vq].vq); + if (signal) + vhost_signal(&svq->vs->dev, &svq->vq); } static struct vhost_scsi_cmd * @@ -1770,6 +1767,7 @@ static int vhost_scsi_set_features(struct vhost_scsi *vs, u64 features) static int vhost_scsi_open(struct inode *inode, struct file *f) { + struct vhost_scsi_virtqueue *svq;
[PATCH v9 04/17] vhost, vhost_net: add helper to check if vq has work
In the next patches each vq might have different workers so one could have work but others do not. For net, we only want to check specific vqs, so this adds a helper to check if a vq has work pending and converts vhost-net to use it. Signed-off-by: Mike Christie Acked-by: Jason Wang --- drivers/vhost/net.c | 2 +- drivers/vhost/vhost.c | 6 +++--- drivers/vhost/vhost.h | 2 +- 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index ae2273196b0c..98bb957eb3b9 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -546,7 +546,7 @@ static void vhost_net_busy_poll(struct vhost_net *net, endtime = busy_clock() + busyloop_timeout; while (vhost_can_busy_poll(endtime)) { - if (vhost_has_work(&net->dev)) { + if (vhost_vq_has_work(vq)) { *busyloop_intr = true; break; } diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index db88464c1691..aafb23e12477 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -266,11 +266,11 @@ bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work) EXPORT_SYMBOL_GPL(vhost_work_queue); /* A lockless hint for busy polling code to exit the loop */ -bool vhost_has_work(struct vhost_dev *dev) +bool vhost_vq_has_work(struct vhost_virtqueue *vq) { - return !llist_empty(&dev->worker->work_list); + return !llist_empty(&vq->worker->work_list); } -EXPORT_SYMBOL_GPL(vhost_has_work); +EXPORT_SYMBOL_GPL(vhost_vq_has_work); void vhost_poll_queue(struct vhost_poll *poll) { diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index 0baacc245934..d5728f986744 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -45,7 +45,6 @@ struct vhost_poll { void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn); bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work); -bool vhost_has_work(struct vhost_dev *dev); void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn, __poll_t mask, struct vhost_dev *dev); @@ -199,6 +198,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *, struct vhost_log *log, unsigned int *log_num); void vhost_discard_vq_desc(struct vhost_virtqueue *, int n); +bool vhost_vq_has_work(struct vhost_virtqueue *vq); bool vhost_vq_is_setup(struct vhost_virtqueue *vq); int vhost_vq_init_access(struct vhost_virtqueue *); int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len); -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v9 06/17] vhost: take worker or vq for flushing
This patch has the core work flush function take a worker. When we support multiple workers we can then flush each worker during device removal, stoppage, etc. It also adds a helper to flush specific virtqueues, so vhost-scsi can flush IO vqs from it's ctl vq. Signed-off-by: Mike Christie --- drivers/vhost/vhost.c | 15 +-- drivers/vhost/vhost.h | 1 + 2 files changed, 14 insertions(+), 2 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 611e495eeb3c..2ea107e867a1 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -265,16 +265,27 @@ bool vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work) } EXPORT_SYMBOL_GPL(vhost_vq_work_queue); -void vhost_dev_flush(struct vhost_dev *dev) +static void vhost_worker_flush(struct vhost_worker *worker) { struct vhost_flush_struct flush; init_completion(&flush.wait_event); vhost_work_init(&flush.work, vhost_flush_work); - if (vhost_work_queue(dev, &flush.work)) + if (vhost_worker_queue(worker, &flush.work)) wait_for_completion(&flush.wait_event); } + +void vhost_vq_flush(struct vhost_virtqueue *vq) +{ + vhost_worker_flush(vq->worker); +} +EXPORT_SYMBOL_GPL(vhost_vq_flush); + +void vhost_dev_flush(struct vhost_dev *dev) +{ + vhost_worker_flush(dev->worker); +} EXPORT_SYMBOL_GPL(vhost_dev_flush); /* A lockless hint for busy polling code to exit the loop */ diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index e2dd4558a1f9..f208f9923c88 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -198,6 +198,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *, struct vhost_log *log, unsigned int *log_num); void vhost_discard_vq_desc(struct vhost_virtqueue *, int n); +void vhost_vq_flush(struct vhost_virtqueue *vq); bool vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work); bool vhost_vq_has_work(struct vhost_virtqueue *vq); bool vhost_vq_is_setup(struct vhost_virtqueue *vq); -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v9 01/17] vhost: create worker at end of vhost_dev_set_owner
vsock can start queueing work after VHOST_VSOCK_SET_GUEST_CID, so after we have called vhost_worker_create it can be calling vhost_work_queue and trying to access the vhost worker/task. If vhost_dev_alloc_iovecs fails, then vhost_worker_free could free the worker/task from under vsock. This moves vhost_worker_create to the end of vhost_dev_set_owner where we know we can no longer fail in that path. If it fails after the VHOST_SET_OWNER and userspace closes the device, then the normal vsock release handling will do the right thing. Signed-off-by: Mike Christie --- drivers/vhost/vhost.c | 19 +-- 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 60c9ebd629dd..82966ffb4a5c 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -572,20 +572,27 @@ long vhost_dev_set_owner(struct vhost_dev *dev) vhost_attach_mm(dev); + err = vhost_dev_alloc_iovecs(dev); + if (err) + goto err_iovecs; + if (dev->use_worker) { + /* +* This should be done last, because vsock can queue work +* before VHOST_SET_OWNER so it simplifies the failure path +* below since we don't have to worry about vsock queueing +* while we free the worker. +*/ err = vhost_worker_create(dev); if (err) goto err_worker; } - err = vhost_dev_alloc_iovecs(dev); - if (err) - goto err_iovecs; - return 0; -err_iovecs: - vhost_worker_free(dev); + err_worker: + vhost_dev_free_iovecs(dev); +err_iovecs: vhost_detach_mm(dev); err_mm: return err; -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v9 00/17] vhost: multiple worker support
The following patches were built over Linux's tree. The also apply over the mst vhost branch if you revert the previous vhost worker patchset. The patches allow us to support multiple vhost workers tasks per device. The design is a modified version of Stefan's original idea where userspace has the kernel create a worker and we pass back the pid. In this version instead of passing the pid between user/kernel space we use a worker_id which is just an integer managed by the vhost driver and we allow userspace to create and free workers and then attach them to virtqueues at setup time. All review comments from the past reviews should be handled. If I didn't reply to a review comment, I agreed with the comment and should have handled it in this posting. Let me know if I missed one. Results: fio jobs1 2 4 8 12 16 -- 1 worker160k 488k - - - - worker per vq 160k 310k620k1182K1564K 2002k Notes: 0. This used a simple fio command: fio --filename=/dev/sdb --direct=1 --rw=randrw --bs=4k \ --ioengine=libaio --iodepth=128 --numjobs=$JOBS_ABOVE and I used a VM with 16 vCPUs and 16 virtqueues. 1. The patches were tested with LIO's emulate_pr=0 which drops the LIO PR lock use. This was a bottleneck at around 12 vqs/jobs. 2. Because we have a hard limit of 1024 cmds, if the num jobs * iodepth was greater than 1024, I would decrease iodepth. So 12 jobs used 85 cmds, and 16 used 64. 3. The perf issue above at 2 jobs is because when we only have 1 worker we execute more cmds per vhost_work due to all vqs funneling to one worker. I have 2 patches that fix this. One is just submit from the worker thread instead of kikcing off to a workqueue like how the vhost block patches do. I'll post this after the worker support is merged. I'm also working on patches to add back batching during lio completion and do polling on the submission side. We will still want the threading patches, because if we batch at the fio level plus use the vhost theading patches, we can see a big boost like below. So hopefully doing it at the kernel will allow apps to just work without having to be smart like fio. fio using io_uring and batching with the iodepth_batch* settings: fio jobs1 2 4 8 12 16 - 1 worker494k520k- - - - worker per vq 496k878k1542k 2436k 2304k 2590k V9: - Fix vhost_dev_reset_owner handling. Make sure a virtqueue's worker is cleared so when the vhost_dev is reused the old worker is not used. - Remove old/unused attach ioctl copy to user code. V8: - Rebase. - Move comments about assumptions so it's in the body of the code instead of top of function so it's more clear. - Added patch for RCU support and swapping workers while works are running. V7: - Add more comments about assumptions. - Drop refcounting and just use an attachment_cnt variable, so there is no confusion about when a worker is freed. - Do a opt-in model, because vsiock has an issue where it can queue works before it's running and it doesn't even need multiple workers, so there is no point in chaning the driver or core code. - Add back get worker ioctl. - Broke up last patches to make it easier to read. V6: - Rebase against vhost_task patchset. - Used xa instead of idr. V5: - Rebase against user_worker patchset. - Rebase against flush patchset. - Redo vhost-scsi tmf flush handling so it doesn't access vq->worker. V4: - fix vhost-sock VSOCK_VQ_RX use. - name functions called directly by ioctl cmd's to match the ioctl cmd. - break up VHOST_SET_VRING_WORKER into a new, free and attach cmd. - document worker lifetime, and cgroup, namespace, mm, rlimit inheritance, make it clear we currently only support sharing within the device. - add support to attach workers while IO is running. - instead of passing a pid_t of the kernel thread, pass a int allocated by the vhost layer with an idr. V3: - fully convert vhost code to use vq based APIs instead of leaving it half per dev and half per vq. - rebase against kernel worker API. - Drop delayed worker creation. We always create the default worker at VHOST_SET_OWNER time. Userspace can create and bind workers after that. V2: - change loop that we take a refcount to the worker in - replaced pid == -1 with define. - fixed tabbing/spacing coding style issue - use hash instead of list to lookup workers. - I dropped the patch that added an ioctl cmd to get a vq's worker's pid. I saw we might do a generic netlink interface instead. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v9 07/17] vhost: convert poll work to be vq based
This has the drivers pass in their poll to vq mapping and then converts the core poll code to use the vq based helpers. In the next patches we will allow vqs to be handled by different workers, so to allow drivers to execute operations like queue, stop, flush, etc on specific polls/vqs we need to know the mappings. Signed-off-by: Mike Christie --- drivers/vhost/net.c | 6 -- drivers/vhost/vhost.c | 8 +--- drivers/vhost/vhost.h | 4 +++- 3 files changed, 12 insertions(+), 6 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 98bb957eb3b9..f2ed7167c848 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -1347,8 +1347,10 @@ static int vhost_net_open(struct inode *inode, struct file *f) VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true, NULL); - vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev); - vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev); + vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev, + vqs[VHOST_NET_VQ_TX]); + vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev, + vqs[VHOST_NET_VQ_RX]); f->private_data = n; n->page_frag.page = NULL; diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 2ea107e867a1..b6149ed8acb4 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -187,13 +187,15 @@ EXPORT_SYMBOL_GPL(vhost_work_init); /* Init poll structure */ void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn, -__poll_t mask, struct vhost_dev *dev) +__poll_t mask, struct vhost_dev *dev, +struct vhost_virtqueue *vq) { init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup); init_poll_funcptr(&poll->table, vhost_poll_func); poll->mask = mask; poll->dev = dev; poll->wqh = NULL; + poll->vq = vq; vhost_work_init(&poll->work, fn); } @@ -297,7 +299,7 @@ EXPORT_SYMBOL_GPL(vhost_vq_has_work); void vhost_poll_queue(struct vhost_poll *poll) { - vhost_work_queue(poll->dev, &poll->work); + vhost_vq_work_queue(poll->vq, &poll->work); } EXPORT_SYMBOL_GPL(vhost_poll_queue); @@ -508,7 +510,7 @@ void vhost_dev_init(struct vhost_dev *dev, vhost_vq_reset(dev, vq); if (vq->handle_kick) vhost_poll_init(&vq->poll, vq->handle_kick, - EPOLLIN, dev); + EPOLLIN, dev, vq); } } EXPORT_SYMBOL_GPL(vhost_dev_init); diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index f208f9923c88..3882266fbbf3 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -41,13 +41,15 @@ struct vhost_poll { struct vhost_work work; __poll_tmask; struct vhost_dev*dev; + struct vhost_virtqueue *vq; }; void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn); bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work); void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn, -__poll_t mask, struct vhost_dev *dev); +__poll_t mask, struct vhost_dev *dev, +struct vhost_virtqueue *vq); int vhost_poll_start(struct vhost_poll *poll, struct file *file); void vhost_poll_stop(struct vhost_poll *poll); void vhost_poll_queue(struct vhost_poll *poll); -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [syzbot] [net?] [virt?] [kvm?] KASAN: slab-use-after-free Read in __vhost_vq_attach_worker
On 6/26/23 2:15 AM, Michael S. Tsirkin wrote: > On Mon, Jun 26, 2023 at 12:06:54AM -0700, syzbot wrote: >> Hello, >> >> syzbot found the following issue on: >> >> HEAD commit:8d2be868b42c Add linux-next specific files for 20230623 >> git tree: linux-next >> console+strace: https://syzkaller.appspot.com/x/log.txt?x=12872950a8 >> kernel config: https://syzkaller.appspot.com/x/.config?x=d8ac8dd33677e8e0 >> dashboard link: https://syzkaller.appspot.com/bug?extid=8540db210d403f1aa214 >> compiler: gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils >> for Debian) 2.35.2 >> syz repro: https://syzkaller.appspot.com/x/repro.syz?x=15c1b70f28 >> C reproducer: https://syzkaller.appspot.com/x/repro.c?x=122ee4cb28 >> >> Downloadable assets: >> disk image: >> https://storage.googleapis.com/syzbot-assets/2a004483aca3/disk-8d2be868.raw.xz >> vmlinux: >> https://storage.googleapis.com/syzbot-assets/5688cb13b277/vmlinux-8d2be868.xz >> kernel image: >> https://storage.googleapis.com/syzbot-assets/76de0b63bc53/bzImage-8d2be868.xz >> >> The issue was bisected to: >> >> commit 21a18f4a51896fde11002165f0e7340f4131d6a0 >> Author: Mike Christie >> Date: Tue Jun 13 01:32:46 2023 + >> >> vhost: allow userspace to create workers >> >> bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=130850bf28 >> final oops: https://syzkaller.appspot.com/x/report.txt?x=108850bf28 >> console output: https://syzkaller.appspot.com/x/log.txt?x=170850bf28 >> >> IMPORTANT: if you fix the issue, please add the following tag to the commit: >> Reported-by: syzbot+8540db210d403f1aa...@syzkaller.appspotmail.com >> Fixes: 21a18f4a5189 ("vhost: allow userspace to create workers") > > Mike, would appreciate prompt attention to this as I am preparing > a pull request for the merge window and need to make a > decision on whether to include your userspace-controlled > threading patchset. > Do you want me to resubmit the patchset or submit a patch against your vhost branch? The bug is that vhost-net can call vhost_dev_reset_owner and that will free the workers. However, I leave the virtqueue->worker pointer set so we end up referencing the freed workers later on. When I handled a review comment between v5 and v6, I deleted that code thinking it was also not needed. So the fix is: diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index ab79b064aade..5a07e220e46d 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -607,6 +607,10 @@ static void vhost_workers_free(struct vhost_dev *dev) if (!dev->use_worker) return; + + for (i = 0; i < dev->nvqs; i++) + rcu_assign_pointer(dev->vqs[i]->worker, NULL); + /* * Free the default worker we created and cleanup workers userspace * created but couldn't clean up (it forgot or crashed). ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: Can vhost translate to io_uring?
On 6/14/23 1:25 AM, michael.chris...@oracle.com wrote: > It would be nice if the vhost layer could use the io-wq code as sort of > generic worker. I can look into what that would take if Jens is ok > with that type of thing. We could use the io-wq code, but we hit the same problems as before: 1. We still need to modify the vhost drivers like I mentioned below so when the task gets SIGKILL the drivers fail instead of running the work like normal. 2. We still need some code like the patch below so when the worker task exits and is freed the vhost drivers stop calling io_wq_enqueue and don't access the io_wq anymore. 3. There's some other small things which seem easy to change like we need to create the worker thread/task_struct when io_wq_create is run instead of io_wq_enqueue. The problem is that we can queue work from threads that have different mms than we want to use. I've done #2 in the patch below. I'm almost done with #1. Just testing it now. When that's done, we can remove the signal hacks and then decide if we want to go further and switch to io-wq. > > For vhost, I just submitted a patch to the vhost layer that allow us to > switch out the vhost worker thread when IO is running: > > https://lists.linuxfoundation.org/pipermail/virtualization/2023-June/067246.html > > After that patch, I just need to modify vhost_worker/vhost_task_fn so > when get_signal returns true we set the worker to NULL and do a > synchronize_rcu. > Then I just need to modify vhost-scsi so it detects when the worker is NULL > and modify the flush code so it handles work that didn't get to run. > > > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v8 17/17] vhost: Allow worker switching while work is queueing
On 6/12/23 8:32 PM, Mike Christie wrote: > This patch drops the requirement that we can only switch workers if work > has not been queued by using RCU for the vq based queueing paths and a > mutex for the device wide flush. > > We can also use this to support SIGKILL properly in the future where we > should exit almost immediately after getting that signal. With this > patch, when get_signal returns true, we can set the vq->worker to NULL > and do a synchronize_rcu to prevent new work from being queued to the > vhost_task that has been killed. > Hey Jason and Stefano, just an update on why we now have this extra patch. Jason, in one of the last reviews you were asking about supporting switching workers when works are queueing and I had said it's probably not worth it, because it adds come complexity that might not be used. Stefano, it sounded like you preferred RCU to handle when we are adding the initial worker while vsock is possibly queueing works. It turns out the signal/fork developers added some hacks to their code for the vhost_task patches to support SIGKILL but I think they want us to eventually remove them. To do that, we need a way to handle where the vhost_worker/vhost_task exits while work is being queued. To do this we need the same as the above (instead of a new worker it would be NULL though). So to handle all these requests, I ended up adding this last patch. I wasn't sure if you wanted it to go in at the same time as the initial multiple worker patches since it changes the behavior the interface can support or separate when I fix up the SIGKILL code. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v8 14/17] vhost: replace single worker pointer with xarray
The next patch allows userspace to create multiple workers per device, so this patch replaces the vhost_worker pointer with an xarray so we can store mupltiple workers and look them up. Signed-off-by: Mike Christie --- drivers/vhost/vhost.c | 61 +++ drivers/vhost/vhost.h | 3 ++- 2 files changed, 47 insertions(+), 17 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 0db093dfaa22..f07f92a67e59 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -280,7 +280,11 @@ EXPORT_SYMBOL_GPL(vhost_vq_flush); void vhost_dev_flush(struct vhost_dev *dev) { - vhost_worker_flush(dev->worker); + struct vhost_worker *worker; + unsigned long i; + + xa_for_each(&dev->worker_xa, i, worker) + vhost_worker_flush(worker); } EXPORT_SYMBOL_GPL(vhost_dev_flush); @@ -481,7 +485,6 @@ void vhost_dev_init(struct vhost_dev *dev, dev->umem = NULL; dev->iotlb = NULL; dev->mm = NULL; - dev->worker = NULL; dev->iov_limit = iov_limit; dev->weight = weight; dev->byte_weight = byte_weight; @@ -491,7 +494,7 @@ void vhost_dev_init(struct vhost_dev *dev, INIT_LIST_HEAD(&dev->read_list); INIT_LIST_HEAD(&dev->pending_list); spin_lock_init(&dev->iotlb_lock); - + xa_init_flags(&dev->worker_xa, XA_FLAGS_ALLOC); for (i = 0; i < dev->nvqs; ++i) { vq = dev->vqs[i]; @@ -554,15 +557,32 @@ static void vhost_detach_mm(struct vhost_dev *dev) dev->mm = NULL; } -static void vhost_worker_free(struct vhost_dev *dev) +static void vhost_worker_destroy(struct vhost_dev *dev, +struct vhost_worker *worker) { - if (!dev->worker) + if (!worker) return; - WARN_ON(!llist_empty(&dev->worker->work_list)); - vhost_task_stop(dev->worker->vtsk); - kfree(dev->worker); - dev->worker = NULL; + WARN_ON(!llist_empty(&worker->work_list)); + xa_erase(&dev->worker_xa, worker->id); + vhost_task_stop(worker->vtsk); + kfree(worker); +} + +static void vhost_workers_free(struct vhost_dev *dev) +{ + struct vhost_worker *worker; + unsigned long i; + + if (!dev->use_worker) + return; + /* +* Free the default worker we created and cleanup workers userspace +* created but couldn't clean up (it forgot or crashed). +*/ + xa_for_each(&dev->worker_xa, i, worker) + vhost_worker_destroy(dev, worker); + xa_destroy(&dev->worker_xa); } static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev) @@ -570,6 +590,8 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev) struct vhost_worker *worker; struct vhost_task *vtsk; char name[TASK_COMM_LEN]; + int ret; + u32 id; worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT); if (!worker) @@ -584,16 +606,18 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev) init_llist_head(&worker->work_list); worker->kcov_handle = kcov_common_handle(); worker->vtsk = vtsk; - /* -* vsock can already try to queue so make sure llist and vtsk are both -* set before vhost_work_queue sees dev->worker is set. -*/ - smp_wmb(); - dev->worker = worker; vhost_task_start(vtsk); + + ret = xa_alloc(&dev->worker_xa, &id, worker, xa_limit_32b, GFP_KERNEL); + if (ret < 0) + goto stop_worker; + worker->id = id; + return worker; +stop_worker: + vhost_task_stop(vtsk); free_worker: kfree(worker); return NULL; @@ -650,6 +674,11 @@ long vhost_dev_set_owner(struct vhost_dev *dev) err = -ENOMEM; goto err_worker; } + /* +* vsock can already try to queue so make sure the worker +* is setup before vhost_vq_work_queue sees vq->worker is set. +*/ + smp_wmb(); for (i = 0; i < dev->nvqs; i++) dev->vqs[i]->worker = worker; @@ -751,7 +780,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev) dev->iotlb = NULL; vhost_clear_msg(dev); wake_up_interruptible_poll(&dev->wait, EPOLLIN | EPOLLRDNORM); - vhost_worker_free(dev); + vhost_workers_free(dev); vhost_detach_mm(dev); } EXPORT_SYMBOL_GPL(vhost_dev_cleanup); diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index b850f534bc9a..31937e98c01b 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -30,6 +30,7 @@ struct vhost_worker {
[PATCH v8 17/17] vhost: Allow worker switching while work is queueing
This patch drops the requirement that we can only switch workers if work has not been queued by using RCU for the vq based queueing paths and a mutex for the device wide flush. We can also use this to support SIGKILL properly in the future where we should exit almost immediately after getting that signal. With this patch, when get_signal returns true, we can set the vq->worker to NULL and do a synchronize_rcu to prevent new work from being queued to the vhost_task that has been killed. Signed-off-by: Mike Christie --- drivers/vhost/vhost.c | 151 +++-- drivers/vhost/vhost.h | 4 +- include/uapi/linux/vhost.h | 4 +- 3 files changed, 114 insertions(+), 45 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index a4f97b56f1b4..ab79b064aade 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -233,16 +233,9 @@ void vhost_poll_stop(struct vhost_poll *poll) } EXPORT_SYMBOL_GPL(vhost_poll_stop); -static bool vhost_worker_queue(struct vhost_worker *worker, +static void vhost_worker_queue(struct vhost_worker *worker, struct vhost_work *work) { - if (!worker) - return false; - /* -* vsock can queue while we do a VHOST_SET_OWNER, so we have a smp_wmb -* when setting up the worker. We don't have a smp_rmb here because -* test_and_set_bit gives us a mb already. -*/ if (!test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) { /* We can only add the work to the list after we're * sure it was not in the list. @@ -251,47 +244,85 @@ static bool vhost_worker_queue(struct vhost_worker *worker, llist_add(&work->node, &worker->work_list); vhost_task_wake(worker->vtsk); } - - return true; } bool vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work) { - return vhost_worker_queue(vq->worker, work); + struct vhost_worker *worker; + bool queued = false; + + rcu_read_lock(); + worker = rcu_dereference(vq->worker); + if (worker) { + queued = true; + vhost_worker_queue(worker, work); + } + rcu_read_unlock(); + + return queued; } EXPORT_SYMBOL_GPL(vhost_vq_work_queue); -static void vhost_worker_flush(struct vhost_worker *worker) +void vhost_vq_flush(struct vhost_virtqueue *vq) { struct vhost_flush_struct flush; init_completion(&flush.wait_event); vhost_work_init(&flush.work, vhost_flush_work); - if (vhost_worker_queue(worker, &flush.work)) + if (vhost_vq_work_queue(vq, &flush.work)) wait_for_completion(&flush.wait_event); } +EXPORT_SYMBOL_GPL(vhost_vq_flush); -void vhost_vq_flush(struct vhost_virtqueue *vq) +/** + * vhost_worker_flush - flush a worker + * @worker: worker to flush + * + * This does not use RCU to protect the worker, so the device or worker + * mutex must be held. + */ +static void vhost_worker_flush(struct vhost_worker *worker) { - vhost_worker_flush(vq->worker); + struct vhost_flush_struct flush; + + init_completion(&flush.wait_event); + vhost_work_init(&flush.work, vhost_flush_work); + + vhost_worker_queue(worker, &flush.work); + wait_for_completion(&flush.wait_event); } -EXPORT_SYMBOL_GPL(vhost_vq_flush); void vhost_dev_flush(struct vhost_dev *dev) { struct vhost_worker *worker; unsigned long i; - xa_for_each(&dev->worker_xa, i, worker) + xa_for_each(&dev->worker_xa, i, worker) { + mutex_lock(&worker->mutex); + if (!worker->attachment_cnt) { + mutex_unlock(&worker->mutex); + continue; + } vhost_worker_flush(worker); + mutex_unlock(&worker->mutex); + } } EXPORT_SYMBOL_GPL(vhost_dev_flush); /* A lockless hint for busy polling code to exit the loop */ bool vhost_vq_has_work(struct vhost_virtqueue *vq) { - return !llist_empty(&vq->worker->work_list); + struct vhost_worker *worker; + bool has_work = false; + + rcu_read_lock(); + worker = rcu_dereference(vq->worker); + if (worker && !llist_empty(&worker->work_list)) + has_work = true; + rcu_read_unlock(); + + return has_work; } EXPORT_SYMBOL_GPL(vhost_vq_has_work); @@ -501,7 +532,7 @@ void vhost_dev_init(struct vhost_dev *dev, vq->log = NULL; vq->indirect = NULL; vq->heads = NULL; - vq->worker = NULL; + rcu_assign_pointer(vq->worker, NULL); vq->dev = dev; mutex_init(&vq->mutex);
[PATCH v8 16/17] vhost_scsi: add support for worker ioctls
This has vhost-scsi support the worker ioctls by calling the vhost_worker_ioctl helper. With a single worker, the single thread becomes a bottlneck when trying to use 3 or more virtqueues like: fio --filename=/dev/sdb --direct=1 --rw=randrw --bs=4k \ --ioengine=libaio --iodepth=128 --numjobs=3 With the patches and doing a worker per vq, we can scale to at least 16 vCPUs/vqs (that's my system limit) with the same command fio command above with numjobs=16: fio --filename=/dev/sdb --direct=1 --rw=randrw --bs=4k \ --ioengine=libaio --iodepth=64 --numjobs=16 which gives around 2002K IOPs. Note that for testing I dropped depth to 64 above because the vhost/virt layer supports only 1024 total commands per device. And the only tuning I did was set LIO's emulate_pr to 0 to avoid LIO's PR lock in the main IO path which becomes an issue at around 12 jobs/virtqueues. Signed-off-by: Mike Christie --- drivers/vhost/scsi.c | 8 1 file changed, 8 insertions(+) diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c index 2c3cf487cc71..c83f7f043470 100644 --- a/drivers/vhost/scsi.c +++ b/drivers/vhost/scsi.c @@ -1927,6 +1927,14 @@ vhost_scsi_ioctl(struct file *f, if (copy_from_user(&features, featurep, sizeof features)) return -EFAULT; return vhost_scsi_set_features(vs, features); + case VHOST_NEW_WORKER: + case VHOST_FREE_WORKER: + case VHOST_ATTACH_VRING_WORKER: + case VHOST_GET_VRING_WORKER: + mutex_lock(&vs->dev.mutex); + r = vhost_worker_ioctl(&vs->dev, ioctl, argp); + mutex_unlock(&vs->dev.mutex); + return r; default: mutex_lock(&vs->dev.mutex); r = vhost_dev_ioctl(&vs->dev, ioctl, argp); -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v8 10/17] vhost_scsi: convert to vhost_vq_work_queue
Convert from vhost_work_queue to vhost_vq_work_queue so we can remove vhost_work_queue. Signed-off-by: Mike Christie --- drivers/vhost/scsi.c | 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c index a77c53bb035a..1668009bd489 100644 --- a/drivers/vhost/scsi.c +++ b/drivers/vhost/scsi.c @@ -353,8 +353,9 @@ static void vhost_scsi_release_cmd(struct se_cmd *se_cmd) if (se_cmd->se_cmd_flags & SCF_SCSI_TMR_CDB) { struct vhost_scsi_tmf *tmf = container_of(se_cmd, struct vhost_scsi_tmf, se_cmd); + struct vhost_virtqueue *vq = &tmf->svq->vq; - vhost_work_queue(&tmf->vhost->dev, &tmf->vwork); + vhost_vq_work_queue(vq, &tmf->vwork); } else { struct vhost_scsi_cmd *cmd = container_of(se_cmd, struct vhost_scsi_cmd, tvc_se_cmd); @@ -1332,11 +1333,9 @@ static void vhost_scsi_ctl_handle_kick(struct vhost_work *work) } static void -vhost_scsi_send_evt(struct vhost_scsi *vs, - struct vhost_scsi_tpg *tpg, - struct se_lun *lun, - u32 event, - u32 reason) +vhost_scsi_send_evt(struct vhost_scsi *vs, struct vhost_virtqueue *vq, + struct vhost_scsi_tpg *tpg, struct se_lun *lun, + u32 event, u32 reason) { struct vhost_scsi_evt *evt; @@ -1358,7 +1357,7 @@ vhost_scsi_send_evt(struct vhost_scsi *vs, } llist_add(&evt->list, &vs->vs_event_list); - vhost_work_queue(&vs->dev, &vs->vs_event_work); + vhost_vq_work_queue(vq, &vs->vs_event_work); } static void vhost_scsi_evt_handle_kick(struct vhost_work *work) @@ -1372,7 +1371,8 @@ static void vhost_scsi_evt_handle_kick(struct vhost_work *work) goto out; if (vs->vs_events_missed) - vhost_scsi_send_evt(vs, NULL, NULL, VIRTIO_SCSI_T_NO_EVENT, 0); + vhost_scsi_send_evt(vs, vq, NULL, NULL, VIRTIO_SCSI_T_NO_EVENT, + 0); out: mutex_unlock(&vq->mutex); } @@ -1991,7 +1991,7 @@ vhost_scsi_do_plug(struct vhost_scsi_tpg *tpg, goto unlock; if (vhost_has_feature(vq, VIRTIO_SCSI_F_HOTPLUG)) - vhost_scsi_send_evt(vs, tpg, lun, + vhost_scsi_send_evt(vs, vq, tpg, lun, VIRTIO_SCSI_T_TRANSPORT_RESET, reason); unlock: mutex_unlock(&vq->mutex); -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v8 02/17] vhost: dynamically allocate vhost_worker
This patchset allows us to allocate multiple workers, so this has us move from the vhost_worker that's embedded in the vhost_dev to dynamically allocating it. Signed-off-by: Mike Christie --- drivers/vhost/vhost.c | 66 --- drivers/vhost/vhost.h | 4 +-- 2 files changed, 45 insertions(+), 25 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 82966ffb4a5c..dfd96cf6a152 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -235,36 +235,40 @@ void vhost_dev_flush(struct vhost_dev *dev) { struct vhost_flush_struct flush; - if (dev->worker.vtsk) { - init_completion(&flush.wait_event); - vhost_work_init(&flush.work, vhost_flush_work); + init_completion(&flush.wait_event); + vhost_work_init(&flush.work, vhost_flush_work); - vhost_work_queue(dev, &flush.work); + if (vhost_work_queue(dev, &flush.work)) wait_for_completion(&flush.wait_event); - } } EXPORT_SYMBOL_GPL(vhost_dev_flush); -void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work) +bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work) { - if (!dev->worker.vtsk) - return; - + if (!dev->worker) + return false; + /* +* vsock can queue while we do a VHOST_SET_OWNER, so we have a smp_wmb +* when setting up the worker. We don't have a smp_rmb here because +* test_and_set_bit gives us a mb already. +*/ if (!test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) { /* We can only add the work to the list after we're * sure it was not in the list. * test_and_set_bit() implies a memory barrier. */ - llist_add(&work->node, &dev->worker.work_list); - vhost_task_wake(dev->worker.vtsk); + llist_add(&work->node, &dev->worker->work_list); + vhost_task_wake(dev->worker->vtsk); } + + return true; } EXPORT_SYMBOL_GPL(vhost_work_queue); /* A lockless hint for busy polling code to exit the loop */ bool vhost_has_work(struct vhost_dev *dev) { - return !llist_empty(&dev->worker.work_list); + return !llist_empty(&dev->worker->work_list); } EXPORT_SYMBOL_GPL(vhost_has_work); @@ -458,8 +462,7 @@ void vhost_dev_init(struct vhost_dev *dev, dev->umem = NULL; dev->iotlb = NULL; dev->mm = NULL; - memset(&dev->worker, 0, sizeof(dev->worker)); - init_llist_head(&dev->worker.work_list); + dev->worker = NULL; dev->iov_limit = iov_limit; dev->weight = weight; dev->byte_weight = byte_weight; @@ -533,30 +536,47 @@ static void vhost_detach_mm(struct vhost_dev *dev) static void vhost_worker_free(struct vhost_dev *dev) { - if (!dev->worker.vtsk) + if (!dev->worker) return; - WARN_ON(!llist_empty(&dev->worker.work_list)); - vhost_task_stop(dev->worker.vtsk); - dev->worker.kcov_handle = 0; - dev->worker.vtsk = NULL; + WARN_ON(!llist_empty(&dev->worker->work_list)); + vhost_task_stop(dev->worker->vtsk); + kfree(dev->worker); + dev->worker = NULL; } static int vhost_worker_create(struct vhost_dev *dev) { + struct vhost_worker *worker; struct vhost_task *vtsk; char name[TASK_COMM_LEN]; + worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT); + if (!worker) + return -ENOMEM; + snprintf(name, sizeof(name), "vhost-%d", current->pid); - vtsk = vhost_task_create(vhost_worker, &dev->worker, name); + vtsk = vhost_task_create(vhost_worker, worker, name); if (!vtsk) - return -ENOMEM; + goto free_worker; + + init_llist_head(&worker->work_list); + worker->kcov_handle = kcov_common_handle(); + worker->vtsk = vtsk; + /* +* vsock can already try to queue so make sure llist and vtsk are both +* set before vhost_work_queue sees dev->worker is set. +*/ + smp_wmb(); + dev->worker = worker; - dev->worker.kcov_handle = kcov_common_handle(); - dev->worker.vtsk = vtsk; vhost_task_start(vtsk); return 0; + +free_worker: + kfree(worker); + return -ENOMEM; } /* Caller should have device mutex */ diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index fc900be504b3..cb872cc4157a 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -44,7 +44,7 @@ struct vhost_poll { }; void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn); -void vhost_w
[PATCH v8 11/17] vhost_scsi: flush IO vqs then send TMF rsp
With one worker we will always send the scsi cmd responses then send the TMF rsp, because LIO will always complete the scsi cmds first then call into us to send the TMF response. With multiple workers, the IO vq workers could be running while the TMF/ctl vq worker is running so this has us do a flush before completing the TMF to make sure cmds are completed when it's work is later queued and run. Signed-off-by: Mike Christie --- drivers/vhost/scsi.c | 21 ++--- 1 file changed, 18 insertions(+), 3 deletions(-) diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c index 1668009bd489..2c3cf487cc71 100644 --- a/drivers/vhost/scsi.c +++ b/drivers/vhost/scsi.c @@ -1133,12 +1133,27 @@ static void vhost_scsi_tmf_resp_work(struct vhost_work *work) { struct vhost_scsi_tmf *tmf = container_of(work, struct vhost_scsi_tmf, vwork); - int resp_code; + struct vhost_virtqueue *ctl_vq, *vq; + int resp_code, i; + + if (tmf->scsi_resp == TMR_FUNCTION_COMPLETE) { + /* +* Flush IO vqs that don't share a worker with the ctl to make +* sure they have sent their responses before us. +*/ + ctl_vq = &tmf->vhost->vqs[VHOST_SCSI_VQ_CTL].vq; + for (i = VHOST_SCSI_VQ_IO; i < tmf->vhost->dev.nvqs; i++) { + vq = &tmf->vhost->vqs[i].vq; + + if (vhost_vq_is_setup(vq) && + vq->worker != ctl_vq->worker) + vhost_vq_flush(vq); + } - if (tmf->scsi_resp == TMR_FUNCTION_COMPLETE) resp_code = VIRTIO_SCSI_S_FUNCTION_SUCCEEDED; - else + } else { resp_code = VIRTIO_SCSI_S_FUNCTION_REJECTED; + } vhost_scsi_send_tmf_resp(tmf->vhost, &tmf->svq->vq, tmf->in_iovs, tmf->vq_desc, &tmf->resp_iov, resp_code); -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v8 07/17] vhost: convert poll work to be vq based
This has the drivers pass in their poll to vq mapping and then converts the core poll code to use the vq based helpers. In the next patches we will allow vqs to be handled by different workers, so to allow drivers to execute operations like queue, stop, flush, etc on specific polls/vqs we need to know the mappings. Signed-off-by: Mike Christie --- drivers/vhost/net.c | 6 -- drivers/vhost/vhost.c | 8 +--- drivers/vhost/vhost.h | 4 +++- 3 files changed, 12 insertions(+), 6 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 98bb957eb3b9..f2ed7167c848 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -1347,8 +1347,10 @@ static int vhost_net_open(struct inode *inode, struct file *f) VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true, NULL); - vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev); - vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev); + vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev, + vqs[VHOST_NET_VQ_TX]); + vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev, + vqs[VHOST_NET_VQ_RX]); f->private_data = n; n->page_frag.page = NULL; diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index d212cee064d6..9623264bb3a7 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -187,13 +187,15 @@ EXPORT_SYMBOL_GPL(vhost_work_init); /* Init poll structure */ void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn, -__poll_t mask, struct vhost_dev *dev) +__poll_t mask, struct vhost_dev *dev, +struct vhost_virtqueue *vq) { init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup); init_poll_funcptr(&poll->table, vhost_poll_func); poll->mask = mask; poll->dev = dev; poll->wqh = NULL; + poll->vq = vq; vhost_work_init(&poll->work, fn); } @@ -297,7 +299,7 @@ EXPORT_SYMBOL_GPL(vhost_vq_has_work); void vhost_poll_queue(struct vhost_poll *poll) { - vhost_work_queue(poll->dev, &poll->work); + vhost_vq_work_queue(poll->vq, &poll->work); } EXPORT_SYMBOL_GPL(vhost_poll_queue); @@ -508,7 +510,7 @@ void vhost_dev_init(struct vhost_dev *dev, vhost_vq_reset(dev, vq); if (vq->handle_kick) vhost_poll_init(&vq->poll, vq->handle_kick, - EPOLLIN, dev); + EPOLLIN, dev, vq); } } EXPORT_SYMBOL_GPL(vhost_dev_init); diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index ac1f4924e548..f44eecd4fcf9 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -41,13 +41,15 @@ struct vhost_poll { struct vhost_work work; __poll_tmask; struct vhost_dev*dev; + struct vhost_virtqueue *vq; }; void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn); bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work); void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn, -__poll_t mask, struct vhost_dev *dev); +__poll_t mask, struct vhost_dev *dev, +struct vhost_virtqueue *vq); int vhost_poll_start(struct vhost_poll *poll, struct file *file); void vhost_poll_stop(struct vhost_poll *poll); void vhost_poll_queue(struct vhost_poll *poll); -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v8 13/17] vhost: add helper to parse userspace vring state/file
The next patches add new vhost worker ioctls which will need to get a vhost_virtqueue from a userspace struct which specifies the vq's index. This moves the vhost_vring_ioctl code to do this to a helper so it can be shared. Signed-off-by: Mike Christie --- drivers/vhost/vhost.c | 29 ++--- 1 file changed, 22 insertions(+), 7 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 176d5fcd4b58..0db093dfaa22 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -599,6 +599,27 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev) return NULL; } +static int vhost_get_vq_from_user(struct vhost_dev *dev, void __user *argp, + struct vhost_virtqueue **vq, u32 *id) +{ + u32 __user *idxp = argp; + u32 idx; + long r; + + r = get_user(idx, idxp); + if (r < 0) + return r; + + if (idx >= dev->nvqs) + return -ENOBUFS; + + idx = array_index_nospec(idx, dev->nvqs); + + *vq = dev->vqs[idx]; + *id = idx; + return 0; +} + /* Caller should have device mutex */ long vhost_dev_set_owner(struct vhost_dev *dev) { @@ -1618,21 +1639,15 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg struct file *eventfp, *filep = NULL; bool pollstart = false, pollstop = false; struct eventfd_ctx *ctx = NULL; - u32 __user *idxp = argp; struct vhost_virtqueue *vq; struct vhost_vring_state s; struct vhost_vring_file f; u32 idx; long r; - r = get_user(idx, idxp); + r = vhost_get_vq_from_user(d, argp, &vq, &idx); if (r < 0) return r; - if (idx >= d->nvqs) - return -ENOBUFS; - - idx = array_index_nospec(idx, d->nvqs); - vq = d->vqs[idx]; if (ioctl == VHOST_SET_VRING_NUM || ioctl == VHOST_SET_VRING_ADDR) { -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v8 04/17] vhost, vhost_net: add helper to check if vq has work
In the next patches each vq might have different workers so one could have work but others do not. For net, we only want to check specific vqs, so this adds a helper to check if a vq has work pending and converts vhost-net to use it. Signed-off-by: Mike Christie Acked-by: Jason Wang --- drivers/vhost/net.c | 2 +- drivers/vhost/vhost.c | 6 +++--- drivers/vhost/vhost.h | 2 +- 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index ae2273196b0c..98bb957eb3b9 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -546,7 +546,7 @@ static void vhost_net_busy_poll(struct vhost_net *net, endtime = busy_clock() + busyloop_timeout; while (vhost_can_busy_poll(endtime)) { - if (vhost_has_work(&net->dev)) { + if (vhost_vq_has_work(vq)) { *busyloop_intr = true; break; } diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index c78c15af97d3..a832ca692eb1 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -266,11 +266,11 @@ bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work) EXPORT_SYMBOL_GPL(vhost_work_queue); /* A lockless hint for busy polling code to exit the loop */ -bool vhost_has_work(struct vhost_dev *dev) +bool vhost_vq_has_work(struct vhost_virtqueue *vq) { - return !llist_empty(&dev->worker->work_list); + return !llist_empty(&vq->worker->work_list); } -EXPORT_SYMBOL_GPL(vhost_has_work); +EXPORT_SYMBOL_GPL(vhost_vq_has_work); void vhost_poll_queue(struct vhost_poll *poll) { diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index 206617edb2a9..37c183b37c42 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -45,7 +45,6 @@ struct vhost_poll { void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn); bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work); -bool vhost_has_work(struct vhost_dev *dev); void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn, __poll_t mask, struct vhost_dev *dev); @@ -199,6 +198,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *, struct vhost_log *log, unsigned int *log_num); void vhost_discard_vq_desc(struct vhost_virtqueue *, int n); +bool vhost_vq_has_work(struct vhost_virtqueue *vq); bool vhost_vq_is_setup(struct vhost_virtqueue *vq); int vhost_vq_init_access(struct vhost_virtqueue *); int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len); -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v8 08/17] vhost_sock: convert to vhost_vq_work_queue
Convert from vhost_work_queue to vhost_vq_work_queue, so we can drop vhost_work_queue. Signed-off-by: Mike Christie --- drivers/vhost/vsock.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c index 6578db78f0ae..817d377a3f36 100644 --- a/drivers/vhost/vsock.c +++ b/drivers/vhost/vsock.c @@ -285,7 +285,7 @@ vhost_transport_send_pkt(struct sk_buff *skb) atomic_inc(&vsock->queued_replies); virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb); - vhost_work_queue(&vsock->dev, &vsock->send_pkt_work); + vhost_vq_work_queue(&vsock->vqs[VSOCK_VQ_RX], &vsock->send_pkt_work); rcu_read_unlock(); return len; @@ -583,7 +583,7 @@ static int vhost_vsock_start(struct vhost_vsock *vsock) /* Some packets may have been queued before the device was started, * let's kick the send worker to send them. */ - vhost_work_queue(&vsock->dev, &vsock->send_pkt_work); + vhost_vq_work_queue(&vsock->vqs[VSOCK_VQ_RX], &vsock->send_pkt_work); mutex_unlock(&vsock->dev.mutex); return 0; -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v8 15/17] vhost: allow userspace to create workers
For vhost-scsi with 3 vqs or more and a workload that tries to use them in parallel like: fio --filename=/dev/sdb --direct=1 --rw=randrw --bs=4k \ --ioengine=libaio --iodepth=128 --numjobs=3 the single vhost worker thread will become a bottlneck and we are stuck at around 500K IOPs no matter how many jobs, virtqueues, and CPUs are used. To better utilize virtqueues and available CPUs, this patch allows userspace to create workers and bind them to vqs. You can have N workers per dev and also share N workers with M vqs on that dev. This patch adds the interface related code and the next patch will hook vhost-scsi into it. The patches do not try to hook net and vsock into the interface because: 1. multiple workers don't seem to help vsock. The problem is that with only 2 virtqueues we never fully use the existing worker when doing bidirectional tests. This seems to match vhost-scsi where we don't see the worker as a bottleneck until 3 virtqueues are used. 2. net already has a way to use multiple workers. Signed-off-by: Mike Christie --- drivers/vhost/vhost.c| 141 ++- drivers/vhost/vhost.h| 3 + include/uapi/linux/vhost.h | 33 include/uapi/linux/vhost_types.h | 16 4 files changed, 192 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index f07f92a67e59..a4f97b56f1b4 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -623,6 +623,76 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev) return NULL; } +/* Caller must have device mutex */ +static void __vhost_vq_attach_worker(struct vhost_virtqueue *vq, +struct vhost_worker *worker) +{ + if (vq->worker) + vq->worker->attachment_cnt--; + worker->attachment_cnt++; + vq->worker = worker; +} + + /* Caller must have device and virtqueue mutex */ +static int vhost_vq_attach_worker(struct vhost_virtqueue *vq, + struct vhost_vring_worker *info) +{ + unsigned long index = info->worker_id; + struct vhost_dev *dev = vq->dev; + struct vhost_worker *worker; + + if (!dev->use_worker) + return -EINVAL; + + /* +* We only allow userspace to set a virtqueue's worker if it's not +* active and polling is not enabled. We also assume drivers +* supporting this will not be internally queueing works directly or +* via calls like vhost_dev_flush at this time. +*/ + if (vhost_vq_get_backend(vq) || vq->kick) + return -EBUSY; + + worker = xa_find(&dev->worker_xa, &index, UINT_MAX, XA_PRESENT); + if (!worker || worker->id != info->worker_id) + return -ENODEV; + + __vhost_vq_attach_worker(vq, worker); + return 0; +} + +/* Caller must have device mutex */ +static int vhost_new_worker(struct vhost_dev *dev, + struct vhost_worker_state *info) +{ + struct vhost_worker *worker; + + worker = vhost_worker_create(dev); + if (!worker) + return -ENOMEM; + + info->worker_id = worker->id; + return 0; +} + +/* Caller must have device mutex */ +static int vhost_free_worker(struct vhost_dev *dev, +struct vhost_worker_state *info) +{ + unsigned long index = info->worker_id; + struct vhost_worker *worker; + + worker = xa_find(&dev->worker_xa, &index, UINT_MAX, XA_PRESENT); + if (!worker || worker->id != info->worker_id) + return -ENODEV; + + if (worker->attachment_cnt) + return -EBUSY; + + vhost_worker_destroy(dev, worker); + return 0; +} + static int vhost_get_vq_from_user(struct vhost_dev *dev, void __user *argp, struct vhost_virtqueue **vq, u32 *id) { @@ -644,6 +714,75 @@ static int vhost_get_vq_from_user(struct vhost_dev *dev, void __user *argp, return 0; } +/* Caller must have device mutex */ +long vhost_worker_ioctl(struct vhost_dev *dev, unsigned int ioctl, + void __user *argp) +{ + struct vhost_vring_worker ring_worker; + struct vhost_worker_state state; + struct vhost_virtqueue *vq; + long ret; + u32 idx; + + if (!dev->use_worker) + return -EINVAL; + + if (!vhost_dev_has_owner(dev)) + return -EINVAL; + + switch (ioctl) { + /* dev worker ioctls */ + case VHOST_NEW_WORKER: + ret = vhost_new_worker(dev, &state); + if (!ret && copy_to_user(argp, &state, sizeof(state))) + ret = -EFAULT; + return ret; + case VHOST_FREE_WORKER: + if (copy_from_user(&state, argp, sizeof(state))) +
[PATCH v8 12/17] vhost: remove vhost_work_queue
vhost_work_queue is no longer used. Each driver is using the poll or vq based queueing, so remove vhost_work_queue. Signed-off-by: Mike Christie --- drivers/vhost/vhost.c | 6 -- drivers/vhost/vhost.h | 5 ++--- 2 files changed, 2 insertions(+), 9 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 9623264bb3a7..176d5fcd4b58 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -255,12 +255,6 @@ static bool vhost_worker_queue(struct vhost_worker *worker, return true; } -bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work) -{ - return vhost_worker_queue(dev->worker, work); -} -EXPORT_SYMBOL_GPL(vhost_work_queue); - bool vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work) { return vhost_worker_queue(vq->worker, work); diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index f44eecd4fcf9..b850f534bc9a 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -44,15 +44,14 @@ struct vhost_poll { struct vhost_virtqueue *vq; }; -void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn); -bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work); - void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn, __poll_t mask, struct vhost_dev *dev, struct vhost_virtqueue *vq); int vhost_poll_start(struct vhost_poll *poll, struct file *file); void vhost_poll_stop(struct vhost_poll *poll); void vhost_poll_queue(struct vhost_poll *poll); + +void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn); void vhost_dev_flush(struct vhost_dev *dev); struct vhost_log { -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v8 00/17] vhost: multiple worker support
The following patches were built over Linux's tree. The also apply over the mst vhost branch as well. The patcjes allow us to support multiple vhost workers tasks per device. The design is a modified version of Stefan's original idea where userspace has the kernel create a worker and we pass back the pid. In this version instead of passing the pid between user/kernel space we use a worker_id which is just an integer managed by the vhost driver and we allow userspace to create and free workers and then attach them to virtqueues at setup time. All review comments from the past reviews should be handled. If I didn't reply to a review comment, I agreed with the comment and should have handled it in this posting. Let me know if I missed one. Results: fio jobs1 2 4 8 12 16 -- 1 worker160k 488k - - - - worker per vq 160k 310k620k1182K1564K 2002k Notes: 0. This used a simple fio command: fio --filename=/dev/sdb --direct=1 --rw=randrw --bs=4k \ --ioengine=libaio --iodepth=128 --numjobs=$JOBS_ABOVE and I used a VM with 16 vCPUs and 16 virtqueues. 1. The patches were tested with LIO's emulate_pr=0 which drops the LIO PR lock use. This was a bottleneck at around 12 vqs/jobs. 2. Because we have a hard limit of 1024 cmds, if the num jobs * iodepth was greater than 1024, I would decrease iodepth. So 12 jobs used 85 cmds, and 16 used 64. 3. The perf issue above at 2 jobs is because when we only have 1 worker we execute more cmds per vhost_work due to all vqs funneling to one worker. I have 2 patches that fix this. One is just submit from the worker thread instead of kikcing off to a workqueue like how the vhost block patches do. I'll post this after the worker support is merged. I'm also working on patches to add back batching during lio completion and do polling on the submission side. We will still want the threading patches, because if we batch at the fio level plus use the vhost theading patches, we can see a big boost like below. So hopefully doing it at the kernel will allow apps to just work without having to be smart like fio. fio using io_uring and batching with the iodepth_batch* settings: fio jobs1 2 4 8 12 16 - 1 worker494k520k- - - - worker per vq 496k878k1542k 2436k 2304k 2590k V8: - Rebase. - Move comments about assumptions so it's in the body of the code instead of top of function so it's more clear. - Added patch for RCU support and swapping workers while works are running. V7: - Add more comments about assumptions. - Drop refcounting and just use an attachment_cnt variable, so there is no confusion about when a worker is freed. - Do a opt-in model, because vsiock has an issue where it can queue works before it's running and it doesn't even need multiple workers, so there is no point in chaning the driver or core code. - Add back get worker ioctl. - Broke up last patches to make it easier to read. V6: - Rebase against vhost_task patchset. - Used xa instead of idr. V5: - Rebase against user_worker patchset. - Rebase against flush patchset. - Redo vhost-scsi tmf flush handling so it doesn't access vq->worker. V4: - fix vhost-sock VSOCK_VQ_RX use. - name functions called directly by ioctl cmd's to match the ioctl cmd. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v8 09/17] vhost_scsi: make SCSI cmd completion per vq
This patch separates the scsi cmd completion code paths so we can complete cmds based on their vq instead of having all cmds complete on the same worker/CPU. This will be useful with the next patches that allow us to create mulitple worker threads and bind them to different vqs, and we can have completions running on different threads/CPUs. Signed-off-by: Mike Christie Reviewed-by: Stefan Hajnoczi --- drivers/vhost/scsi.c | 56 1 file changed, 26 insertions(+), 30 deletions(-) diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c index bb10fa4bb4f6..a77c53bb035a 100644 --- a/drivers/vhost/scsi.c +++ b/drivers/vhost/scsi.c @@ -167,6 +167,7 @@ MODULE_PARM_DESC(max_io_vqs, "Set the max number of IO virtqueues a vhost scsi d struct vhost_scsi_virtqueue { struct vhost_virtqueue vq; + struct vhost_scsi *vs; /* * Reference counting for inflight reqs, used for flush operation. At * each time, one reference tracks new commands submitted, while we @@ -181,6 +182,9 @@ struct vhost_scsi_virtqueue { struct vhost_scsi_cmd *scsi_cmds; struct sbitmap scsi_tags; int max_cmds; + + struct vhost_work completion_work; + struct llist_head completion_list; }; struct vhost_scsi { @@ -190,12 +194,8 @@ struct vhost_scsi { struct vhost_dev dev; struct vhost_scsi_virtqueue *vqs; - unsigned long *compl_bitmap; struct vhost_scsi_inflight **old_inflight; - struct vhost_work vs_completion_work; /* cmd completion work item */ - struct llist_head vs_completion_list; /* cmd completion queue */ - struct vhost_work vs_event_work; /* evt injection work item */ struct llist_head vs_event_list; /* evt injection queue */ @@ -358,10 +358,11 @@ static void vhost_scsi_release_cmd(struct se_cmd *se_cmd) } else { struct vhost_scsi_cmd *cmd = container_of(se_cmd, struct vhost_scsi_cmd, tvc_se_cmd); - struct vhost_scsi *vs = cmd->tvc_vhost; + struct vhost_scsi_virtqueue *svq = container_of(cmd->tvc_vq, + struct vhost_scsi_virtqueue, vq); - llist_add(&cmd->tvc_completion_list, &vs->vs_completion_list); - vhost_work_queue(&vs->dev, &vs->vs_completion_work); + llist_add(&cmd->tvc_completion_list, &svq->completion_list); + vhost_vq_work_queue(&svq->vq, &svq->completion_work); } } @@ -509,17 +510,17 @@ static void vhost_scsi_evt_work(struct vhost_work *work) */ static void vhost_scsi_complete_cmd_work(struct vhost_work *work) { - struct vhost_scsi *vs = container_of(work, struct vhost_scsi, - vs_completion_work); + struct vhost_scsi_virtqueue *svq = container_of(work, + struct vhost_scsi_virtqueue, completion_work); struct virtio_scsi_cmd_resp v_rsp; struct vhost_scsi_cmd *cmd, *t; struct llist_node *llnode; struct se_cmd *se_cmd; struct iov_iter iov_iter; - int ret, vq; + bool signal = false; + int ret; - bitmap_zero(vs->compl_bitmap, vs->dev.nvqs); - llnode = llist_del_all(&vs->vs_completion_list); + llnode = llist_del_all(&svq->completion_list); llist_for_each_entry_safe(cmd, t, llnode, tvc_completion_list) { se_cmd = &cmd->tvc_se_cmd; @@ -539,21 +540,17 @@ static void vhost_scsi_complete_cmd_work(struct vhost_work *work) cmd->tvc_in_iovs, sizeof(v_rsp)); ret = copy_to_iter(&v_rsp, sizeof(v_rsp), &iov_iter); if (likely(ret == sizeof(v_rsp))) { - struct vhost_scsi_virtqueue *q; + signal = true; + vhost_add_used(cmd->tvc_vq, cmd->tvc_vq_desc, 0); - q = container_of(cmd->tvc_vq, struct vhost_scsi_virtqueue, vq); - vq = q - vs->vqs; - __set_bit(vq, vs->compl_bitmap); } else pr_err("Faulted on virtio_scsi_cmd_resp\n"); vhost_scsi_release_cmd_res(se_cmd); } - vq = -1; - while ((vq = find_next_bit(vs->compl_bitmap, vs->dev.nvqs, vq + 1)) - < vs->dev.nvqs) - vhost_signal(&vs->dev, &vs->vqs[vq].vq); + if (signal) + vhost_signal(&svq->vs->dev, &svq->vq); } static struct vhost_scsi_cmd * @@ -1770,6 +1767,7 @@ static int vhost_scsi_set_features(struct vhost_scsi *vs, u64 features) static int vhost_scsi_open(struct inode *inode, struct file *f) { + struct vhost_scsi_virtqueue *svq;
[PATCH v8 03/17] vhost: add vhost_worker pointer to vhost_virtqueue
This patchset allows userspace to map vqs to different workers. This patch adds a worker pointer to the vq so in later patches in this set we can queue/flush specific vqs and their workers. Signed-off-by: Mike Christie --- drivers/vhost/vhost.c | 21 ++--- drivers/vhost/vhost.h | 1 + 2 files changed, 15 insertions(+), 7 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index dfd96cf6a152..c78c15af97d3 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -479,6 +479,7 @@ void vhost_dev_init(struct vhost_dev *dev, vq->log = NULL; vq->indirect = NULL; vq->heads = NULL; + vq->worker = NULL; vq->dev = dev; mutex_init(&vq->mutex); vhost_vq_reset(dev, vq); @@ -545,7 +546,7 @@ static void vhost_worker_free(struct vhost_dev *dev) dev->worker = NULL; } -static int vhost_worker_create(struct vhost_dev *dev) +static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev) { struct vhost_worker *worker; struct vhost_task *vtsk; @@ -553,7 +554,7 @@ static int vhost_worker_create(struct vhost_dev *dev) worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT); if (!worker) - return -ENOMEM; + return NULL; snprintf(name, sizeof(name), "vhost-%d", current->pid); @@ -572,17 +573,18 @@ static int vhost_worker_create(struct vhost_dev *dev) dev->worker = worker; vhost_task_start(vtsk); - return 0; + return worker; free_worker: kfree(worker); - return -ENOMEM; + return NULL; } /* Caller should have device mutex */ long vhost_dev_set_owner(struct vhost_dev *dev) { - int err; + struct vhost_worker *worker; + int err, i; /* Is there an owner already? */ if (vhost_dev_has_owner(dev)) { @@ -603,9 +605,14 @@ long vhost_dev_set_owner(struct vhost_dev *dev) * below since we don't have to worry about vsock queueing * while we free the worker. */ - err = vhost_worker_create(dev); - if (err) + worker = vhost_worker_create(dev); + if (!worker) { + err = -ENOMEM; goto err_worker; + } + + for (i = 0; i < dev->nvqs; i++) + dev->vqs[i]->worker = worker; } return 0; diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index cb872cc4157a..206617edb2a9 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -74,6 +74,7 @@ struct vhost_vring_call { /* The virtqueue structure describes a queue attached to a device. */ struct vhost_virtqueue { struct vhost_dev *dev; + struct vhost_worker *worker; /* The actual ring of buffers. */ struct mutex mutex; -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v8 06/17] vhost: take worker or vq for flushing
This patch has the core work flush function take a worker. When we support multiple workers we can then flush each worker during device removal, stoppage, etc. It also adds a helper to flush specific virtqueues, so vhost-scsi can flush IO vqs from it's ctl vq. Signed-off-by: Mike Christie --- drivers/vhost/vhost.c | 15 +-- drivers/vhost/vhost.h | 1 + 2 files changed, 14 insertions(+), 2 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 16630c19bcc2..d212cee064d6 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -265,16 +265,27 @@ bool vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work) } EXPORT_SYMBOL_GPL(vhost_vq_work_queue); -void vhost_dev_flush(struct vhost_dev *dev) +static void vhost_worker_flush(struct vhost_worker *worker) { struct vhost_flush_struct flush; init_completion(&flush.wait_event); vhost_work_init(&flush.work, vhost_flush_work); - if (vhost_work_queue(dev, &flush.work)) + if (vhost_worker_queue(worker, &flush.work)) wait_for_completion(&flush.wait_event); } + +void vhost_vq_flush(struct vhost_virtqueue *vq) +{ + vhost_worker_flush(vq->worker); +} +EXPORT_SYMBOL_GPL(vhost_vq_flush); + +void vhost_dev_flush(struct vhost_dev *dev) +{ + vhost_worker_flush(dev->worker); +} EXPORT_SYMBOL_GPL(vhost_dev_flush); /* A lockless hint for busy polling code to exit the loop */ diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index 6a1ae8ae9c7d..ac1f4924e548 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -198,6 +198,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *, struct vhost_log *log, unsigned int *log_num); void vhost_discard_vq_desc(struct vhost_virtqueue *, int n); +void vhost_vq_flush(struct vhost_virtqueue *vq); bool vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work); bool vhost_vq_has_work(struct vhost_virtqueue *vq); bool vhost_vq_is_setup(struct vhost_virtqueue *vq); -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v8 05/17] vhost: take worker or vq instead of dev for queueing
This patch has the core work queueing function take a worker for when we support multiple workers. It also adds a helper that takes a vq during queueing so modules can control which vq/worker to queue work on. This temp leaves vhost_work_queue. It will be removed when the drivers are converted in the next patches. Signed-off-by: Mike Christie --- drivers/vhost/vhost.c | 44 +++ drivers/vhost/vhost.h | 1 + 2 files changed, 29 insertions(+), 16 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index a832ca692eb1..16630c19bcc2 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -231,21 +231,10 @@ void vhost_poll_stop(struct vhost_poll *poll) } EXPORT_SYMBOL_GPL(vhost_poll_stop); -void vhost_dev_flush(struct vhost_dev *dev) +static bool vhost_worker_queue(struct vhost_worker *worker, + struct vhost_work *work) { - struct vhost_flush_struct flush; - - init_completion(&flush.wait_event); - vhost_work_init(&flush.work, vhost_flush_work); - - if (vhost_work_queue(dev, &flush.work)) - wait_for_completion(&flush.wait_event); -} -EXPORT_SYMBOL_GPL(vhost_dev_flush); - -bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work) -{ - if (!dev->worker) + if (!worker) return false; /* * vsock can queue while we do a VHOST_SET_OWNER, so we have a smp_wmb @@ -257,14 +246,37 @@ bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work) * sure it was not in the list. * test_and_set_bit() implies a memory barrier. */ - llist_add(&work->node, &dev->worker->work_list); - vhost_task_wake(dev->worker->vtsk); + llist_add(&work->node, &worker->work_list); + vhost_task_wake(worker->vtsk); } return true; } + +bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work) +{ + return vhost_worker_queue(dev->worker, work); +} EXPORT_SYMBOL_GPL(vhost_work_queue); +bool vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work) +{ + return vhost_worker_queue(vq->worker, work); +} +EXPORT_SYMBOL_GPL(vhost_vq_work_queue); + +void vhost_dev_flush(struct vhost_dev *dev) +{ + struct vhost_flush_struct flush; + + init_completion(&flush.wait_event); + vhost_work_init(&flush.work, vhost_flush_work); + + if (vhost_work_queue(dev, &flush.work)) + wait_for_completion(&flush.wait_event); +} +EXPORT_SYMBOL_GPL(vhost_dev_flush); + /* A lockless hint for busy polling code to exit the loop */ bool vhost_vq_has_work(struct vhost_virtqueue *vq) { diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index 37c183b37c42..6a1ae8ae9c7d 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -198,6 +198,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *, struct vhost_log *log, unsigned int *log_num); void vhost_discard_vq_desc(struct vhost_virtqueue *, int n); +bool vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work); bool vhost_vq_has_work(struct vhost_virtqueue *vq); bool vhost_vq_is_setup(struct vhost_virtqueue *vq); int vhost_vq_init_access(struct vhost_virtqueue *); -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v8 01/17] vhost: create worker at end of vhost_dev_set_owner
vsock can start queueing work after VHOST_VSOCK_SET_GUEST_CID, so after we have called vhost_worker_create it can be calling vhost_work_queue and trying to access the vhost worker/task. If vhost_dev_alloc_iovecs fails, then vhost_worker_free could free the worker/task from under vsock. This moves vhost_worker_create to the end of vhost_dev_set_owner where we know we can no longer fail in that path. If it fails after the VHOST_SET_OWNER and userspace closes the device, then the normal vsock release handling will do the right thing. Signed-off-by: Mike Christie --- drivers/vhost/vhost.c | 19 +-- 1 file changed, 13 insertions(+), 6 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 60c9ebd629dd..82966ffb4a5c 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -572,20 +572,27 @@ long vhost_dev_set_owner(struct vhost_dev *dev) vhost_attach_mm(dev); + err = vhost_dev_alloc_iovecs(dev); + if (err) + goto err_iovecs; + if (dev->use_worker) { + /* +* This should be done last, because vsock can queue work +* before VHOST_SET_OWNER so it simplifies the failure path +* below since we don't have to worry about vsock queueing +* while we free the worker. +*/ err = vhost_worker_create(dev); if (err) goto err_worker; } - err = vhost_dev_alloc_iovecs(dev); - if (err) - goto err_iovecs; - return 0; -err_iovecs: - vhost_worker_free(dev); + err_worker: + vhost_dev_free_iovecs(dev); +err_iovecs: vhost_detach_mm(dev); err_mm: return err; -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: vhost: Fix vhost_task regressions in 6.4-rc
On 6/7/23 3:22 PM, Michael S. Tsirkin wrote: > On Wed, Jun 07, 2023 at 02:23:36PM -0500, Mike Christie wrote: >> The following patches were made over Linus's tree which contains a >> vhost change missing in mst's vhost branch. These patches fix two >> issues caused by the vhost_task patches: >> 1. I was setting dev->worker too early and this caused crashes when >> vsock would queue work before VHOST_SET_OWNER. >> >> 2. The patch that Linus's tree contains which vhost does not yet >> have converted vhost_tasks to use CLONE_THREAD. That required a >> change to how we set the task state, but I completely messed up >> and moved when we set ourself to interruptible too late. >> > > Right. and that's in rc5 yes? > Yes. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
vhost: Fix vhost_task regressions in 6.4-rc
The following patches were made over Linus's tree which contains a vhost change missing in mst's vhost branch. These patches fix two issues caused by the vhost_task patches: 1. I was setting dev->worker too early and this caused crashes when vsock would queue work before VHOST_SET_OWNER. 2. The patch that Linus's tree contains which vhost does not yet have converted vhost_tasks to use CLONE_THREAD. That required a change to how we set the task state, but I completely messed up and moved when we set ourself to interruptible too late. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 2/2] vhost: Fix worker hangs due to missed wake up calls
We can race where we have added work to the work_list, but vhost_task_fn has passed that check but not yet set us into TASK_INTERRUPTIBLE. wake_up_process will see us in TASK_RUNNING and just return. This bug was intoduced in commit f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression") when I moved the setting of TASK_INTERRUPTIBLE to simplfy the code and avoid get_signal from logging warnings about being in the wrong state. This moves the setting of TASK_INTERRUPTIBLE back to before we test if we need to stop the task to avoid a possible race there as well. We then have vhost_worker set TASK_RUNNING if it finds work similar to before. Fixes: f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps regression") Signed-off-by: Mike Christie --- drivers/vhost/vhost.c | 2 ++ kernel/vhost_task.c | 16 +--- 2 files changed, 11 insertions(+), 7 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 7a9f93eae225..b2722d29e069 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -341,6 +341,8 @@ static bool vhost_worker(void *data) node = llist_del_all(&worker->work_list); if (node) { + __set_current_state(TASK_RUNNING); + node = llist_reverse_order(node); /* make sure flag is seen after deletion */ smp_wmb(); diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c index f80d5c51ae67..da35e5b7f047 100644 --- a/kernel/vhost_task.c +++ b/kernel/vhost_task.c @@ -28,10 +28,6 @@ static int vhost_task_fn(void *data) for (;;) { bool did_work; - /* mb paired w/ vhost_task_stop */ - if (test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags)) - break; - if (!dead && signal_pending(current)) { struct ksignal ksig; /* @@ -48,11 +44,17 @@ static int vhost_task_fn(void *data) clear_thread_flag(TIF_SIGPENDING); } + /* mb paired w/ vhost_task_stop */ + set_current_state(TASK_INTERRUPTIBLE); + + if (test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags)) { + __set_current_state(TASK_RUNNING); + break; + } + did_work = vtsk->fn(vtsk->data); - if (!did_work) { - set_current_state(TASK_INTERRUPTIBLE); + if (!did_work) schedule(); - } } complete(&vtsk->exited); -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 1/2] vhost: Fix crash during early vhost_transport_send_pkt calls
If userspace does VHOST_VSOCK_SET_GUEST_CID before VHOST_SET_OWNER we can race where: 1. thread0 calls vhost_transport_send_pkt -> vhost_work_queue 2. thread1 does VHOST_SET_OWNER which calls vhost_worker_create. 3. vhost_worker_create will set the dev->worker pointer before setting the worker->vtsk pointer. 4. thread0's vhost_work_queue will see the dev->worker pointer is set and try to call vhost_task_wake using not yet set worker->vtsk pointer. 5. We then crash since vtsk is NULL. Before commit 6e890c5d5021 ("vhost: use vhost_tasks for worker threads"), we only had the worker pointer so we could just check it to see if VHOST_SET_OWNER has been done. After that commit we have the vhost_worker and vhost_task pointer, so we can now hit the bug above. This patch embeds the vhost_worker in the vhost_dev and moves the work list initialization back to vhost_dev_init, so we can just check the worker.vtsk pointer to check if VHOST_SET_OWNER has been done like before. Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads") Signed-off-by: Mike Christie --- drivers/vhost/vhost.c | 50 +++ drivers/vhost/vhost.h | 2 +- 2 files changed, 18 insertions(+), 34 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 074273020849..7a9f93eae225 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -235,7 +235,7 @@ void vhost_dev_flush(struct vhost_dev *dev) { struct vhost_flush_struct flush; - if (dev->worker) { + if (dev->worker.vtsk) { init_completion(&flush.wait_event); vhost_work_init(&flush.work, vhost_flush_work); @@ -247,7 +247,7 @@ EXPORT_SYMBOL_GPL(vhost_dev_flush); void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work) { - if (!dev->worker) + if (!dev->worker.vtsk) return; if (!test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) { @@ -255,8 +255,8 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work) * sure it was not in the list. * test_and_set_bit() implies a memory barrier. */ - llist_add(&work->node, &dev->worker->work_list); - vhost_task_wake(dev->worker->vtsk); + llist_add(&work->node, &dev->worker.work_list); + vhost_task_wake(dev->worker.vtsk); } } EXPORT_SYMBOL_GPL(vhost_work_queue); @@ -264,7 +264,7 @@ EXPORT_SYMBOL_GPL(vhost_work_queue); /* A lockless hint for busy polling code to exit the loop */ bool vhost_has_work(struct vhost_dev *dev) { - return dev->worker && !llist_empty(&dev->worker->work_list); + return !llist_empty(&dev->worker.work_list); } EXPORT_SYMBOL_GPL(vhost_has_work); @@ -456,7 +456,8 @@ void vhost_dev_init(struct vhost_dev *dev, dev->umem = NULL; dev->iotlb = NULL; dev->mm = NULL; - dev->worker = NULL; + memset(&dev->worker, 0, sizeof(dev->worker)); + init_llist_head(&dev->worker.work_list); dev->iov_limit = iov_limit; dev->weight = weight; dev->byte_weight = byte_weight; @@ -530,47 +531,30 @@ static void vhost_detach_mm(struct vhost_dev *dev) static void vhost_worker_free(struct vhost_dev *dev) { - struct vhost_worker *worker = dev->worker; - - if (!worker) + if (!dev->worker.vtsk) return; - dev->worker = NULL; - WARN_ON(!llist_empty(&worker->work_list)); - vhost_task_stop(worker->vtsk); - kfree(worker); + WARN_ON(!llist_empty(&dev->worker.work_list)); + vhost_task_stop(dev->worker.vtsk); + dev->worker.kcov_handle = 0; + dev->worker.vtsk = NULL; } static int vhost_worker_create(struct vhost_dev *dev) { - struct vhost_worker *worker; struct vhost_task *vtsk; char name[TASK_COMM_LEN]; - int ret; - - worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT); - if (!worker) - return -ENOMEM; - dev->worker = worker; - worker->kcov_handle = kcov_common_handle(); - init_llist_head(&worker->work_list); snprintf(name, sizeof(name), "vhost-%d", current->pid); - vtsk = vhost_task_create(vhost_worker, worker, name); - if (!vtsk) { - ret = -ENOMEM; - goto free_worker; - } + vtsk = vhost_task_create(vhost_worker, &dev->worker, name); + if (!vtsk) + return -ENOMEM; - worker->vtsk = vtsk; + dev->worker.kcov_handle = kcov_common_handle(); + dev->worker.vtsk = vtsk; vhost_task_start(vtsk); return 0; - -free_worker: - kfree(worker); - dev->worker
Re: [PATCH 1/1] vhost: Fix crash during early vhost_transport_send_pkt calls
On 6/6/23 2:22 PM, Michael S. Tsirkin wrote: > On Tue, Jun 06, 2023 at 12:19:10PM -0500, Mike Christie wrote: >> On 6/6/23 4:49 AM, Stefano Garzarella wrote: >>> On Mon, Jun 05, 2023 at 01:57:30PM -0500, Mike Christie wrote: >>>> If userspace does VHOST_VSOCK_SET_GUEST_CID before VHOST_SET_OWNER we >>>> can race where: >>>> 1. thread0 calls vhost_transport_send_pkt -> vhost_work_queue >>>> 2. thread1 does VHOST_SET_OWNER which calls vhost_worker_create. >>>> 3. vhost_worker_create will set the dev->worker pointer before setting >>>> the worker->vtsk pointer. >>>> 4. thread0's vhost_work_queue will see the dev->worker pointer is >>>> set and try to call vhost_task_wake using not yet set worker->vtsk >>>> pointer. >>>> 5. We then crash since vtsk is NULL. >>>> >>>> Before commit 6e890c5d5021 ("vhost: use vhost_tasks for worker >>>> threads"), we only had the worker pointer so we could just check it to >>>> see if VHOST_SET_OWNER has been done. After that commit we have the >>>> vhost_worker and vhost_task pointers, so we can now hit the bug above. >>>> >>>> This patch embeds the vhost_worker in the vhost_dev, so we can just >>>> check the worker.vtsk pointer to check if VHOST_SET_OWNER has been done >>>> like before. >>>> >>>> Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads") >>> >>> We should add: >>> >>> Reported-by: syzbot+d0d442c22fa8db45f...@syzkaller.appspotmail.com >> >> >> Ok. Will do. >> >> >>>> - } >>>> + vtsk = vhost_task_create(vhost_worker, &dev->worker, name); >>>> + if (!vtsk) >>>> + return -ENOMEM; >>>> >>>> - worker->vtsk = vtsk; >>>> + dev->worker.kcov_handle = kcov_common_handle(); >>>> + dev->worker.vtsk = vtsk; >>> >>> vhost_work_queue() is called by vhost_transport_send_pkt() without >>> holding vhost_dev.mutex (like vhost_poll_queue() in several places). >>> >>> If vhost_work_queue() finds dev->worker.vtsk not NULL, how can we >>> be sure that for example `work_list` has been initialized? >>> >>> Maybe I'm overthinking since we didn't have this problem before or the >>> race is really short that it never happened. >> >> Yeah, I dropped the READ/WRITE_ONCE use because I didn't think we needed >> it for the vhost_vsock_start case, and for the case you mentioned above, I >> wondered the same thing as you but was not sure so I added the same >> behavior as before. When I read memory-barriers.txt, it sounds like we've >> been getting lucky. > > Yea READ/WRITE_ONCE is one of these things. Once you start adding > them it's hard to stop, you begin to think it's needed everywhere. > To actually know you need a language lawyer (READ/WRITE_ONCE > is a compiler thing not a CPU thing). I am worried about the compiler reordering when init_llist_head sets the llist->first pointer to NULL and when we set the vtsk pointer. If they are in reverse order, then vhost_work_queue could detect there is a vtsk, but then after we add work to the work_list we could set llist->first to NULL. However, to avoid this type of thing we need to have init_llist_head use WRITE_ONCE right as well as vhost_worker_create when it sets vtsk in the patch. One other thing is that we could just move the init_llist_head to vhost_dev_init like it was before. We should be safe like before. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
On 6/6/23 2:39 PM, Oleg Nesterov wrote: > On 06/06, Mike Christie wrote: >> >> On 6/6/23 7:16 AM, Oleg Nesterov wrote: >>> On 06/05, Mike Christie wrote: >>> >>>> So it works like if we were using a kthread still: >>>> >>>> 1. Userapce thread0 opens /dev/vhost-$something. >>>> 2. thread0 does VHOST_SET_OWNER ioctl. This calls vhost_task_create() to >>>> create the task_struct which runs the vhost_worker() function which handles >>>> the work->fns. >>>> 3. If userspace now does a SIGKILL or just exits without doing a close() on >>>> /dev/vhost-$something, then when thread0 does exit_files() that will do the >>>> fput that does vhost-$something's file_operations->release. >>> >>> So, at least in this simple case vhost_worker() can just exit after SIGKILL, >>> and thread0 can flush the outstanding commands when it calls >>> vhost_dev_flush() >>> rather than wait for vhost_worker(). >>> >>> Right? >> >> With the current code, the answer is no. We would hang like I mentioned here: >> >> https://lore.kernel.org/lkml/ae250076-7d55-c407-1066-86b37014c...@oracle.com/ > > If only I could fully understand this email ;) > > Could you spell to explain why this can't work (again, in this simple case) ? > > My current (and I know, very poor) understanding is that .release() should > roughly do the following: > > 1. Ensure that vhost_work_queue() can't add the new callbacks > > 2. Call vhost_dev_flush() to ensure that worker->work_list is empty > The problem is what do we do in the work->fn. What you wrote is correct for cleaning up the work_list. However, the lower level vhost drivers, like vhost-scsi, will do something like: async_submit_request_to_storage/net_layer() from their work->fn. The submission is async so when the request completes it calls some callbacks that call into the vhost driver and vhost layer. For vhost-scsi the call back will run vhost_queue_work so we can complete the request from the vhost_task. So if we've already run the work->fn then we need to add code to handle the completion of the request we submitted. We need: 1. vhost_queue_work needs some code to detect when the vhost_task has exited so we don't do vhost_task_wake on a freed task. I was saying for this, we can sprinkle some RCU in there and in the code paths we cleanup the vhost_task. 2. The next problem is that if the vhost_task is going to just loop over the work_list and kill those works before it exits (or if we do it from the vhost_dev_flush function), then we still have handle those async requests that got kicked off to some other layer that are going to eventually complete and try to call vhost_work_queue. With #1, we can detect when the vhost_task is no longer usable, so we then need to modify the drivers to detect that and instead of trying to execute like normal where they queue the work, they just take their failure paths and free resources. So the release cabllback was doing 2 things: 1. Flushing the work_list 2. Waiting on the those request completions And so I was saying before I'm trying to finish up handling #2. I hit some hiccups though because it turns out there is at least one case where we don't have a vhost_task but we don't want to fail. It's just a matter of coding it though. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/1] vhost: Fix crash during early vhost_transport_send_pkt calls
On 6/6/23 4:49 AM, Stefano Garzarella wrote: > On Mon, Jun 05, 2023 at 01:57:30PM -0500, Mike Christie wrote: >> If userspace does VHOST_VSOCK_SET_GUEST_CID before VHOST_SET_OWNER we >> can race where: >> 1. thread0 calls vhost_transport_send_pkt -> vhost_work_queue >> 2. thread1 does VHOST_SET_OWNER which calls vhost_worker_create. >> 3. vhost_worker_create will set the dev->worker pointer before setting >> the worker->vtsk pointer. >> 4. thread0's vhost_work_queue will see the dev->worker pointer is >> set and try to call vhost_task_wake using not yet set worker->vtsk >> pointer. >> 5. We then crash since vtsk is NULL. >> >> Before commit 6e890c5d5021 ("vhost: use vhost_tasks for worker >> threads"), we only had the worker pointer so we could just check it to >> see if VHOST_SET_OWNER has been done. After that commit we have the >> vhost_worker and vhost_task pointers, so we can now hit the bug above. >> >> This patch embeds the vhost_worker in the vhost_dev, so we can just >> check the worker.vtsk pointer to check if VHOST_SET_OWNER has been done >> like before. >> >> Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads") > > We should add: > > Reported-by: syzbot+d0d442c22fa8db45f...@syzkaller.appspotmail.com Ok. Will do. >> - } >> + vtsk = vhost_task_create(vhost_worker, &dev->worker, name); >> + if (!vtsk) >> + return -ENOMEM; >> >> - worker->vtsk = vtsk; >> + dev->worker.kcov_handle = kcov_common_handle(); >> + dev->worker.vtsk = vtsk; > > vhost_work_queue() is called by vhost_transport_send_pkt() without > holding vhost_dev.mutex (like vhost_poll_queue() in several places). > > If vhost_work_queue() finds dev->worker.vtsk not NULL, how can we > be sure that for example `work_list` has been initialized? > > Maybe I'm overthinking since we didn't have this problem before or the > race is really short that it never happened. Yeah, I dropped the READ/WRITE_ONCE use because I didn't think we needed it for the vhost_vsock_start case, and for the case you mentioned above, I wondered the same thing as you but was not sure so I added the same behavior as before. When I read memory-barriers.txt, it sounds like we've been getting lucky. I'll add back the READ/WRITE_ONCE for vtsk access since that's what we are keying off as signaling that the worker is ready to be used. I didn't see any type of perf hit when using it, and from the memory-barriers.txt doc it sounds like that's what we should be doing. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
On 6/6/23 7:16 AM, Oleg Nesterov wrote: > On 06/05, Mike Christie wrote: >> >> On 6/5/23 10:10 AM, Oleg Nesterov wrote: >>> On 06/03, michael.chris...@oracle.com wrote: >>>> >>>> On 6/2/23 11:15 PM, Eric W. Biederman wrote: >>>> The problem is that as part of the flush the drivers/vhost/scsi.c code >>>> will wait for outstanding commands, because we can't free the device and >>>> it's resources before the commands complete or we will hit the accessing >>>> freed memory bug. >>> >>> ignoring send-fd/clone issues, can we assume that the final fput/release >>> should always come from vhost_worker's sub-thread (which shares mm/etc) ? >> >> I think I'm misunderstanding the sub-thread term. >> >> - Is it the task_struct's context that we did the >> kernel/vhost_taskc.c:vhost_task_create() from? Below it would be the >> thread we did VHOST_SET_OWNER from. > > Yes, > >> So it works like if we were using a kthread still: >> >> 1. Userapce thread0 opens /dev/vhost-$something. >> 2. thread0 does VHOST_SET_OWNER ioctl. This calls vhost_task_create() to >> create the task_struct which runs the vhost_worker() function which handles >> the work->fns. >> 3. If userspace now does a SIGKILL or just exits without doing a close() on >> /dev/vhost-$something, then when thread0 does exit_files() that will do the >> fput that does vhost-$something's file_operations->release. > > So, at least in this simple case vhost_worker() can just exit after SIGKILL, > and thread0 can flush the outstanding commands when it calls vhost_dev_flush() > rather than wait for vhost_worker(). > > Right? With the current code, the answer is no. We would hang like I mentioned here: https://lore.kernel.org/lkml/ae250076-7d55-c407-1066-86b37014c...@oracle.com/ We need to add code like I mentioned in that reply because we don't have a way to call into the layers below us to flush those commands. We need more like an abort and don't call back into us type of operation. Or, I'm just trying to add a check where we detect what happened then instead of trying to use the vhost_task we try to complete in the context the lower level completes us in. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 1/1] vhost: Fix crash during early vhost_transport_send_pkt calls
If userspace does VHOST_VSOCK_SET_GUEST_CID before VHOST_SET_OWNER we can race where: 1. thread0 calls vhost_transport_send_pkt -> vhost_work_queue 2. thread1 does VHOST_SET_OWNER which calls vhost_worker_create. 3. vhost_worker_create will set the dev->worker pointer before setting the worker->vtsk pointer. 4. thread0's vhost_work_queue will see the dev->worker pointer is set and try to call vhost_task_wake using not yet set worker->vtsk pointer. 5. We then crash since vtsk is NULL. Before commit 6e890c5d5021 ("vhost: use vhost_tasks for worker threads"), we only had the worker pointer so we could just check it to see if VHOST_SET_OWNER has been done. After that commit we have the vhost_worker and vhost_task pointers, so we can now hit the bug above. This patch embeds the vhost_worker in the vhost_dev, so we can just check the worker.vtsk pointer to check if VHOST_SET_OWNER has been done like before. Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads") Signed-off-by: Mike Christie --- drivers/vhost/vhost.c | 50 +++ drivers/vhost/vhost.h | 2 +- 2 files changed, 18 insertions(+), 34 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 074273020849..0ad9fea7c170 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -235,7 +235,7 @@ void vhost_dev_flush(struct vhost_dev *dev) { struct vhost_flush_struct flush; - if (dev->worker) { + if (dev->worker.vtsk) { init_completion(&flush.wait_event); vhost_work_init(&flush.work, vhost_flush_work); @@ -247,7 +247,7 @@ EXPORT_SYMBOL_GPL(vhost_dev_flush); void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work) { - if (!dev->worker) + if (!dev->worker.vtsk) return; if (!test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) { @@ -255,8 +255,8 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work) * sure it was not in the list. * test_and_set_bit() implies a memory barrier. */ - llist_add(&work->node, &dev->worker->work_list); - vhost_task_wake(dev->worker->vtsk); + llist_add(&work->node, &dev->worker.work_list); + vhost_task_wake(dev->worker.vtsk); } } EXPORT_SYMBOL_GPL(vhost_work_queue); @@ -264,7 +264,7 @@ EXPORT_SYMBOL_GPL(vhost_work_queue); /* A lockless hint for busy polling code to exit the loop */ bool vhost_has_work(struct vhost_dev *dev) { - return dev->worker && !llist_empty(&dev->worker->work_list); + return !llist_empty(&dev->worker.work_list); } EXPORT_SYMBOL_GPL(vhost_has_work); @@ -456,7 +456,7 @@ void vhost_dev_init(struct vhost_dev *dev, dev->umem = NULL; dev->iotlb = NULL; dev->mm = NULL; - dev->worker = NULL; + memset(&dev->worker, 0, sizeof(dev->worker)); dev->iov_limit = iov_limit; dev->weight = weight; dev->byte_weight = byte_weight; @@ -530,47 +530,31 @@ static void vhost_detach_mm(struct vhost_dev *dev) static void vhost_worker_free(struct vhost_dev *dev) { - struct vhost_worker *worker = dev->worker; - - if (!worker) + if (!dev->worker.vtsk) return; - dev->worker = NULL; - WARN_ON(!llist_empty(&worker->work_list)); - vhost_task_stop(worker->vtsk); - kfree(worker); + WARN_ON(!llist_empty(&dev->worker.work_list)); + vhost_task_stop(dev->worker.vtsk); + dev->worker.kcov_handle = 0; + dev->worker.vtsk = NULL; } static int vhost_worker_create(struct vhost_dev *dev) { - struct vhost_worker *worker; struct vhost_task *vtsk; char name[TASK_COMM_LEN]; - int ret; - - worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT); - if (!worker) - return -ENOMEM; - dev->worker = worker; - worker->kcov_handle = kcov_common_handle(); - init_llist_head(&worker->work_list); + init_llist_head(&dev->worker.work_list); snprintf(name, sizeof(name), "vhost-%d", current->pid); - vtsk = vhost_task_create(vhost_worker, worker, name); - if (!vtsk) { - ret = -ENOMEM; - goto free_worker; - } + vtsk = vhost_task_create(vhost_worker, &dev->worker, name); + if (!vtsk) + return -ENOMEM; - worker->vtsk = vtsk; + dev->worker.kcov_handle = kcov_common_handle(); + dev->worker.vtsk = vtsk; vhost_task_start(vtsk); return 0; - -free_worker: - kfree(worker); - dev->worker = NULL; - return ret; } /* Caller shoul
Re: [CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
On 6/5/23 10:10 AM, Oleg Nesterov wrote: > On 06/03, michael.chris...@oracle.com wrote: >> >> On 6/2/23 11:15 PM, Eric W. Biederman wrote: >> The problem is that as part of the flush the drivers/vhost/scsi.c code >> will wait for outstanding commands, because we can't free the device and >> it's resources before the commands complete or we will hit the accessing >> freed memory bug. > > ignoring send-fd/clone issues, can we assume that the final fput/release > should always come from vhost_worker's sub-thread (which shares mm/etc) ? I think I'm misunderstanding the sub-thread term. - Is it the task_struct's context that we did the kernel/vhost_taskc.c:vhost_task_create() from? Below it would be the thread we did VHOST_SET_OWNER from. If so, then yes. - Is it the task_struct that gets created by kernel/vhost_taskc.c:vhost_task_create()? If so, then the answer is no. vhost_task_create has set the no_files arg on kernel_clone_args, so copy_files() sets task_struct->files to NULL and we don't clone or dup the files. So it works like if we were using a kthread still: 1. Userapce thread0 opens /dev/vhost-$something. 2. thread0 does VHOST_SET_OWNER ioctl. This calls vhost_task_create() to create the task_struct which runs the vhost_worker() function which handles the work->fns. 3. If userspace now does a SIGKILL or just exits without doing a close() on /dev/vhost-$something, then when thread0 does exit_files() that will do the fput that does vhost-$something's file_operations->release. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 1/1] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
When switching from kthreads to vhost_tasks two bugs were added: 1. The vhost worker tasks's now show up as processes so scripts doing ps or ps a would not incorrectly detect the vhost task as another process. 2. kthreads disabled freeze by setting PF_NOFREEZE, but vhost tasks's didn't disable or add support for them. To fix both bugs, this switches the vhost task to be thread in the process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that SIGKILL/STOP support is required because CLONE_THREAD requires CLONE_SIGHAND which requires those 2 signals to be supported. This is a modified version of the patch written by Mike Christie which was a modified version of patch originally written by Linus. Much of what depended upon PF_IO_WORKER now depends on PF_USER_WORKER. Including ignoring signals, setting up the register state, and having get_signal return instead of calling do_group_exit. Tidied up the vhost_task abstraction so that the definition of vhost_task only needs to be visible inside of vhost_task.c. Making it easier to review the code and tell what needs to be done where. As part of this the main loop has been moved from vhost_worker into vhost_task_fn. vhost_worker now returns true if work was done. The main loop has been updated to call get_signal which handles SIGSTOP, freezing, and collects the message that tells the thread to exit as part of process exit. This collection clears __fatal_signal_pending. This collection is not guaranteed to clear signal_pending() so clear that explicitly so the schedule() sleeps. For now the vhost thread continues to exist and run work until the last file descriptor is closed and the release function is called as part of freeing struct file. To avoid hangs in the coredump rendezvous and when killing threads in a multi-threaded exec. The coredump code and de_thread have been modified to ignore vhost threads. Remvoing the special case for exec appears to require teaching vhost_dev_flush how to directly complete transactions in case the vhost thread is no longer running. Removing the special case for coredump rendezvous requires either the above fix needed for exec or moving the coredump rendezvous into get_signal. Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads") Signed-off-by: Eric W. Biederman Co-developed-by: Mike Christie Signed-off-by: Mike Christie --- arch/x86/include/asm/fpu/sched.h | 2 +- arch/x86/kernel/fpu/context.h| 2 +- arch/x86/kernel/fpu/core.c | 2 +- drivers/vhost/vhost.c| 22 ++-- fs/coredump.c| 4 +- include/linux/sched/task.h | 1 - include/linux/sched/vhost_task.h | 15 ++ kernel/exit.c| 5 +- kernel/fork.c| 13 ++--- kernel/signal.c | 8 +-- kernel/vhost_task.c | 92 +--- 11 files changed, 89 insertions(+), 77 deletions(-) diff --git a/arch/x86/include/asm/fpu/sched.h b/arch/x86/include/asm/fpu/sched.h index c2d6cd78ed0c..78fcde7b1f07 100644 --- a/arch/x86/include/asm/fpu/sched.h +++ b/arch/x86/include/asm/fpu/sched.h @@ -39,7 +39,7 @@ extern void fpu_flush_thread(void); static inline void switch_fpu_prepare(struct fpu *old_fpu, int cpu) { if (cpu_feature_enabled(X86_FEATURE_FPU) && - !(current->flags & (PF_KTHREAD | PF_IO_WORKER))) { + !(current->flags & (PF_KTHREAD | PF_USER_WORKER))) { save_fpregs_to_fpstate(old_fpu); /* * The save operation preserved register state, so the diff --git a/arch/x86/kernel/fpu/context.h b/arch/x86/kernel/fpu/context.h index 9fcfa5c4dad7..af5cbdd9bd29 100644 --- a/arch/x86/kernel/fpu/context.h +++ b/arch/x86/kernel/fpu/context.h @@ -57,7 +57,7 @@ static inline void fpregs_restore_userregs(void) struct fpu *fpu = ¤t->thread.fpu; int cpu = smp_processor_id(); - if (WARN_ON_ONCE(current->flags & (PF_KTHREAD | PF_IO_WORKER))) + if (WARN_ON_ONCE(current->flags & (PF_KTHREAD | PF_USER_WORKER))) return; if (!fpregs_state_valid(fpu, cpu)) { diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c index caf33486dc5e..1015af1ae562 100644 --- a/arch/x86/kernel/fpu/core.c +++ b/arch/x86/kernel/fpu/core.c @@ -426,7 +426,7 @@ void kernel_fpu_begin_mask(unsigned int kfpu_mask) this_cpu_write(in_kernel_fpu, true); - if (!(current->flags & (PF_KTHREAD | PF_IO_WORKER)) && + if (!(current->flags & (PF_KTHREAD | PF_USER_WORKER)) && !test_thread_flag(TIF_NEED_FPU_LOAD)) { set_thread_flag(TIF_NEED_FPU_LOAD); save_fpregs_to_fpstate(¤t->thread.fpu); diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index a92af08e7864..074273020849 100644 --- a/d
Re: [syzbot] [kvm?] [net?] [virt?] general protection fault in vhost_work_queue
On 6/1/23 2:47 AM, Stefano Garzarella wrote: >> >> static void vhost_worker_free(struct vhost_dev *dev) >> { >> - struct vhost_worker *worker = dev->worker; >> + struct vhost_task *vtsk = READ_ONCE(dev->worker.vtsk); >> >> - if (!worker) >> + if (!vtsk) >> return; >> >> - dev->worker = NULL; >> - WARN_ON(!llist_empty(&worker->work_list)); >> - vhost_task_stop(worker->vtsk); >> - kfree(worker); >> + vhost_task_stop(vtsk); >> + WARN_ON(!llist_empty(&dev->worker.work_list)); >> + WRITE_ONCE(dev->worker.vtsk, NULL); > > The patch LGTM, I just wonder if we should set dev->worker to zero here, We might want to just set kcov_handle to zero for now. In 6.3 and older, I think we could do: 1. vhost_dev_set_owner could successfully set dev->worker. 2. vhost_transport_send_pkt runs vhost_work_queue and sees worker is set and adds the vhost_work to the work_list. 3. vhost_dev_set_owner fails in vhost_attach_cgroups, so we stop the worker before the work can be run and set worker to NULL. 4. We clear kcov_handle and return. We leave the work on the work_list. 5. Userspace can then retry vhost_dev_set_owner. If that works, then the work gets executed ok eventually. OR Userspace can just close the device. vhost_vsock_dev_release would eventually call vhost_dev_cleanup (vhost_dev_flush won't see a worker so will just return), and that will hit the WARN_ON but we would proceed ok. If I do a memset of the worker, then if userspace were to retry VHOST_SET_OWNER, we would lose the queued work since the work_list would get zero'd. I think it's unlikely this ever happens, but you know best so let me know if this a real issue. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND
On 6/1/23 5:47 AM, Christian Brauner wrote: > On Thu, Jun 01, 2023 at 09:58:38AM +0200, Thorsten Leemhuis wrote: >> On 19.05.23 14:15, Christian Brauner wrote: >>> On Thu, May 18, 2023 at 10:25:11AM +0200, Christian Brauner wrote: >>>> On Wed, May 17, 2023 at 07:09:12PM -0500, Mike Christie wrote: >>>>> This patch allows the vhost and vhost_task code to use CLONE_THREAD, >>>>> CLONE_SIGHAND and CLONE_FILES. It's a RFC because I didn't do all the >>>>> normal testing, haven't coverted vsock and vdpa, and I know you guys >>>>> will not like the first patch. However, I think it better shows what >>>> Just to summarize the core idea behind my proposal is that no signal >>>> handling changes are needed unless there's a bug in the current way >>>> io_uring workers already work. All that should be needed is >>>> s/PF_IO_WORKER/PF_USER_WORKER/ in signal.c. >> [...] >>>> So it feels like this should be achievable by adding a callback to >>>> struct vhost_worker that get's called when vhost_worker() gets SIGKILL >>>> and that all the users of vhost workers are forced to implement. >>>> >>>> Yes, it is more work but I think that's the right thing to do and not to >>>> complicate our signal handling. >>>> >>>> Worst case if this can't be done fast enough we'll have to revert the >>>> vhost parts. I think the user worker parts are mostly sane and are >>> As mentioned, if we can't settle this cleanly before -rc4 we should >>> revert the vhost parts unless Linus wants to have it earlier. >> Meanwhile -rc5 is just a few days away and there are still a lot of >> discussions in the patch-set proposed to address the issues[1]. Which is >> kinda great (albeit also why I haven't given it a spin yet), but on the >> other hand makes we wonder: > You might've missed it in the thread but it seems everyone is currently > operating under the assumption that the preferred way is to fix this is > rather than revert. See the mail in [1]: > > "So I'd really like to finish this. Even if we end up with a hack or > two in signal handling that we can hopefully fix up later by having > vhost fix up some of its current assumptions." > > which is why no revert was send for -rc4. And there's a temporary fix we > seem to have converged on. > > @Mike, do you want to prepare an updated version of the temporary fix. > If @Linus prefers to just apply it directly he can just grab it from the > list rather than delaying it. Make sure to grab a Co-developed-by line > on this, @Mike. Yes, I'll send it within a couple hours. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [syzbot] [kvm?] [net?] [virt?] general protection fault in vhost_work_queue
On 5/31/23 10:15 AM, Mike Christie wrote: >>> rcu would work for your case and for what Jason had requested. >> Yeah, so you already have some patches? >> >> Do you want to send it to solve this problem? >> > Yeah, I'll break them out and send them later today when I can retest > rebased patches. > Just one question. Do you core vhost developers consider RCU more complex or switching to READ_ONCE/WRITE_ONCE? I am asking because for this immediate regression fix we could just switch to the latter like below to just fix the crash if we think that is more simple. I think RCU is just a little more complex/invasive because it will have the extra synchronize_rcu calls. diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index a92af08e7864..03fd47a22a73 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -235,7 +235,7 @@ void vhost_dev_flush(struct vhost_dev *dev) { struct vhost_flush_struct flush; - if (dev->worker) { + if (READ_ONCE(dev->worker.vtsk)) { init_completion(&flush.wait_event); vhost_work_init(&flush.work, vhost_flush_work); @@ -247,7 +247,9 @@ EXPORT_SYMBOL_GPL(vhost_dev_flush); void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work) { - if (!dev->worker) + struct vhost_task *vtsk = READ_ONCE(dev->worker.vtsk); + + if (!vtsk) return; if (!test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) { @@ -255,8 +257,8 @@ void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work) * sure it was not in the list. * test_and_set_bit() implies a memory barrier. */ - llist_add(&work->node, &dev->worker->work_list); - wake_up_process(dev->worker->vtsk->task); + llist_add(&work->node, &dev->worker.work_list); + wake_up_process(vtsk->task); } } EXPORT_SYMBOL_GPL(vhost_work_queue); @@ -264,7 +266,7 @@ EXPORT_SYMBOL_GPL(vhost_work_queue); /* A lockless hint for busy polling code to exit the loop */ bool vhost_has_work(struct vhost_dev *dev) { - return dev->worker && !llist_empty(&dev->worker->work_list); + return !llist_empty(&dev->worker.work_list); } EXPORT_SYMBOL_GPL(vhost_has_work); @@ -468,7 +470,7 @@ void vhost_dev_init(struct vhost_dev *dev, dev->umem = NULL; dev->iotlb = NULL; dev->mm = NULL; - dev->worker = NULL; + memset(&dev->worker, 0, sizeof(dev->worker)); dev->iov_limit = iov_limit; dev->weight = weight; dev->byte_weight = byte_weight; @@ -542,46 +544,38 @@ static void vhost_detach_mm(struct vhost_dev *dev) static void vhost_worker_free(struct vhost_dev *dev) { - struct vhost_worker *worker = dev->worker; + struct vhost_task *vtsk = READ_ONCE(dev->worker.vtsk); - if (!worker) + if (!vtsk) return; - dev->worker = NULL; - WARN_ON(!llist_empty(&worker->work_list)); - vhost_task_stop(worker->vtsk); - kfree(worker); + vhost_task_stop(vtsk); + WARN_ON(!llist_empty(&dev->worker.work_list)); + WRITE_ONCE(dev->worker.vtsk, NULL); } static int vhost_worker_create(struct vhost_dev *dev) { - struct vhost_worker *worker; struct vhost_task *vtsk; char name[TASK_COMM_LEN]; int ret; - worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT); - if (!worker) - return -ENOMEM; - - dev->worker = worker; - worker->kcov_handle = kcov_common_handle(); - init_llist_head(&worker->work_list); + dev->worker.kcov_handle = kcov_common_handle(); + init_llist_head(&dev->worker.work_list); snprintf(name, sizeof(name), "vhost-%d", current->pid); - vtsk = vhost_task_create(vhost_worker, worker, name); + vtsk = vhost_task_create(vhost_worker, &dev->worker, name); if (!vtsk) { ret = -ENOMEM; goto free_worker; } - worker->vtsk = vtsk; + WRITE_ONCE(dev->worker.vtsk, vtsk); vhost_task_start(vtsk); return 0; free_worker: - kfree(worker); - dev->worker = NULL; + WRITE_ONCE(dev->worker.vtsk, NULL); return ret; } diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h index 0308638cdeee..305ec8593d46 100644 --- a/drivers/vhost/vhost.h +++ b/drivers/vhost/vhost.h @@ -154,7 +154,7 @@ struct vhost_dev { struct vhost_virtqueue **vqs; int nvqs; struct eventfd_ctx *log_ctx; - struct vhost_worker *worker; + struct vhost_worker worker; struct vhost_iotlb *umem; struct vhost_iotlb *iotlb; spinlock_t iotlb_lock; ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [syzbot] [kvm?] [net?] [virt?] general protection fault in vhost_work_queue
On 5/31/23 2:27 AM, Stefano Garzarella wrote: > On Tue, May 30, 2023 at 6:30 PM wrote: >> >> On 5/30/23 11:17 AM, Stefano Garzarella wrote: >>> On Tue, May 30, 2023 at 11:09:09AM -0500, Mike Christie wrote: >>>> On 5/30/23 11:00 AM, Stefano Garzarella wrote: >>>>> I think it is partially related to commit 6e890c5d5021 ("vhost: use >>>>> vhost_tasks for worker threads") and commit 1a5f8090c6de ("vhost: move >>>>> worker thread fields to new struct"). Maybe that commits just >>>>> highlighted the issue and it was already existing. >>>> >>>> See my mail about the crash. Agree with your analysis about worker->vtsk >>>> not being set yet. It's a bug from my commit where I should have not set >>>> it so early or I should be checking for >>>> >>>> if (dev->worker && worker->vtsk) >>>> >>>> instead of >>>> >>>> if (dev->worker) >>> >>> Yes, though, in my opinion the problem may persist depending on how the >>> instructions are reordered. >> >> Ah ok. >> >>> >>> Should we protect dev->worker() with an RCU to be safe? >> >> For those multiple worker patchsets Jason had asked me about supporting >> where we don't have a worker while we are swapping workers around. To do >> that I had added rcu around the dev->worker. I removed it in later patchsets >> because I didn't think anyone would use it. >> >> rcu would work for your case and for what Jason had requested. > > Yeah, so you already have some patches? > > Do you want to send it to solve this problem? > Yeah, I'll break them out and send them later today when I can retest rebased patches. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
On 5/29/23 9:38 PM, Eric W. Biederman wrote: > diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c > index b7cbd66f889e..f3ce0fa6edd7 100644 > --- a/kernel/vhost_task.c > +++ b/kernel/vhost_task.c ... > static int vhost_task_fn(void *data) > { > struct vhost_task *vtsk = data; > - int ret; > + bool dead = false; > + > + for (;;) { > + bool did_work; > + > + /* mb paired w/ kthread_stop */ > + set_current_state(TASK_INTERRUPTIBLE); > + > + if (test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags)) { > + __set_current_state(TASK_RUNNING); > + break; > + } > + > + if (!dead && signal_pending(current)) { > + struct ksignal ksig; > + /* > + * Calling get_signal will block in SIGSTOP, > + * or clear fatal_signal_pending, but remember > + * what was set. > + * > + * This thread won't actually exit until all > + * of the file descriptors are closed, and > + * the release function is called. > + */ > + dead = get_signal(&ksig); Hey Eric, the patch works well. Thanks! There was just one small issue. get_signal() does try_to_freeze() -> ... __might_sleep() which wants the state to be TASK_RUNNING. We just need the patch below on top of yours which I think also cleans up some of the state setting weirdness with the code like where vhost.c calls __set_current_state(TASK_RUNNING) for each work. It looks like that was not needed for any reason like a work->fn changing the state and the old code could have done: node = llist_del_all(&worker->work_list); if (!node) { schedule(); continue; } else { __set_current_state(TASK_RUNNING); } So I think we can do the following on top of your patch: diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 221d1b6c1be5..074273020849 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -346,7 +346,6 @@ static bool vhost_worker(void *data) smp_wmb(); llist_for_each_entry_safe(work, work_next, node, node) { clear_bit(VHOST_WORK_QUEUED, &work->flags); - __set_current_state(TASK_RUNNING); kcov_remote_start_common(worker->kcov_handle); work->fn(work); kcov_remote_stop(); diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c index f3ce0fa6edd7..fead2ed29561 100644 --- a/kernel/vhost_task.c +++ b/kernel/vhost_task.c @@ -29,12 +29,8 @@ static int vhost_task_fn(void *data) bool did_work; /* mb paired w/ kthread_stop */ - set_current_state(TASK_INTERRUPTIBLE); - - if (test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags)) { - __set_current_state(TASK_RUNNING); + if (test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags)) break; - } if (!dead && signal_pending(current)) { struct ksignal ksig; @@ -53,8 +49,10 @@ static int vhost_task_fn(void *data) } did_work = vtsk->fn(vtsk->data); - if (!did_work) + if (!did_work) { + set_current_state(TASK_INTERRUPTIBLE); schedule(); + } } complete(&vtsk->exited); ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [syzbot] [kvm?] [net?] [virt?] general protection fault in vhost_work_queue
On 5/30/23 11:00 AM, Stefano Garzarella wrote: > I think it is partially related to commit 6e890c5d5021 ("vhost: use > vhost_tasks for worker threads") and commit 1a5f8090c6de ("vhost: move > worker thread fields to new struct"). Maybe that commits just > highlighted the issue and it was already existing. See my mail about the crash. Agree with your analysis about worker->vtsk not being set yet. It's a bug from my commit where I should have not set it so early or I should be checking for if (dev->worker && worker->vtsk) instead of if (dev->worker) One question about the behavior before my commit though and what we want in the end going forward. Before that patch we would just drop work if vhost_work_queue was called before VHOST_SET_OWNER. Was that correct/expected? The call to vhost_work_queue in vhost_vsock_start was only seeing the works queued after VHOST_SET_OWNER. Did you want works queued before that? ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [syzbot] [kvm?] [net?] [virt?] general protection fault in vhost_work_queue
On 5/30/23 10:58 AM, Mike Christie wrote: > On 5/30/23 8:44 AM, Stefano Garzarella wrote: >> >> From a first glance, it looks like an issue when we call vhost_work_queue(). >> @Mike, does that ring any bells since you recently looked at that code? > > I see the bug. needed to have set the dev->worker after setting worker->vtsk > like below: > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > index a92af08e7864..7bd95984a501 100644 > --- a/drivers/vhost/vhost.c > +++ b/drivers/vhost/vhost.c > @@ -564,7 +564,6 @@ static int vhost_worker_create(struct vhost_dev *dev) > if (!worker) > return -ENOMEM; > > - dev->worker = worker; > worker->kcov_handle = kcov_common_handle(); > init_llist_head(&worker->work_list); > snprintf(name, sizeof(name), "vhost-%d", current->pid); > @@ -576,6 +575,7 @@ static int vhost_worker_create(struct vhost_dev *dev) > } > > worker->vtsk = vtsk; Shoot, oh wait, I think I needed a smp_wmb to always make sure worker->vtask is set before dev->worker or vhost_work_queue could still end up seeing dev->worker set before worker->vtsk right? > + dev->worker = worker;> vhost_task_start(vtsk); > return 0; > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [syzbot] [kvm?] [net?] [virt?] general protection fault in vhost_work_queue
On 5/30/23 8:44 AM, Stefano Garzarella wrote: > > From a first glance, it looks like an issue when we call vhost_work_queue(). > @Mike, does that ring any bells since you recently looked at that code? I see the bug. needed to have set the dev->worker after setting worker->vtsk like below: diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index a92af08e7864..7bd95984a501 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -564,7 +564,6 @@ static int vhost_worker_create(struct vhost_dev *dev) if (!worker) return -ENOMEM; - dev->worker = worker; worker->kcov_handle = kcov_common_handle(); init_llist_head(&worker->work_list); snprintf(name, sizeof(name), "vhost-%d", current->pid); @@ -576,6 +575,7 @@ static int vhost_worker_create(struct vhost_dev *dev) } worker->vtsk = vtsk; + dev->worker = worker; vhost_task_start(vtsk); return 0; ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
On 5/29/23 9:38 PM, Eric W. Biederman wrote: > Mike is there any chance you can test the change below? Yeah, I'm firing up tests now. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
On 5/29/23 12:46 PM, Oleg Nesterov wrote: > Mike, sorry, I don't understand your email. > > Just in case, let me remind I know nothing about drivers/vhost/ > No problem. I get it. I don't know the signal/thread code so it's one of those things where I'm also learning as I go. > On 05/29, michael.chris...@oracle.com wrote: >> >> On 5/29/23 6:19 AM, Oleg Nesterov wrote: >>> On 05/27, Eric W. Biederman wrote: Looking forward I don't see not asking the worker threads to stop for the coredump right now causing any problems in the future. So I think we can use this to resolve the coredump issue I spotted. >>> >>> But we have almost the same problem with exec. >>> >>> Execing thread will wait for vhost_worker() while vhost_worker will wait for >>> .release -> vhost_task_stop(). >> >> For this type of case, what is the goal or correct behavior in the end? >> >> When get_signal returns true we can code things like you mention below > > and you have mentioned in the next email that you have already coded something > like this, so perhaps we can delay the further discussions until you send the > new code? > Ok. Let me post that. You guys and the vhost devs can argue about if it's too ugly to merge :) >> and >> clean up the task_struct. > > Hmm... If we you CLONE_THREAD the exiting vhost_worker() will auto-reap > itself, Oh wait, are you saying that when we get auto-reaped then we would do the last fput and call the file_operations->release function right? We actually set task_struct->files = NULL for the vhost_task task_struct, so I think we call release a little sooner than you think. vhost_task_create() sets kernel_clone_args->no_files, so the vhost_task task_struc that gets created works like kthreads where it doesn't do a CLONE_FILES and it doesn't do a dup_fd. So when we do de_thread() -> zap_other_threads(), that will kill all the threads in the group right? So when they exit, it will call our release function since we don't have refcount on ourself. > >> However, we now have a non-functioning vhost device >> open and just sitting around taking up memory and it can't do any IO. > > can't comment, see above. > >> For this type of case, do we expect just not to crash/hang, or was this new >> exec'd thread suppose to be able to use the vhost device? > > I just tried to point out that (unless I missed something) there are more > corner > cases, not just coredump. Ok. I see. > >>> Or suppose that vhost_worker's sub-thread forks a child with CLONE_FILES... >> >> You mean the vhost_task's task/thread doing a function that does a >> copy_process >> right? > > I meant that the vhost_task's sub-thread can do sys_clone(CLONE_FILES) from > userspace. I think we were talking about different things. I was saying that when the vhost layer does vhost_task_create() -> copy_process(), the kernel_clone_args.fn is vhost_task_fn() -> vhost_worker(). I thought you were concerned about vhost_worker() or some function it calls, calling copy_process() with CLONE_FILES. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
On 5/29/23 12:54 PM, Oleg Nesterov wrote: > On 05/29, Oleg Nesterov wrote: >> >> Mike, sorry, I don't understand your email. >> >> Just in case, let me remind I know nothing about drivers/vhost/ > > And... this is slightly off-topic but you didn't answer my previous > question and I am just curious. Do you agree that, even if we forget > about CLONE_THREAD/signals, vhost_worker() needs fixes anyway because > it can race with vhost_work_queue() ? You saw the reply where I wrote I was waiting for the vhost devs to reply because it's their code, right? Just wanted to make sure you know I'm not ignoring your mails. The info is very valuable to me since I don't know the signal code. - I think you are right about using a continue after schedule. - The work fn is stable. It's setup once and never changes. - I have no idea why we do the __set_current_state(TASK_RUNNING). As far as I can tell the work functions do not change the task state and that might have been a left over from something else. However, the vhost devs might know of some case. - For the barrier use, I think you are right, but I couldn't trigger an issue even if I hack up different timings. So I was hoping the vhost devs know of something else in there. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
On 5/27/23 8:41 PM, Eric W. Biederman wrote: > Mike Christie writes: > >> On 5/23/23 7:15 AM, Oleg Nesterov wrote: >>> >>> Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right >>> before we call work->fn(). Is it "safe" to run this callback with >>> signal_pending() or fatal_signal_pending() ? >> >> The questions before this one I'll leave for the core vhost devs since >> they know best. > > Let me ask a clarifying question: > > Is it only the call to schedule() in vhost_worker that you are worried > about not sleeping if signal_pending() or fatal_signal_pending()? It will only be the vhost_worker call to schedule(). When we do the file_operation's release call, we normally set things up so the work->fn just fails and cleans up. I can pretty easily move that code into a helper and do: if (get_signal(ksig)) new_function_to_tell_drivers_that_all_work_fns_should_fail() ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
On 5/23/23 7:15 AM, Oleg Nesterov wrote: > > Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right > before we call work->fn(). Is it "safe" to run this callback with > signal_pending() or fatal_signal_pending() ? The questions before this one I'll leave for the core vhost devs since they know best. For this question and the one below, when we get SIGKILL we think it's ok to fail the operation as in it's ok to not execute it like normal (send it down to the net/target/scsi/block layers for execution). However, we need to do some processing of the work to release mem, refcounts, etc. For example we use the vhost_task to handle completions from the layers we interact with. So when we get a SIGKILL, we could have N commands in the target/block/scsi/nvme layers below the vhost layer. When those complete, the current code assumes we have the vhost_task to handle the responses. So for pending works on that work_list, we can pretty easily kill them. However, we don't have a way to kill those outstanding requests to some other layer, so we have to wait and handle them somewhere. If you are saying that once we get SIGKILL then we just can't use the task anymore and we have to drop down to workqueue/kthread or change up the completion paths so they can execute in the context they are completed from by lower levels, then I can code this. On the vhost side, it's just really ugly vs the code we have now that used to use kthreads (or in 6.4-rc looks like a process) so we couldn't get signals and just had the one code path that removes us. >From the vhost point of view signals are not useful to us and it's just adding complexity for something that I don't think is useful to users. However after discussing this with guys for a week and going over the signal code, I can see your point of views where you guys are thinking its ugly for the signal code to try and support what we want :) > > > Finally. I never looked into drivers/vhost/ before so I don't understand > this code at all, but let me ask anyway... Can we change vhost_dev_flush() > to run the pending callbacks rather than wait for vhost_worker() ? > I guess we can't, ->mm won't be correct, but can you confirm? > > Oleg. > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/3] vhost-scsi: Fix alignment handling with windows
On 5/25/23 2:57 AM, Michael S. Tsirkin wrote: > On Wed, May 24, 2023 at 06:34:05PM -0500, Mike Christie wrote: >> The linux block layer requires bios/requests to have lengths with a 512 >> byte alignment. Some drivers/layers like dm-crypt and the directi IO code >> will test for it and just fail. Other drivers like SCSI just assume the >> requirement is met and will end up in infinte retry loops. The problem >> for drivers like SCSI is that it uses functions like blk_rq_cur_sectors >> and blk_rq_sectors which divide the request's length by 512. If there's >> lefovers then it just gets dropped. But other code in the block/scsi >> layer may use blk_rq_bytes/blk_rq_cur_bytes and end up thinking there is >> still data left and try to retry the cmd. We can then end up getting >> stuck in retry loops where part of the block/scsi thinks there is data >> left, but other parts think we want to do IOs of zero length. >> >> Linux will always check for alignment, but windows will not. When >> vhost-scsi then translates the iovec it gets from a windows guest to a >> scatterlist, we can end up with sg items where the sg->length is not >> divisible by 512 due to the misaligned offset: >> >> sg[0].offset = 255; >> sg[0].length = 3841; >> sg... >> sg[N].offset = 0; >> sg[N].length = 255; >> >> When the lio backends then convert the SG to bios or other iovecs, we >> end up sending them with the same misaligned values and can hit the >> issues above. >> >> This just has us drop down to allocating a temp page and copying the data >> when this happens. Because this adds a check that needs to loop over the >> iovec in the main IO path, this patch adds an attribute that can be turned >> on for just windows. >> >> Signed-off-by: Mike Christie > > I am assuming this triggers rarely, yes? > > If so I would like to find a way to avoid setting an attribute. > > I am guessing the main cost is an extra scan of iov. The scan and a memory allocation to dup the iter. However, I see iov_iter_revert and I think that might be what I needed before to avoid the mem alloc so will try it out. > vhost_scsi_iov_to_sgl already does a scan, how about changing > it to fail if misaligned? > We can then detect the relevant error code and try with a copy. > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 3/3] vhost-scsi: Rename vhost_scsi_iov_to_sgl
Rename vhost_scsi_iov_to_sgl to vhost_scsi_map_iov_to_sgl so it matches matches the naming style used for vhost_scsi_copy_iov_to_sgl. Signed-off-by: Mike Christie --- drivers/vhost/scsi.c | 13 +++-- 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c index 382158b39740..a4d32b96e66a 100644 --- a/drivers/vhost/scsi.c +++ b/drivers/vhost/scsi.c @@ -694,8 +694,8 @@ vhost_scsi_calc_sgls(struct iov_iter *iter, size_t bytes, int max_sgls) } static int -vhost_scsi_iov_to_sgl(struct vhost_scsi_cmd *cmd, struct iov_iter *iter, - struct scatterlist *sg, int sg_count) +vhost_scsi_map_iov_to_sgl(struct vhost_scsi_cmd *cmd, struct iov_iter *iter, + struct scatterlist *sg, int sg_count) { struct scatterlist *p = sg; int ret; @@ -778,8 +778,9 @@ vhost_scsi_mapal(struct vhost_scsi_tpg *tpg, struct vhost_scsi_cmd *cmd, pr_debug("%s prot_sg %p prot_sgl_count %u\n", __func__, cmd->tvc_prot_sgl, cmd->tvc_prot_sgl_count); - ret = vhost_scsi_iov_to_sgl(cmd, prot_iter, cmd->tvc_prot_sgl, - cmd->tvc_prot_sgl_count); + ret = vhost_scsi_map_iov_to_sgl(cmd, prot_iter, + cmd->tvc_prot_sgl, + cmd->tvc_prot_sgl_count); if (ret < 0) goto map_fail; } @@ -808,8 +809,8 @@ vhost_scsi_mapal(struct vhost_scsi_tpg *tpg, struct vhost_scsi_cmd *cmd, ret = vhost_scsi_copy_iov_to_sgl(cmd, data_iter, cmd->tvc_sgl, cmd->tvc_sgl_count); else - ret = vhost_scsi_iov_to_sgl(cmd, data_iter, cmd->tvc_sgl, - cmd->tvc_sgl_count); + ret = vhost_scsi_map_iov_to_sgl(cmd, data_iter, cmd->tvc_sgl, + cmd->tvc_sgl_count); if (ret) goto map_fail; -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 1/3] vhost-scsi: Fix alignment handling with windows
The linux block layer requires bios/requests to have lengths with a 512 byte alignment. Some drivers/layers like dm-crypt and the directi IO code will test for it and just fail. Other drivers like SCSI just assume the requirement is met and will end up in infinte retry loops. The problem for drivers like SCSI is that it uses functions like blk_rq_cur_sectors and blk_rq_sectors which divide the request's length by 512. If there's lefovers then it just gets dropped. But other code in the block/scsi layer may use blk_rq_bytes/blk_rq_cur_bytes and end up thinking there is still data left and try to retry the cmd. We can then end up getting stuck in retry loops where part of the block/scsi thinks there is data left, but other parts think we want to do IOs of zero length. Linux will always check for alignment, but windows will not. When vhost-scsi then translates the iovec it gets from a windows guest to a scatterlist, we can end up with sg items where the sg->length is not divisible by 512 due to the misaligned offset: sg[0].offset = 255; sg[0].length = 3841; sg... sg[N].offset = 0; sg[N].length = 255; When the lio backends then convert the SG to bios or other iovecs, we end up sending them with the same misaligned values and can hit the issues above. This just has us drop down to allocating a temp page and copying the data when this happens. Because this adds a check that needs to loop over the iovec in the main IO path, this patch adds an attribute that can be turned on for just windows. Signed-off-by: Mike Christie --- drivers/vhost/scsi.c | 174 +-- 1 file changed, 151 insertions(+), 23 deletions(-) diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c index bb10fa4bb4f6..dbad8fb579eb 100644 --- a/drivers/vhost/scsi.c +++ b/drivers/vhost/scsi.c @@ -25,6 +25,8 @@ #include #include #include +#include +#include #include #include #include @@ -75,6 +77,9 @@ struct vhost_scsi_cmd { u32 tvc_prot_sgl_count; /* Saved unpacked SCSI LUN for vhost_scsi_target_queue_cmd() */ u32 tvc_lun; + u32 copied_iov:1; + const void *saved_iter_addr; + struct iov_iter saved_iter; /* Pointer to the SGL formatted memory from virtio-scsi */ struct scatterlist *tvc_sgl; struct scatterlist *tvc_prot_sgl; @@ -107,6 +112,7 @@ struct vhost_scsi_nexus { struct vhost_scsi_tpg { /* Vhost port target portal group tag for TCM */ u16 tport_tpgt; + bool check_io_alignment; /* Used to track number of TPG Port/Lun Links wrt to explict I_T Nexus shutdown */ int tv_tpg_port_count; /* Used for vhost_scsi device reference to tpg_nexus, protected by tv_tpg_mutex */ @@ -328,8 +334,13 @@ static void vhost_scsi_release_cmd_res(struct se_cmd *se_cmd) int i; if (tv_cmd->tvc_sgl_count) { - for (i = 0; i < tv_cmd->tvc_sgl_count; i++) - put_page(sg_page(&tv_cmd->tvc_sgl[i])); + for (i = 0; i < tv_cmd->tvc_sgl_count; i++) { + if (tv_cmd->copied_iov) + __free_page(sg_page(&tv_cmd->tvc_sgl[i])); + else + put_page(sg_page(&tv_cmd->tvc_sgl[i])); + } + kfree(tv_cmd->saved_iter_addr); } if (tv_cmd->tvc_prot_sgl_count) { for (i = 0; i < tv_cmd->tvc_prot_sgl_count; i++) @@ -502,6 +513,27 @@ static void vhost_scsi_evt_work(struct vhost_work *work) mutex_unlock(&vq->mutex); } +static int vhost_scsi_copy_sgl_to_iov(struct vhost_scsi_cmd *cmd) +{ + struct iov_iter *iter = &cmd->saved_iter; + struct scatterlist *sg = cmd->tvc_sgl; + struct page *page; + size_t len; + int i; + + for (i = 0; i < cmd->tvc_sgl_count; i++) { + page = sg_page(&sg[i]); + len = sg[i].length; + + if (copy_page_to_iter(page, 0, len, iter) != len) { + pr_err("Could not copy data. Error %lu\n", len); + return -1; + } + } + + return 0; +} + /* Fill in status and signal that we are done processing this command * * This is scheduled in the vhost work queue so we are called with the owner @@ -525,15 +557,20 @@ static void vhost_scsi_complete_cmd_work(struct vhost_work *work) pr_debug("%s tv_cmd %p resid %u status %#02x\n", __func__, cmd, se_cmd->residual_count, se_cmd->scsi_status); - memset(&v_rsp, 0, sizeof(v_rsp)); - v_rsp.resid = cpu_to_vhost32(cmd->tvc_vq, se_cmd->residual_count); - /* TODO is status_qualifier field needed? */ - v_rsp.status = se_cmd->scsi_status; - v_rsp.sense_len = cpu_to_vh
[PATCH 2/3] vhost-scsi: Remove unused write argument
The write arg that's passed to the mapping functions is not used so remove it. Signed-off-by: Mike Christie --- drivers/vhost/scsi.c | 19 +++ 1 file changed, 7 insertions(+), 12 deletions(-) diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c index dbad8fb579eb..382158b39740 100644 --- a/drivers/vhost/scsi.c +++ b/drivers/vhost/scsi.c @@ -649,10 +649,8 @@ vhost_scsi_get_cmd(struct vhost_virtqueue *vq, struct vhost_scsi_tpg *tpg, * Returns the number of scatterlist entries used or -errno on error. */ static int -vhost_scsi_map_to_sgl(struct vhost_scsi_cmd *cmd, - struct iov_iter *iter, - struct scatterlist *sgl, - bool write) +vhost_scsi_map_to_sgl(struct vhost_scsi_cmd *cmd, struct iov_iter *iter, + struct scatterlist *sgl) { struct page **pages = cmd->tvc_upages; struct scatterlist *sg = sgl; @@ -696,15 +694,14 @@ vhost_scsi_calc_sgls(struct iov_iter *iter, size_t bytes, int max_sgls) } static int -vhost_scsi_iov_to_sgl(struct vhost_scsi_cmd *cmd, bool write, - struct iov_iter *iter, +vhost_scsi_iov_to_sgl(struct vhost_scsi_cmd *cmd, struct iov_iter *iter, struct scatterlist *sg, int sg_count) { struct scatterlist *p = sg; int ret; while (iov_iter_count(iter)) { - ret = vhost_scsi_map_to_sgl(cmd, iter, sg, write); + ret = vhost_scsi_map_to_sgl(cmd, iter, sg); if (ret < 0) { while (p < sg) { struct page *page = sg_page(p++); @@ -769,7 +766,6 @@ vhost_scsi_mapal(struct vhost_scsi_tpg *tpg, struct vhost_scsi_cmd *cmd, size_t data_bytes, struct iov_iter *data_iter) { int sgl_count, ret; - bool write = (cmd->tvc_data_direction == DMA_FROM_DEVICE); if (prot_bytes) { sgl_count = vhost_scsi_calc_sgls(prot_iter, prot_bytes, @@ -782,8 +778,7 @@ vhost_scsi_mapal(struct vhost_scsi_tpg *tpg, struct vhost_scsi_cmd *cmd, pr_debug("%s prot_sg %p prot_sgl_count %u\n", __func__, cmd->tvc_prot_sgl, cmd->tvc_prot_sgl_count); - ret = vhost_scsi_iov_to_sgl(cmd, write, prot_iter, - cmd->tvc_prot_sgl, + ret = vhost_scsi_iov_to_sgl(cmd, prot_iter, cmd->tvc_prot_sgl, cmd->tvc_prot_sgl_count); if (ret < 0) goto map_fail; @@ -813,8 +808,8 @@ vhost_scsi_mapal(struct vhost_scsi_tpg *tpg, struct vhost_scsi_cmd *cmd, ret = vhost_scsi_copy_iov_to_sgl(cmd, data_iter, cmd->tvc_sgl, cmd->tvc_sgl_count); else - ret = vhost_scsi_iov_to_sgl(cmd, write, data_iter, - cmd->tvc_sgl, cmd->tvc_sgl_count); + ret = vhost_scsi_iov_to_sgl(cmd, data_iter, cmd->tvc_sgl, + cmd->tvc_sgl_count); if (ret) goto map_fail; -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 0/3] vhost-scsi: Fix IO hangs when using windows
The following patches were made over Linus's tree and fix an issue where windows guests will send iovecs with offset/lengths that result in IOs that are not aligned to 512. The LIO layer will then send them to Linux's block layer but it requires 512 byte alignment, so depending on the block driver being used we will get IO errors or hung IO. The following patches have vhost-scsi detect when windows sends these IOs and copy them to a bounce buffer. It then does some cleanup in the related code. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
Hey Oleg, For all these questions below let me get back to you by tomorrow. I need to confirm if something would be considered a regression by the core vhost developers. On 5/23/23 7:15 AM, Oleg Nesterov wrote: > On 05/22, Oleg Nesterov wrote: >> >> Right now I think that "int dead" should die, > > No, probably we shouldn't call get_signal() if we have already dequeued > SIGKILL. > >> but let me think tomorrow. > > May be something like this... I don't like it but I can't suggest anything > better > right now. > > bool killed = false; > > for (;;) { > ... > > node = llist_del_all(&worker->work_list); > if (!node) { > schedule(); > /* >* When we get a SIGKILL our release function will >* be called. That will stop new IOs from being queued >* and check for outstanding cmd responses. It will then >* call vhost_task_stop to tell us to return and exit. >*/ > if (signal_pending(current)) { > struct ksignal ksig; > > if (!killed) > killed = get_signal(&ksig); > > clear_thread_flag(TIF_SIGPENDING); > } > > continue; > } > > --- > But let me ask a couple of questions. Let's forget this patch, let's look at > the > current code: > > node = llist_del_all(&worker->work_list); > if (!node) > schedule(); > > node = llist_reverse_order(node); > ... process works ... > > To me this looks a bit confusing. Shouldn't we do > > if (!node) { > schedule(); > continue; > } > > just to make the code a bit more clear? If node == NULL then > llist_reverse_order() and llist_for_each_entry_safe() will do nothing. > But this is minor. > > > > /* make sure flag is seen after deletion */ > smp_wmb(); > llist_for_each_entry_safe(work, work_next, node, node) { > clear_bit(VHOST_WORK_QUEUED, &work->flags); > > I am not sure about smp_wmb + clear_bit. Once we clear VHOST_WORK_QUEUED, > vhost_work_queue() can add this work again and change work->node->next. > > That is why we use _safe, but we need to ensure that llist_for_each_safe() > completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared. > > So it seems that smp_wmb() can't help and should be removed, instead we need > > llist_for_each_entry_safe(...) { > smp_mb__before_atomic(); > clear_bit(VHOST_WORK_QUEUED, &work->flags); > > Also, if the work->fn pointer is not stable, we should read it before > smp_mb__before_atomic() as well. > > No? > > > __set_current_state(TASK_RUNNING); > > Why do we set TASK_RUNNING inside the loop? Does this mean that work->fn() > can return with current->state != RUNNING ? > > > work->fn(work); > > Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right > before we call work->fn(). Is it "safe" to run this callback with > signal_pending() or fatal_signal_pending() ? > > > Finally. I never looked into drivers/vhost/ before so I don't understand > this code at all, but let me ask anyway... Can we change vhost_dev_flush() > to run the pending callbacks rather than wait for vhost_worker() ? > I guess we can't, ->mm won't be correct, but can you confirm? > > Oleg. > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
On 5/22/23 2:40 PM, Michael S. Tsirkin wrote: >> return copy_process(NULL, 0, node, &args); >> diff --git a/kernel/signal.c b/kernel/signal.c >> index 8050fe23c732..a0f00a078cbb 100644 >> --- a/kernel/signal.c >> +++ b/kernel/signal.c >> @@ -2891,6 +2891,7 @@ bool get_signal(struct ksignal *ksig) >> >> return ksig->sig > 0; >> } >> +EXPORT_SYMBOL_GPL(get_signal); > > If you are exporting this, could you add documentation please? > Ok. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
On 5/22/23 7:30 AM, Oleg Nesterov wrote: > Confused, please help... > > On 05/21, Mike Christie wrote: >> >> --- a/drivers/vhost/vhost.c >> +++ b/drivers/vhost/vhost.c >> @@ -338,6 +338,7 @@ static int vhost_worker(void *data) >> struct vhost_worker *worker = data; >> struct vhost_work *work, *work_next; >> struct llist_node *node; >> +bool dead = false; >> >> for (;;) { >> /* mb paired w/ kthread_stop */ >> @@ -349,8 +350,22 @@ static int vhost_worker(void *data) >> } >> >> node = llist_del_all(&worker->work_list); >> -if (!node) >> +if (!node) { >> schedule(); >> +/* >> + * When we get a SIGKILL our release function will >> + * be called. That will stop new IOs from being queued >> + * and check for outstanding cmd responses. It will then >> + * call vhost_task_stop to tell us to return and exit. >> + */ > > But who will call the release function / vhost_task_stop() and when this > will happen after this thread gets SIGKILL ? When we get a SIGKILL, the thread that owns the device/vhost_task will also exit since it's the same thread group and it does: do_exit -> exit_files -> put_files_struct -> close_files -> fput which eventually (the actual put is done via the delayed work path in there) runs the file_operation->release. So the release function is being called in parallel more or less as the code above. > >> +if (!dead && signal_pending(current)) { >> +struct ksignal ksig; >> + >> +dead = get_signal(&ksig); >> +if (dead) >> +clear_thread_flag(TIF_SIGPENDING); > > If you do clear_thread_flag(TIF_SIGPENDING), then why do we need 1/3 ? You're right. I don't need it. I thought __fatal_signal_pending checked the shared pending as well but it only checks pending. > Also. Suppose that vhost_worker() dequeues SIGKILL and clears TIF_SIGPENDING. > > SIGSTOP, PTRACE_INTERRUPT, freezer can come and set TIF_SIGPENDING again. > In this case the main for (;;) loop will spin without sleeping until > vhost_task_should_stop() becomes true? I see. So I either have to be able to call get_signal after SIGKILL or at this time work like a kthread and ignore signals like a if (dead && signal_pending()) flush_signals() ? ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 0/3] vhost: Fix freezer/ps regressions
The following patches made over Linus's tree fix the 2 bugs: 1. vhost worker task shows up as a process forked from the parent that did VHOST_SET_OWNER ioctl instead of a process under root/kthreadd. This was causing breaking scripts. 2. vhost_tasks didn't disable or add support for freeze requests. The following patches fix these issues by making the vhost_task task a thread under the process that did the VHOST_SET_OWNER and uses get_signal() to handle freeze and SIGSTOP/KILL signals which is required when using CLONE_THREAD (really CLONE_THREAD requires CLONE_SIGHAND which requires SIGKILL/STOP to be supported). ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression
When switching from kthreads to vhost_tasks two bugs were added: 1. The vhost worker tasks's now show up as processes so scripts doing ps or ps a would not incorrectly detect the vhost task as another process. 2. kthreads disabled freeze by setting PF_NOFREEZE, but vhost tasks's didn't disable or add support for them. To fix both bugs, this switches the vhost task to be thread in the process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that SIGKILL/STOP support is required because CLONE_THREAD requires CLONE_SIGHAND which requires those 2 signals to be suppported. This a modified version of patch originally written by Linus which handles his review comment to himself to rename ignore_signals to block_signals to better represent what it now does. And it includes a change to vhost_worker() to support SIGSTOP/KILL and freeze, and it drops the wait use per Oleg's review comment that it's no longer needed when using CLONE_THREAD. Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads") Signed-off-by: Mike Christie --- drivers/vhost/vhost.c | 17 - include/linux/sched/task.h | 2 +- kernel/fork.c | 12 +++- kernel/signal.c| 1 + kernel/vhost_task.c| 16 5 files changed, 25 insertions(+), 23 deletions(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index a92af08e7864..bf83e9340e40 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -338,6 +338,7 @@ static int vhost_worker(void *data) struct vhost_worker *worker = data; struct vhost_work *work, *work_next; struct llist_node *node; + bool dead = false; for (;;) { /* mb paired w/ kthread_stop */ @@ -349,8 +350,22 @@ static int vhost_worker(void *data) } node = llist_del_all(&worker->work_list); - if (!node) + if (!node) { schedule(); + /* +* When we get a SIGKILL our release function will +* be called. That will stop new IOs from being queued +* and check for outstanding cmd responses. It will then +* call vhost_task_stop to tell us to return and exit. +*/ + if (!dead && signal_pending(current)) { + struct ksignal ksig; + + dead = get_signal(&ksig); + if (dead) + clear_thread_flag(TIF_SIGPENDING); + } + } node = llist_reverse_order(node); /* make sure flag is seen after deletion */ diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index 537cbf9a2ade..249a5ece9def 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -29,7 +29,7 @@ struct kernel_clone_args { u32 io_thread:1; u32 user_worker:1; u32 no_files:1; - u32 ignore_signals:1; + u32 block_signals:1; unsigned long stack; unsigned long stack_size; unsigned long tls; diff --git a/kernel/fork.c b/kernel/fork.c index ed4e01daccaa..9e04ab5c3946 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2338,14 +2338,10 @@ __latent_entropy struct task_struct *copy_process( p->flags |= PF_KTHREAD; if (args->user_worker) p->flags |= PF_USER_WORKER; - if (args->io_thread) { - /* -* Mark us an IO worker, and block any signal that isn't -* fatal or STOP -*/ + if (args->io_thread) p->flags |= PF_IO_WORKER; + if (args->block_signals) siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP)); - } if (args->name) strscpy_pad(p->comm, args->name, sizeof(p->comm)); @@ -2517,9 +2513,6 @@ __latent_entropy struct task_struct *copy_process( if (retval) goto bad_fork_cleanup_io; - if (args->ignore_signals) - ignore_signals(p); - stackleak_task_init(p); if (pid != &init_struct_pid) { @@ -2861,6 +2854,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node) .fn_arg = arg, .io_thread = 1, .user_worker= 1, + .block_signals = 1, }; return copy_process(NULL, 0, node, &args); diff --git a/kernel/signal.c b/kernel/signal.c index 8050fe23c732..a0f00a078cbb 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2891,6 +2891,7 @@ bool get_signal(struct ksignal *ksig)
[PATCH 2/3] signal: Don't exit for PF_USER_WORKER tasks
vhost_tasks also need to perform cleanup before exiting so this changes the check in get_signal to be more generic so both io thread and vhost tasks can do their cleanup. Signed-off-by: Mike Christie --- kernel/signal.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/signal.c b/kernel/signal.c index 3dc99b9aec7f..8050fe23c732 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2869,11 +2869,11 @@ bool get_signal(struct ksignal *ksig) } /* -* PF_IO_WORKER threads will catch and exit on fatal signals +* PF_USER_WORKER threads will catch and exit on fatal signals * themselves. They have cleanup that must be performed, so * we cannot call do_exit() on their behalf. */ - if (current->flags & PF_IO_WORKER) + if (current->flags & PF_USER_WORKER) goto out; /* -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 1/3] signal: Don't always put SIGKILL in shared_pending
When get_pending detects the task has been marked to be killed we try to clean up the SIGKLL by doing a sigdelset and recalc_sigpending, but we still leave it in shared_pending. If the signal is being short circuit delivered there is no need to put in shared_pending so this adds a check in complete_signal. This patch was modified from Eric Biederman original patch. Signed-off-by: Mike Christie --- kernel/signal.c | 8 1 file changed, 8 insertions(+) diff --git a/kernel/signal.c b/kernel/signal.c index 8f6330f0e9ca..3dc99b9aec7f 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -1052,6 +1052,14 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type) signal->flags = SIGNAL_GROUP_EXIT; signal->group_exit_code = sig; signal->group_stop_count = 0; + + /* +* The signal is being short circuit delivered so +* don't set pending. +*/ + if (type != PIDTYPE_PID) + sigdelset(&signal->shared_pending.signal, sig); + t = p; do { task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK); -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
On 5/18/23 11:16 PM, Eric W. Biederman wrote: > Mike Christie writes: > >> On 5/18/23 1:28 PM, Eric W. Biederman wrote: >>> Still the big issue seems to be the way get_signal is connected into >>> these threads so that it keeps getting called. Calling get_signal after >>> a fatal signal has been returned happens nowhere else and even if we fix >>> it today it is likely to lead to bugs in the future because whoever is >>> testing and updating the code is unlikely they have a vhost test case >>> the care about. >>> >>> diff --git a/kernel/signal.c b/kernel/signal.c >>> index 8f6330f0e9ca..4d54718cad36 100644 >>> --- a/kernel/signal.c >>> +++ b/kernel/signal.c >>> @@ -181,7 +181,9 @@ void recalc_sigpending_and_wake(struct task_struct *t) >>> >>> void recalc_sigpending(void) >>> { >>> - if (!recalc_sigpending_tsk(current) && !freezing(current)) >>> + if ((!recalc_sigpending_tsk(current) && !freezing(current)) || >>> + ((current->signal->flags & SIGNAL_GROUP_EXIT) && >>> + !__fatal_signal_pending(current))) >>> clear_thread_flag(TIF_SIGPENDING); >>> >>> } >>> @@ -1043,6 +1045,13 @@ static void complete_signal(int sig, struct >>> task_struct *p, enum pid_type type) >>> * This signal will be fatal to the whole group. >>> */ >>> if (!sig_kernel_coredump(sig)) { >>> + /* >>> +* The signal is being short circuit delivered >>> +* don't it pending. >>> +*/ >>> + if (type != PIDTYPE_PID) { >>> + sigdelset(&t->signal->shared_pending, sig); >>> + >>> /* >>> * Start a group exit and wake everybody up. >>> * This way we don't have other threads >>> >> >> If I change up your patch so the last part is moved down a bit to when we >> set t >> like this: >> >> diff --git a/kernel/signal.c b/kernel/signal.c >> index 0ac48c96ab04..c976a80650db 100644 >> --- a/kernel/signal.c >> +++ b/kernel/signal.c >> @@ -181,9 +181,10 @@ void recalc_sigpending_and_wake(struct task_struct *t) >> >> void recalc_sigpending(void) >> { >> -if (!recalc_sigpending_tsk(current) && !freezing(current)) >> +if ((!recalc_sigpending_tsk(current) && !freezing(current)) || >> +((current->signal->flags & SIGNAL_GROUP_EXIT) && >> + !__fatal_signal_pending(current))) >> clear_thread_flag(TIF_SIGPENDING); >> - > Can we get rid of this suggestion to recalc_sigpending. The more I look > at it the more I am convinced it is not safe. In particular I believe > it is incompatible with dump_interrupted() in fs/coredump.c With your clear_thread_flag call in vhost_worker suggestion I don't need the above chunk. > > The code in fs/coredump.c is the closest code we have to what you are > trying to do with vhost_worker after the session is killed. It also > struggles with TIF_SIGPENDING getting set. >> } >> EXPORT_SYMBOL(recalc_sigpending); >> >> @@ -1053,6 +1054,17 @@ static void complete_signal(int sig, struct >> task_struct *p, enum pid_type type) >> signal->group_exit_code = sig; >> signal->group_stop_count = 0; >> t = p; >> +/* >> + * The signal is being short circuit delivered >> + * don't it pending. >> + */ >> +if (type != PIDTYPE_PID) { >> +struct sigpending *pending; >> + >> +pending = &t->signal->shared_pending; >> +sigdelset(&pending->signal, sig); >> +} >> + >> do { >> task_clear_jobctl_pending(t, >> JOBCTL_PENDING_MASK); >> sigaddset(&t->pending.signal, SIGKILL); >> >> >> Then get_signal() works like how Oleg mentioned it should earlier. > > I am puzzled it makes a difference as t->signal and p->signal should > point to the same thing, and in fact the code
Re: [PATCH 13/14] vhost: allow userspace to create workers
On 5/16/23 10:10 PM, Jason Wang wrote: > On Sat, Apr 29, 2023 at 12:32 AM Mike Christie > wrote: >> >> For vhost-scsi with 3 vqs or more and a workload that tries to use >> them in parallel like: >> >> fio --filename=/dev/sdb --direct=1 --rw=randrw --bs=4k \ >> --ioengine=libaio --iodepth=128 --numjobs=3 >> >> the single vhost worker thread will become a bottlneck and we are stuck >> at around 500K IOPs no matter how many jobs, virtqueues, and CPUs are >> used. >> >> To better utilize virtqueues and available CPUs, this patch allows >> userspace to create workers and bind them to vqs. You can have N workers >> per dev and also share N workers with M vqs on that dev. >> >> This patch adds the interface related code and the next patch will hook >> vhost-scsi into it. The patches do not try to hook net and vsock into >> the interface because: >> >> 1. multiple workers don't seem to help vsock. The problem is that with >> only 2 virtqueues we never fully use the existing worker when doing >> bidirectional tests. This seems to match vhost-scsi where we don't see >> the worker as a bottleneck until 3 virtqueues are used. >> >> 2. net already has a way to use multiple workers. >> >> Signed-off-by: Mike Christie >> --- >> drivers/vhost/vhost.c| 145 ++- >> drivers/vhost/vhost.h| 3 + >> include/uapi/linux/vhost.h | 33 +++ >> include/uapi/linux/vhost_types.h | 16 >> 4 files changed, 196 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c >> index 4b0b82292379..e8f829f35814 100644 >> --- a/drivers/vhost/vhost.c >> +++ b/drivers/vhost/vhost.c >> @@ -630,6 +630,80 @@ static struct vhost_worker *vhost_worker_create(struct >> vhost_dev *dev) >> return NULL; >> } >> >> +/* Caller must have device mutex */ >> +static void __vhost_vq_attach_worker(struct vhost_virtqueue *vq, >> +struct vhost_worker *worker) >> +{ >> + if (vq->worker) >> + vq->worker->attachment_cnt--; >> + worker->attachment_cnt++; >> + vq->worker = worker; >> +} >> + >> +/** >> + * vhost_vq_attach_worker - set a virtqueue's worker from an ioctl command >> + * @vq: the virtqueue we will set the worker for >> + * @info: the worker userspace has requested us to use >> + * >> + * We only allow userspace to set a virtqueue's worker if it's not active >> and >> + * polling is not enabled. > > I wonder if we can mandate this in the code like check the vq backend > in vhost_vq_work_queue(). I'll look into it. However, for the vsock case we actually do want to queue the work even though there is no backend set yet. It sort of just keeps queueing works until VHOST_VSOCK_SET_RUNNING is executed. When that's done it will run the works that have been queueing up. > > We also assume drivers supporting this will not be >> + * internally queueing works directly or via calls like vhost_dev_flush at >> + * this time. >> + * >> + * Caller must have device and virtqueue mutex. >> + */ >> +static int vhost_vq_attach_worker(struct vhost_virtqueue *vq, >> + struct vhost_vring_worker *info) >> +{ >> + unsigned long index = info->worker_id; >> + struct vhost_dev *dev = vq->dev; >> + struct vhost_worker *worker; >> + >> + if (!dev->use_worker) >> + return -EINVAL; >> + >> + if (vhost_vq_get_backend(vq) || vq->kick) > > It might be worthwhile to have a comment to explain why we need to > check vq->kick here. > > This also means the device should not queue work when the backend is NULL. > > But I found it is probably not the case for vsock, it calls > vhost_poll_queue() in vhost_transport_cancel_pkt() but > vhost_vsock_stop() doesn't wait before doing vhost_vq_set_backend(vq, > NULL); Yeah, there was another case where vhost_transport_send_pkt can call vhost_work_queue before the backend is set. I ended up doing the opt in method though. I did a test to convert vsock and a worker for the recv queue and one for the xmit queue doesn't help. It was like with vhost-scsi where with just 2 virtqueues, one worker could handle the load. So I thought there is no point in adding extra code for vsock if it wasn't going to be used. > > Net seems to be fine since it waits for ubufs to be completed in > vhost_net_se
Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
On 5/18/23 1:28 PM, Eric W. Biederman wrote: > Still the big issue seems to be the way get_signal is connected into > these threads so that it keeps getting called. Calling get_signal after > a fatal signal has been returned happens nowhere else and even if we fix > it today it is likely to lead to bugs in the future because whoever is > testing and updating the code is unlikely they have a vhost test case > the care about. > > diff --git a/kernel/signal.c b/kernel/signal.c > index 8f6330f0e9ca..4d54718cad36 100644 > --- a/kernel/signal.c > +++ b/kernel/signal.c > @@ -181,7 +181,9 @@ void recalc_sigpending_and_wake(struct task_struct *t) > > void recalc_sigpending(void) > { > - if (!recalc_sigpending_tsk(current) && !freezing(current)) > + if ((!recalc_sigpending_tsk(current) && !freezing(current)) || > + ((current->signal->flags & SIGNAL_GROUP_EXIT) && > + !__fatal_signal_pending(current))) > clear_thread_flag(TIF_SIGPENDING); > > } > @@ -1043,6 +1045,13 @@ static void complete_signal(int sig, struct > task_struct *p, enum pid_type type) > * This signal will be fatal to the whole group. > */ > if (!sig_kernel_coredump(sig)) { > + /* > +* The signal is being short circuit delivered > +* don't it pending. > +*/ > + if (type != PIDTYPE_PID) { > + sigdelset(&t->signal->shared_pending, sig); > + > /* > * Start a group exit and wake everybody up. > * This way we don't have other threads > If I change up your patch so the last part is moved down a bit to when we set t like this: diff --git a/kernel/signal.c b/kernel/signal.c index 0ac48c96ab04..c976a80650db 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -181,9 +181,10 @@ void recalc_sigpending_and_wake(struct task_struct *t) void recalc_sigpending(void) { - if (!recalc_sigpending_tsk(current) && !freezing(current)) + if ((!recalc_sigpending_tsk(current) && !freezing(current)) || + ((current->signal->flags & SIGNAL_GROUP_EXIT) && +!__fatal_signal_pending(current))) clear_thread_flag(TIF_SIGPENDING); - } EXPORT_SYMBOL(recalc_sigpending); @@ -1053,6 +1054,17 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type) signal->group_exit_code = sig; signal->group_stop_count = 0; t = p; + /* +* The signal is being short circuit delivered +* don't it pending. +*/ + if (type != PIDTYPE_PID) { + struct sigpending *pending; + + pending = &t->signal->shared_pending; + sigdelset(&pending->signal, sig); + } + do { task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK); sigaddset(&t->pending.signal, SIGKILL); Then get_signal() works like how Oleg mentioned it should earlier. For vhost I just need the code below which is just Linus's patch plus a call to get_signal() in vhost_worker() and the PF_IO_WORKER->PF_USER_WORKER change. Note that when we get SIGKILL, the vhost file_operations->release function is called via do_exit -> exit_files -> put_files_struct -> close_files and so the vhost release function starts to flush IO and stop the worker/vhost task. In vhost_worker() then we just handle those last completions for already running IO. When the vhost release function detects they are done it does vhost_task_stop() and vhost_worker() returns and then vhost_task_fn() does do_exit(). So we don't return immediately when get_signal() returns non-zero. So it works, but it sounds like you don't like vhost relying on the behavior, and it's non standard to use get_signal() like we are. So I'm not sure how we want to proceed. Maybe the safest is to revert: commit 6e890c5d5021ca7e69bbe203fde42447874d9a82 Author: Mike Christie Date: Fri Mar 10 16:03:32 2023 -0600 vhost: use vhost_tasks for worker threads and retry this for the next kernel when we can do proper testing and more code review? diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index a92af08e7864..1ba9e068b2ab 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -349,8 +349,16 @@ static int vhost
Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
On 5/18/23 11:25 AM, Oleg Nesterov wrote: > I too do not understand the 1st change in this patch ... > > On 05/18, Mike Christie wrote: >> >> In the other patches we do: >> >> if (get_signal(ksig)) >> start_exit_cleanup_by_stopping_newIO() >> flush running IO() >> exit() >> >> But to do the flush running IO() part of this I need to wait for it so >> that's why I wanted to be able to dequeue the SIGKILL and clear the >> TIF_SIGPENDING bit. > > But get_signal() will do what you need, dequeue SIGKILL and clear SIGPENDING ? > > if ((signal->flags & SIGNAL_GROUP_EXIT) || >signal->group_exec_task) { > clear_siginfo(&ksig->info); > ksig->info.si_signo = signr = SIGKILL; > sigdelset(¤t->pending.signal, SIGKILL); > > this "dequeues" SIGKILL, > > trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO, > &sighand->action[SIGKILL - 1]); > recalc_sigpending(); > > this clears TIF_SIGPENDING. > I see what you guys meant. TIF_SIGPENDING isn't getting cleared. I'll dig into why. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
On 5/18/23 3:08 AM, Christian Brauner wrote: > On Wed, May 17, 2023 at 07:09:13PM -0500, Mike Christie wrote: >> This has us deqeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is >> set when we are dealing with PF_USER_WORKER tasks. >> >> When a vhost_task gets a SIGKILL, we could have outstanding IO in flight. >> We can easily stop new work/IO from being queued to the vhost_task, but >> for IO that's already been sent to something like the block layer we >> need to wait for the response then process it. These type of IO >> completions use the vhost_task to process the completion so we can't >> exit immediately. >> >> We need to handle wait for then handle those completions from the >> vhost_task, but when we have a SIGKLL pending, functions like >> schedule() return immediately so we can't wait like normal. Functions >> like vhost_worker() degrade to just a while(1); loop. >> >> This patch has get_signal drop down to the normal code path when >> SIGNAL_GROUP_EXIT/group_exec_task is set so the caller can still detect >> there is a SIGKILL but still perform some blocking cleanup. >> >> Note that in that chunk I'm now bypassing that does: >> >> sigdelset(¤t->pending.signal, SIGKILL); >> >> we look to be ok, because in the places we set SIGNAL_GROUP_EXIT/ >> group_exec_task we are already doing that on the threads in the >> group. >> >> Signed-off-by: Mike Christie >> --- > > I think you just got confused by the original discussion that was split > into two separate threads: > > (1) The discussion based on your original proposal to adjust the signal > handling logic to accommodate vhost workers as they are right now. > That's where Oleg jumped in. > (2) My request - which you did in this series - of rewriting vhost > workers to behave more like io_uring workers. > > Both problems are orthogonal. The gist of my proposal is to avoid (1) by > doing (2). So the only change that's needed is > s/PF_IO_WORKER/PF_USER_WORKER/ which is pretty obvious as io_uring > workers and vhost workers no almost fully collapse into the same > concept. > > So forget (1). If additional signal patches are needed as discussed in > (1) then it must be because of a bug that would affect io_uring workers > today. I maybe didn't exactly misunderstand you. I did patch 1/8 to show issues I hit when I'm doing 2-8. See my reply to Eric's question about what I'm hitting and why the last part of the patch only did not work for me: https://lore.kernel.org/lkml/20230518000920.191583-2-michael.chris...@oracle.com/T/#mc6286d1a42c79761248ba55f1dd7a433379be6d1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
On 5/17/23 10:49 PM, Eric W. Biederman wrote: > > Long story short. > > In the patch below the first hunk is a noop. > > The code you are bypassing was added to ensure that process termination > (aka SIGKILL) is processed before any other signals. Other than signal > processing order there are not any substantive differences in the two > code paths. With all signals except SIGSTOP == 19 and SIGKILL == 9 > blocked SIGKILL should always be processed before SIGSTOP. > > Can you try patch with just the last hunk that does > s/PF_IO_WORKER/PF_USER_WORKER/ and see if that is enough? > If I just have the last hunk and then we get SIGKILL what happens is in code like: vhost_worker() schedule() if (has IO) handle_IO() The schedule() calls will hit the signal_pending_state check for signal_pending or __fatal_signal_pending and so instead of waiting for whatever wake_up call we normally waited for we tend to just return immediately. If you just run Qemu (the parent of the vhost_task) and send SIGKILL then sometimes the vhost_task just spins and it would look like the task has taken over the CPU (this is what I hit when I tested Linus's patch). With the first hunk of the patch, we will end up dequeuing the SIGKILL and clearing TIF_SIGPENDING, so the vhost_task can still do some work before it exits. In the other patches we do: if (get_signal(ksig)) start_exit_cleanup_by_stopping_newIO() flush running IO() exit() But to do the flush running IO() part of this I need to wait for it so that's why I wanted to be able to dequeue the SIGKILL and clear the TIF_SIGPENDING bit. Or I don't need this specifically. In patch 0/8 I said I knew you guys would not like it :) If I just have a: if (fatal_signal()) clear_fatal_signal() then it would work for me. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC PATCH 5/8] vhost: Add callback that stops new work and waits on running ones
On 5/18/23 9:18 AM, Christian Brauner wrote: >> @@ -352,12 +353,13 @@ static int vhost_worker(void *data) >> if (!node) { >> schedule(); >> /* >> - * When we get a SIGKILL our release function will >> - * be called. That will stop new IOs from being queued >> - * and check for outstanding cmd responses. It will then >> - * call vhost_task_stop to exit us. >> + * When we get a SIGKILL we kick off a work to >> + * run the driver's helper to stop new work and >> + * handle completions. When they are done they will >> + * call vhost_task_stop to tell us to exit. >> */ >> -vhost_task_get_signal(); >> +if (vhost_task_get_signal()) >> +schedule_work(&dev->destroy_worker); >> } > > I'm pretty sure you still need to actually call exit here. Basically > mirror what's done in io_worker_exit() minus the io specific bits. We do call do_exit(). Once destory_worker has flushed the device and all outstanding IO has completed it call vhost_task_stop(). vhost_worker() above then breaks out of the loop and returns and vhost_task_fn() does do_exit(). ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC PATCH 8/8] fork/vhost_task: remove no_files
On 5/17/23 7:09 PM, Mike Christie wrote: > + CLONE_THREAD | CLONE_FILES, CLONE_SIGHAND, Sorry. I tried to throw this one in last second so we could see that we can also see that we can now use CLONE_FILES like io_uring. It will of course not compile. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [RFC PATCH 2/8] vhost/vhost_task: Hook vhost layer into signal handler
On 5/17/23 7:16 PM, Linus Torvalds wrote: > On Wed, May 17, 2023 at 5:09 PM Mike Christie > wrote: >> >> + __set_current_state(TASK_RUNNING); >> + rc = get_signal(&ksig); >> + set_current_state(TASK_INTERRUPTIBLE); >> + return rc; > > The games with current_state seem nonsensical. > > What are they all about? get_signal() shouldn't care, and no other > caller does this thing. This just seems completely random. Sorry. It's a leftover. I was originally calling this from vhost_task_should_stop where before calling that function we do a: set_current_state(TASK_INTERRUPTIBLE); So, I was hitting get_signal->try_to_freeze->might_sleep->__might_sleep and was getting the "do not call blocking ops when !TASK_RUNNING" warnings. ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[RFC PATCH 8/8] fork/vhost_task: remove no_files
The vhost_task can now support the worker being freed from under the device when we get a SIGKILL or the process exits without closing devices. We no longer need no_files so this removes it. Signed-off-by: Mike Christie --- include/linux/sched/task.h | 1 - kernel/fork.c | 10 ++ kernel/vhost_task.c| 3 +-- 3 files changed, 3 insertions(+), 11 deletions(-) diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index 249a5ece9def..342fe297ffd4 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -28,7 +28,6 @@ struct kernel_clone_args { u32 kthread:1; u32 io_thread:1; u32 user_worker:1; - u32 no_files:1; u32 block_signals:1; unsigned long stack; unsigned long stack_size; diff --git a/kernel/fork.c b/kernel/fork.c index 9e04ab5c3946..f2c081c15efb 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -1769,8 +1769,7 @@ static int copy_fs(unsigned long clone_flags, struct task_struct *tsk) return 0; } -static int copy_files(unsigned long clone_flags, struct task_struct *tsk, - int no_files) +static int copy_files(unsigned long clone_flags, struct task_struct *tsk) { struct files_struct *oldf, *newf; int error = 0; @@ -1782,11 +1781,6 @@ static int copy_files(unsigned long clone_flags, struct task_struct *tsk, if (!oldf) goto out; - if (no_files) { - tsk->files = NULL; - goto out; - } - if (clone_flags & CLONE_FILES) { atomic_inc(&oldf->count); goto out; @@ -2488,7 +2482,7 @@ __latent_entropy struct task_struct *copy_process( retval = copy_semundo(clone_flags, p); if (retval) goto bad_fork_cleanup_security; - retval = copy_files(clone_flags, p, args->no_files); + retval = copy_files(clone_flags, p); if (retval) goto bad_fork_cleanup_semundo; retval = copy_fs(clone_flags, p); diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c index a11f036290cc..642047765190 100644 --- a/kernel/vhost_task.c +++ b/kernel/vhost_task.c @@ -96,12 +96,11 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg, { struct kernel_clone_args args = { .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM | - CLONE_THREAD | CLONE_SIGHAND, + CLONE_THREAD | CLONE_FILES, CLONE_SIGHAND, .exit_signal= 0, .fn = vhost_task_fn, .name = name, .user_worker= 1, - .no_files = 1, .block_signals = 1, }; struct vhost_task *vtsk; -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[RFC PATCH 6/8] vhost-scsi: Add callback to stop and wait on works
This moves the scsi code we use to stop new works from being queued and wait on running works to a helper which is used by the vhost layer when the vhost_task is being killed by a SIGKILL. Signed-off-by: Mike Christie --- drivers/vhost/scsi.c | 23 +++ 1 file changed, 15 insertions(+), 8 deletions(-) diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c index 40f9135e1a62..a0f2588270f2 100644 --- a/drivers/vhost/scsi.c +++ b/drivers/vhost/scsi.c @@ -1768,6 +1768,19 @@ static int vhost_scsi_set_features(struct vhost_scsi *vs, u64 features) return 0; } +static void vhost_scsi_stop_dev_work(struct vhost_dev *dev) +{ + struct vhost_scsi *vs = container_of(dev, struct vhost_scsi, dev); + struct vhost_scsi_target t; + + mutex_lock(&vs->dev.mutex); + memcpy(t.vhost_wwpn, vs->vs_vhost_wwpn, sizeof(t.vhost_wwpn)); + mutex_unlock(&vs->dev.mutex); + vhost_scsi_clear_endpoint(vs, &t); + vhost_dev_stop(&vs->dev); + vhost_dev_cleanup(&vs->dev); +} + static int vhost_scsi_open(struct inode *inode, struct file *f) { struct vhost_scsi *vs; @@ -1821,7 +1834,7 @@ static int vhost_scsi_open(struct inode *inode, struct file *f) vs->vqs[i].vq.handle_kick = vhost_scsi_handle_kick; } vhost_dev_init(&vs->dev, vqs, nvqs, UIO_MAXIOV, VHOST_SCSI_WEIGHT, 0, - true, NULL, NULL); + true, NULL, vhost_scsi_stop_dev_work); vhost_scsi_init_inflight(vs, NULL); @@ -1843,14 +1856,8 @@ static int vhost_scsi_open(struct inode *inode, struct file *f) static int vhost_scsi_release(struct inode *inode, struct file *f) { struct vhost_scsi *vs = f->private_data; - struct vhost_scsi_target t; - mutex_lock(&vs->dev.mutex); - memcpy(t.vhost_wwpn, vs->vs_vhost_wwpn, sizeof(t.vhost_wwpn)); - mutex_unlock(&vs->dev.mutex); - vhost_scsi_clear_endpoint(vs, &t); - vhost_dev_stop(&vs->dev); - vhost_dev_cleanup(&vs->dev); + vhost_dev_stop_work(&vs->dev); kfree(vs->dev.vqs); kfree(vs->vqs); kfree(vs->old_inflight); -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[RFC PATCH 7/8] vhost-net: Add callback to stop and wait on works
This moves the net code we use to stop new works from being queued and wait on running works to a helper which is used by the vhost layer when the vhost_task is being killed by a SIGKILL. Signed-off-by: Mike Christie --- drivers/vhost/net.c | 13 ++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 90c25127b3f8..f8a5527b15ba 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -1325,9 +1325,9 @@ static void vhost_net_flush(struct vhost_net *n) } } -static int vhost_net_release(struct inode *inode, struct file *f) +static void vhost_net_stop_dev_work(struct vhost_dev *dev) { - struct vhost_net *n = f->private_data; + struct vhost_net *n = container_of(dev, struct vhost_net, dev); struct socket *tx_sock; struct socket *rx_sock; @@ -1345,6 +1345,13 @@ static int vhost_net_release(struct inode *inode, struct file *f) /* We do an extra flush before freeing memory, * since jobs can re-queue themselves. */ vhost_net_flush(n); +} + +static int vhost_net_release(struct inode *inode, struct file *f) +{ + struct vhost_net *n = f->private_data; + + vhost_dev_stop_work(&n->dev); kfree(n->vqs[VHOST_NET_VQ_RX].rxq.queue); kfree(n->vqs[VHOST_NET_VQ_TX].xdp); kfree(n->dev.vqs); @@ -1409,7 +1416,7 @@ static int vhost_net_open(struct inode *inode, struct file *f) vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX, UIO_MAXIOV + VHOST_NET_BATCH, VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true, - NULL, NULL); + NULL, vhost_net_stop_dev_work); vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev); vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev); -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[RFC PATCH 5/8] vhost: Add callback that stops new work and waits on running ones
When the vhost_task gets a SIGKILL we want to stop new work from being queued and also wait for and handle completions for running work. For the latter, we still need to use the vhost_task to handle the completing work so we can't just exit right away. But, this has us kick off the stopping and flushing/stopping of the device/vhost_task/worker to the system work queue while the vhost_task handles completions. When all completions are done we will then do vhost_task_stop and we will exit. Signed-off-by: Mike Christie --- drivers/vhost/net.c | 2 +- drivers/vhost/scsi.c | 4 ++-- drivers/vhost/test.c | 3 ++- drivers/vhost/vdpa.c | 2 +- drivers/vhost/vhost.c | 48 --- drivers/vhost/vhost.h | 10 - drivers/vhost/vsock.c | 4 ++-- 7 files changed, 58 insertions(+), 15 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 8557072ff05e..90c25127b3f8 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -1409,7 +1409,7 @@ static int vhost_net_open(struct inode *inode, struct file *f) vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX, UIO_MAXIOV + VHOST_NET_BATCH, VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true, - NULL); + NULL, NULL); vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev); vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev); diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c index bb10fa4bb4f6..40f9135e1a62 100644 --- a/drivers/vhost/scsi.c +++ b/drivers/vhost/scsi.c @@ -1820,8 +1820,8 @@ static int vhost_scsi_open(struct inode *inode, struct file *f) vqs[i] = &vs->vqs[i].vq; vs->vqs[i].vq.handle_kick = vhost_scsi_handle_kick; } - vhost_dev_init(&vs->dev, vqs, nvqs, UIO_MAXIOV, - VHOST_SCSI_WEIGHT, 0, true, NULL); + vhost_dev_init(&vs->dev, vqs, nvqs, UIO_MAXIOV, VHOST_SCSI_WEIGHT, 0, + true, NULL, NULL); vhost_scsi_init_inflight(vs, NULL); diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c index 42c955a5b211..11a2823d7532 100644 --- a/drivers/vhost/test.c +++ b/drivers/vhost/test.c @@ -120,7 +120,8 @@ static int vhost_test_open(struct inode *inode, struct file *f) vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ]; n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick; vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV, - VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT, true, NULL); + VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT, true, NULL, + NULL); f->private_data = n; diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c index 8c1aefc865f0..de9a83ecb70d 100644 --- a/drivers/vhost/vdpa.c +++ b/drivers/vhost/vdpa.c @@ -1279,7 +1279,7 @@ static int vhost_vdpa_open(struct inode *inode, struct file *filep) vqs[i]->handle_kick = handle_vq_kick; } vhost_dev_init(dev, vqs, nvqs, 0, 0, 0, false, - vhost_vdpa_process_iotlb_msg); + vhost_vdpa_process_iotlb_msg, NULL); r = vhost_vdpa_alloc_domain(v); if (r) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 1ba9e068b2ab..4163c86db50c 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -336,6 +336,7 @@ static void vhost_vq_reset(struct vhost_dev *dev, static int vhost_worker(void *data) { struct vhost_worker *worker = data; + struct vhost_dev *dev = worker->dev; struct vhost_work *work, *work_next; struct llist_node *node; @@ -352,12 +353,13 @@ static int vhost_worker(void *data) if (!node) { schedule(); /* -* When we get a SIGKILL our release function will -* be called. That will stop new IOs from being queued -* and check for outstanding cmd responses. It will then -* call vhost_task_stop to exit us. +* When we get a SIGKILL we kick off a work to +* run the driver's helper to stop new work and +* handle completions. When they are done they will +* call vhost_task_stop to tell us to exit. */ - vhost_task_get_signal(); + if (vhost_task_get_signal()) + schedule_work(&dev->destroy_worker); } node = llist_reverse_order(node); @@ -376,6 +378,33 @@ static int vhost_worker(void *data) return 0; } +static void __vhost_dev_stop_work(struct vhost_dev *dev) +{ + mutex_lock(&dev->stop_work_mutex); +
[RFC PATCH 4/8] vhost-net: Move vhost_net_open
This moves vhost_net_open so in the next patches we can pass vhost_dev_init a new helper which will use the stop/flush functions. There is no functionality changes in this patch. Signed-off-by: Mike Christie --- drivers/vhost/net.c | 134 ++-- 1 file changed, 67 insertions(+), 67 deletions(-) diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index 07181cd8d52e..8557072ff05e 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -1285,73 +1285,6 @@ static void handle_rx_net(struct vhost_work *work) handle_rx(net); } -static int vhost_net_open(struct inode *inode, struct file *f) -{ - struct vhost_net *n; - struct vhost_dev *dev; - struct vhost_virtqueue **vqs; - void **queue; - struct xdp_buff *xdp; - int i; - - n = kvmalloc(sizeof *n, GFP_KERNEL | __GFP_RETRY_MAYFAIL); - if (!n) - return -ENOMEM; - vqs = kmalloc_array(VHOST_NET_VQ_MAX, sizeof(*vqs), GFP_KERNEL); - if (!vqs) { - kvfree(n); - return -ENOMEM; - } - - queue = kmalloc_array(VHOST_NET_BATCH, sizeof(void *), - GFP_KERNEL); - if (!queue) { - kfree(vqs); - kvfree(n); - return -ENOMEM; - } - n->vqs[VHOST_NET_VQ_RX].rxq.queue = queue; - - xdp = kmalloc_array(VHOST_NET_BATCH, sizeof(*xdp), GFP_KERNEL); - if (!xdp) { - kfree(vqs); - kvfree(n); - kfree(queue); - return -ENOMEM; - } - n->vqs[VHOST_NET_VQ_TX].xdp = xdp; - - dev = &n->dev; - vqs[VHOST_NET_VQ_TX] = &n->vqs[VHOST_NET_VQ_TX].vq; - vqs[VHOST_NET_VQ_RX] = &n->vqs[VHOST_NET_VQ_RX].vq; - n->vqs[VHOST_NET_VQ_TX].vq.handle_kick = handle_tx_kick; - n->vqs[VHOST_NET_VQ_RX].vq.handle_kick = handle_rx_kick; - for (i = 0; i < VHOST_NET_VQ_MAX; i++) { - n->vqs[i].ubufs = NULL; - n->vqs[i].ubuf_info = NULL; - n->vqs[i].upend_idx = 0; - n->vqs[i].done_idx = 0; - n->vqs[i].batched_xdp = 0; - n->vqs[i].vhost_hlen = 0; - n->vqs[i].sock_hlen = 0; - n->vqs[i].rx_ring = NULL; - vhost_net_buf_init(&n->vqs[i].rxq); - } - vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX, - UIO_MAXIOV + VHOST_NET_BATCH, - VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true, - NULL); - - vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev); - vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev); - - f->private_data = n; - n->page_frag.page = NULL; - n->refcnt_bias = 0; - - return 0; -} - static struct socket *vhost_net_stop_vq(struct vhost_net *n, struct vhost_virtqueue *vq) { @@ -1421,6 +1354,73 @@ static int vhost_net_release(struct inode *inode, struct file *f) return 0; } +static int vhost_net_open(struct inode *inode, struct file *f) +{ + struct vhost_net *n; + struct vhost_dev *dev; + struct vhost_virtqueue **vqs; + void **queue; + struct xdp_buff *xdp; + int i; + + n = kvmalloc(sizeof *n, GFP_KERNEL | __GFP_RETRY_MAYFAIL); + if (!n) + return -ENOMEM; + vqs = kmalloc_array(VHOST_NET_VQ_MAX, sizeof(*vqs), GFP_KERNEL); + if (!vqs) { + kvfree(n); + return -ENOMEM; + } + + queue = kmalloc_array(VHOST_NET_BATCH, sizeof(void *), + GFP_KERNEL); + if (!queue) { + kfree(vqs); + kvfree(n); + return -ENOMEM; + } + n->vqs[VHOST_NET_VQ_RX].rxq.queue = queue; + + xdp = kmalloc_array(VHOST_NET_BATCH, sizeof(*xdp), GFP_KERNEL); + if (!xdp) { + kfree(vqs); + kvfree(n); + kfree(queue); + return -ENOMEM; + } + n->vqs[VHOST_NET_VQ_TX].xdp = xdp; + + dev = &n->dev; + vqs[VHOST_NET_VQ_TX] = &n->vqs[VHOST_NET_VQ_TX].vq; + vqs[VHOST_NET_VQ_RX] = &n->vqs[VHOST_NET_VQ_RX].vq; + n->vqs[VHOST_NET_VQ_TX].vq.handle_kick = handle_tx_kick; + n->vqs[VHOST_NET_VQ_RX].vq.handle_kick = handle_rx_kick; + for (i = 0; i < VHOST_NET_VQ_MAX; i++) { + n->vqs[i].ubufs = NULL; + n->vqs[i].ubuf_info = NULL; + n->vqs[i].upend_idx = 0; + n->vqs[i].done_idx = 0; + n->vqs[i].batched_xdp = 0; + n->vqs[i].vhost_hlen = 0; + n->vqs[i].sock_hlen = 0; + n->vqs[i].rx_ring = NULL; + vhost_
[RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set
This has us deqeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set when we are dealing with PF_USER_WORKER tasks. When a vhost_task gets a SIGKILL, we could have outstanding IO in flight. We can easily stop new work/IO from being queued to the vhost_task, but for IO that's already been sent to something like the block layer we need to wait for the response then process it. These type of IO completions use the vhost_task to process the completion so we can't exit immediately. We need to handle wait for then handle those completions from the vhost_task, but when we have a SIGKLL pending, functions like schedule() return immediately so we can't wait like normal. Functions like vhost_worker() degrade to just a while(1); loop. This patch has get_signal drop down to the normal code path when SIGNAL_GROUP_EXIT/group_exec_task is set so the caller can still detect there is a SIGKILL but still perform some blocking cleanup. Note that in that chunk I'm now bypassing that does: sigdelset(¤t->pending.signal, SIGKILL); we look to be ok, because in the places we set SIGNAL_GROUP_EXIT/ group_exec_task we are already doing that on the threads in the group. Signed-off-by: Mike Christie --- kernel/signal.c | 19 ++- 1 file changed, 14 insertions(+), 5 deletions(-) diff --git a/kernel/signal.c b/kernel/signal.c index 8f6330f0e9ca..ae4972eea5db 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2705,9 +2705,18 @@ bool get_signal(struct ksignal *ksig) struct k_sigaction *ka; enum pid_type type; - /* Has this task already been marked for death? */ - if ((signal->flags & SIGNAL_GROUP_EXIT) || -signal->group_exec_task) { + /* +* Has this task already been marked for death? +* +* If this is a PF_USER_WORKER then the task may need to do +* extra work that requires waiting on running work, so we want +* to dequeue the signal below and tell the caller its time to +* start its exit procedure. When the work has completed then +* the task will exit. +*/ + if (!(current->flags & PF_USER_WORKER) && + ((signal->flags & SIGNAL_GROUP_EXIT) || +signal->group_exec_task)) { clear_siginfo(&ksig->info); ksig->info.si_signo = signr = SIGKILL; sigdelset(¤t->pending.signal, SIGKILL); @@ -2861,11 +2870,11 @@ bool get_signal(struct ksignal *ksig) } /* -* PF_IO_WORKER threads will catch and exit on fatal signals +* PF_USER_WORKER threads will catch and exit on fatal signals * themselves. They have cleanup that must be performed, so * we cannot call do_exit() on their behalf. */ - if (current->flags & PF_IO_WORKER) + if (current->flags & PF_USER_WORKER) goto out; /* -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[RFC PATCH 2/8] vhost/vhost_task: Hook vhost layer into signal handler
This patch has vhost use get_signal to handle freezing and sort of handle signals. By the latter I mean that when we get SIGKILL, our parent will exit and call our file_operatons release function. That will then stop new work from breing queued and wait for the vhost_task to handle completions for running IO. We then exit when those are done. The next patches will then have us work more like io_uring where we handle the get_signal return value and key off that to cleanup. Signed-off-by: Mike Christie --- drivers/vhost/vhost.c| 10 +- include/linux/sched/vhost_task.h | 1 + kernel/vhost_task.c | 20 3 files changed, 30 insertions(+), 1 deletion(-) diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index a92af08e7864..1ba9e068b2ab 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -349,8 +349,16 @@ static int vhost_worker(void *data) } node = llist_del_all(&worker->work_list); - if (!node) + if (!node) { schedule(); + /* +* When we get a SIGKILL our release function will +* be called. That will stop new IOs from being queued +* and check for outstanding cmd responses. It will then +* call vhost_task_stop to exit us. +*/ + vhost_task_get_signal(); + } node = llist_reverse_order(node); /* make sure flag is seen after deletion */ diff --git a/include/linux/sched/vhost_task.h b/include/linux/sched/vhost_task.h index 6123c10b99cf..54b68115eb3b 100644 --- a/include/linux/sched/vhost_task.h +++ b/include/linux/sched/vhost_task.h @@ -19,5 +19,6 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg, void vhost_task_start(struct vhost_task *vtsk); void vhost_task_stop(struct vhost_task *vtsk); bool vhost_task_should_stop(struct vhost_task *vtsk); +bool vhost_task_get_signal(void); #endif diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c index b7cbd66f889e..a661cfa32ba3 100644 --- a/kernel/vhost_task.c +++ b/kernel/vhost_task.c @@ -61,6 +61,26 @@ bool vhost_task_should_stop(struct vhost_task *vtsk) } EXPORT_SYMBOL_GPL(vhost_task_should_stop); +/** + * vhost_task_get_signal - Check if there are pending signals + * + * Return true if we got SIGKILL. + */ +bool vhost_task_get_signal(void) +{ + struct ksignal ksig; + bool rc; + + if (!signal_pending(current)) + return false; + + __set_current_state(TASK_RUNNING); + rc = get_signal(&ksig); + set_current_state(TASK_INTERRUPTIBLE); + return rc; +} +EXPORT_SYMBOL_GPL(vhost_task_get_signal); + /** * vhost_task_create - create a copy of a process to be used by the kernel * @fn: thread stack -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[RFC PATCH 3/8] fork/vhost_task: Switch to CLONE_THREAD and CLONE_SIGHAND
This is a modified version of Linus's patch which has vhost_task use CLONE_THREAD and CLONE_SIGHAND and allow SIGKILL and SIGSTOP. I renamed the ignore_signals to block_signals based on Linus's comment where it aligns with what we are doing with the siginitsetinv p->blocked use and no longer calling ignore_signals. Signed-off-by: Mike Christie --- include/linux/sched/task.h | 2 +- kernel/fork.c | 12 +++- kernel/vhost_task.c| 5 +++-- 3 files changed, 7 insertions(+), 12 deletions(-) diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h index 537cbf9a2ade..249a5ece9def 100644 --- a/include/linux/sched/task.h +++ b/include/linux/sched/task.h @@ -29,7 +29,7 @@ struct kernel_clone_args { u32 io_thread:1; u32 user_worker:1; u32 no_files:1; - u32 ignore_signals:1; + u32 block_signals:1; unsigned long stack; unsigned long stack_size; unsigned long tls; diff --git a/kernel/fork.c b/kernel/fork.c index ed4e01daccaa..9e04ab5c3946 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -2338,14 +2338,10 @@ __latent_entropy struct task_struct *copy_process( p->flags |= PF_KTHREAD; if (args->user_worker) p->flags |= PF_USER_WORKER; - if (args->io_thread) { - /* -* Mark us an IO worker, and block any signal that isn't -* fatal or STOP -*/ + if (args->io_thread) p->flags |= PF_IO_WORKER; + if (args->block_signals) siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP)); - } if (args->name) strscpy_pad(p->comm, args->name, sizeof(p->comm)); @@ -2517,9 +2513,6 @@ __latent_entropy struct task_struct *copy_process( if (retval) goto bad_fork_cleanup_io; - if (args->ignore_signals) - ignore_signals(p); - stackleak_task_init(p); if (pid != &init_struct_pid) { @@ -2861,6 +2854,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node) .fn_arg = arg, .io_thread = 1, .user_worker= 1, + .block_signals = 1, }; return copy_process(NULL, 0, node, &args); diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c index a661cfa32ba3..a11f036290cc 100644 --- a/kernel/vhost_task.c +++ b/kernel/vhost_task.c @@ -95,13 +95,14 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void *arg, const char *name) { struct kernel_clone_args args = { - .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM, + .flags = CLONE_FS | CLONE_UNTRACED | CLONE_VM | + CLONE_THREAD | CLONE_SIGHAND, .exit_signal= 0, .fn = vhost_task_fn, .name = name, .user_worker= 1, .no_files = 1, - .ignore_signals = 1, + .block_signals = 1, }; struct vhost_task *vtsk; struct task_struct *tsk; -- 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization