from:"Mike Christie"

Re: [PATCH v11 8/8] vhost: use vhost_tasks for worker threads

2023-08-11 Thread Mike Christie

On 8/10/23 1:57 PM, Michael S. Tsirkin wrote:
> On Sat, Jul 22, 2023 at 11:03:29PM -0500, michael.chris...@oracle.com wrote:
>> On 7/20/23 8:06 AM, Michael S. Tsirkin wrote:
>>> On Thu, Feb 02, 2023 at 05:25:17PM -0600, Mike Christie wrote:
>>>> For vhost workers we use the kthread API which inherit's its values from
>>>> and checks against the kthreadd thread. This results in the wrong RLIMITs
>>>> being checked, so while tools like libvirt try to control the number of
>>>> threads based on the nproc rlimit setting we can end up creating more
>>>> threads than the user wanted.
>>>>
>>>> This patch has us use the vhost_task helpers which will inherit its
>>>> values/checks from the thread that owns the device similar to if we did
>>>> a clone in userspace. The vhost threads will now be counted in the nproc
>>>> rlimits. And we get features like cgroups and mm sharing automatically,
>>>> so we can remove those calls.
>>>>
>>>> Signed-off-by: Mike Christie 
>>>> Acked-by: Michael S. Tsirkin 
>>>
>>>
>>> Hi Mike,
>>> So this seems to have caused a measureable regression in networking
>>> performance (about 30%). Take a look here, and there's a zip file
>>> with detailed measuraments attached:
>>>
>>> https://bugzilla.redhat.com/show_bug.cgi?id=603
>>>
>>>
>>> Could you take a look please?
>>> You can also ask reporter questions there assuming you
>>> have or can create a (free) account.
>>>
>>
>> Sorry for the late reply. I just got home from vacation.
>>
>> The account creation link seems to be down. I keep getting a
>> "unable to establish SMTP connection to bz-exim-prod port 25 " error.
>>
>> Can you give me Quan's email?
>>
>> I think I can replicate the problem. I just need some extra info from Quan:
>>
>> 1. Just double check that they are using RHEL 9 on the host running the VMs.
>> 2. The kernel config
>> 3. Any tuning that was done. Is tuned running in guest and/or host running 
>> the
>> VMs and what profile is being used in each.
>> 4. Number of vCPUs and virtqueues being used.
>> 5. Can they dump the contents of:
>>
>> /sys/kernel/debug/sched
>>
>> and
>>
>> sysctl  -a
>>
>> on the host running the VMs.
>>
>> 6. With the 6.4 kernel, can they also run a quick test and tell me if they 
>> set
>> the scheduler to batch:
>>
>> ps -T -o comm,pid,tid $QEMU_THREAD
>>
>> then for each vhost thread do:
>>
>> chrt -b -p 0 $VHOST_THREAD
>>
>> Does that end up increasing perf? When I do this I see throughput go up by
>> around 50% vs 6.3 when sessions was 16 or more (16 was the number of vCPUs
>> and virtqueues per net device in the VM). Note that I'm not saying that is a 
>> fix.
>> It's just a difference I noticed when running some other tests.
> 
> 
> Mike I'm unsure what to do at this point. Regressions are not nice
> but if the kernel is released with the new userspace api we won't
> be able to revert. So what's the plan?
> 

I'm sort of stumped. I still can't replicate the problem out of the box. 6.3 and
6.4 perform the same for me. I've tried your setup and settings and with 
different
combos of using things like tuned and irqbalance.

I can sort of force the issue. In 6.4, the vhost thread inherits it's settings
from the parent thread. In 6.3, the vhost thread inherits from kthreadd and we
would then reset the sched settings. So in 6.4 if I just tune the parent 
differently
I can cause different performance. If we want the 6.3 behavior we can do the 
patch
below.

However, I don't think you guys are hitting this because you are just running
qemu from the normal shell and were not doing anything fancy with the sched
settings.


diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index da35e5b7f047..f2c2638d1106 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -2,6 +2,7 @@
 /*
  * Copyright (C) 2021 Oracle Corporation
  */
+#include 
 #include 
 #include 
 #include 
@@ -22,9 +23,16 @@ struct vhost_task {
 
 static int vhost_task_fn(void *data)
 {
+   static const struct sched_param param = { .sched_priority = 0 };
struct vhost_task *vtsk = data;
bool dead = false;
 
+   /*
+* Don't inherit the parent's sched info, so we maintain compat from
+* when we used kthreads and it reset this info.
+*/
+   sched_setscheduler_nocheck(current, SCHED_NORMAL, ¶m);
+
for (;;) {
bool did_work;
 






___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH 1/1] MAINTAINERS: add vhost-scsi entry and myself as a co-maintainer

2023-07-15 Thread Mike Christie

I've been doing a lot of the development on vhost-scsi the last couple of
years, so per Michael T's suggestion this adds me as co-maintainer.

Signed-off-by: Mike Christie 
---
 MAINTAINERS | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 3be1bdfe8ecc..2c4a8a860ae0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -22458,7 +22458,6 @@ L:  virtualization@lists.linux-foundation.org
 S: Maintained
 F: drivers/block/virtio_blk.c
 F: drivers/scsi/virtio_scsi.c
-F: drivers/vhost/scsi.c
 F: include/uapi/linux/virtio_blk.h
 F: include/uapi/linux/virtio_scsi.h
 
@@ -22557,6 +22556,16 @@ F: include/linux/vhost_iotlb.h
 F: include/uapi/linux/vhost.h
 F: kernel/vhost_task.c
 
+VIRTIO HOST (VHOST-SCSI)
+M: "Michael S. Tsirkin" 
+M: Jason Wang 
+M: Mike Christie 
+R: Paolo Bonzini 
+R: Stefan Hajnoczi 
+L: virtualization@lists.linux-foundation.org
+S: Maintained
+F: drivers/vhost/scsi.c
+
 VIRTIO I2C DRIVER
 M: Conghui Chen 
 M: Viresh Kumar 
-- 
2.34.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH v2 0/2] vhost-scsi: Fix IO hangs when using windows

2023-07-12 Thread Mike Christie

On 7/12/23 9:26 AM, Stefan Hajnoczi wrote:
> On Tue, Jul 11, 2023 at 04:01:22PM -0500, Mike Christie wrote:
>> On 7/11/23 1:34 PM, Stefan Hajnoczi wrote:
>>> On Sun, Jul 09, 2023 at 03:28:57PM -0500, Mike Christie wrote:
>>>> The following patches were made over Linus's tree and fix an issue
>>>> where windows guests will send iovecs with offset/lengths that result
>>>> in IOs that are not aligned to 512. The LIO layer will then send them
>>>> to Linux's FS/block layer but it requires 512 byte alignment, so
>>>> depending on the FS/block driver being used we will get IO errors or
>>>> hung IO.
>>>>
>>>> The following patches have vhost-scsi detect when windows sends these
>>>> IOs and copy them to a bounce buffer. It then does some cleanup in
>>>> the related code.
>>>
>>> Hang on, virtio-scsi is a SCSI HBA and READs/WRITEs submitted must
>>> follow the usual constraints on SCSI block limits. Would Windows send
>>> mis-aligned I/O to a non-virtio-scsi SCSI HBA?
>>
>> It's like linux where you can config settings like that.
>>
>>>> Are you sure this is not a bug in the Windows guest driver where block
>>> limits are being misconfigured?
>>
>> From what our windows dev told us the guest drivers like here:
>>
>> https://github.com/virtio-win
>>
>> don't set the windows AlignmentMask to 512. They tried that and it
>> resulted in windows crash dump crashing because it doesn't like the
>> hard alignment requirement.
>>
>> We thought other apps would have trouble as well, so we tried to add
>> bounce buffer support to the windows driver, but I think people thought
>> it was going to be uglier than this patch and in the normal alignment
>> case might also affect performance. There was some windows driver/layering
>> and buffer/cmd details that I don't fully understand and took their word
>> for because I don't know a lot about windows.
>>
>> In the end we still have to add checks to vhost-scsi to protect against
>> bad drivers, so we thought we might as well just add bounce buffer support
>> to vhost-scsi.
> 
> CCing virtio-win developers so they can confirm how the vioscsi driver
> is supposed to handle request alignment.
> 
> My expectation is that the virtio-scsi device will fail mis-aligned I/O
> requests.

I don't think you can just change the driver's behavior to fail now,
because apps send mis-aligned IO and its working as long as they have less
than 256 bio vecs.

We see mis-aligned IOs during boot and also from random non window's apps.
If we just start to fail then it would be a regression when the app no
longer works or the OS fails to start up.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH v2 0/2] vhost-scsi: Fix IO hangs when using windows

2023-07-11 Thread Mike Christie

On 7/11/23 1:34 PM, Stefan Hajnoczi wrote:
> On Sun, Jul 09, 2023 at 03:28:57PM -0500, Mike Christie wrote:
>> The following patches were made over Linus's tree and fix an issue
>> where windows guests will send iovecs with offset/lengths that result
>> in IOs that are not aligned to 512. The LIO layer will then send them
>> to Linux's FS/block layer but it requires 512 byte alignment, so
>> depending on the FS/block driver being used we will get IO errors or
>> hung IO.
>>
>> The following patches have vhost-scsi detect when windows sends these
>> IOs and copy them to a bounce buffer. It then does some cleanup in
>> the related code.
> 
> Hang on, virtio-scsi is a SCSI HBA and READs/WRITEs submitted must
> follow the usual constraints on SCSI block limits. Would Windows send
> mis-aligned I/O to a non-virtio-scsi SCSI HBA?

It's like linux where you can config settings like that.

> > Are you sure this is not a bug in the Windows guest driver where block
> limits are being misconfigured?

>From what our windows dev told us the guest drivers like here:

https://github.com/virtio-win

don't set the windows AlignmentMask to 512. They tried that and it
resulted in windows crash dump crashing because it doesn't like the
hard alignment requirement.

We thought other apps would have trouble as well, so we tried to add
bounce buffer support to the windows driver, but I think people thought
it was going to be uglier than this patch and in the normal alignment
case might also affect performance. There was some windows driver/layering
and buffer/cmd details that I don't fully understand and took their word
for because I don't know a lot about windows.

In the end we still have to add checks to vhost-scsi to protect against
bad drivers, so we thought we might as well just add bounce buffer support
to vhost-scsi.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH] Revert "virtio-scsi: Send "REPORTED LUNS CHANGED" sense data upon disk hotplug events"

2023-07-11 Thread Mike Christie

What was the issue you are seeing?

Was it something like you get the UA. We retry then on one of the
retries the sense is not setup correctly, so the scsi error handler
runs? That fails and the device goes offline?

If you turn on scsi debugging you would see:


[  335.445922] sd 0:0:0:0: [sda] tag#15 Add. Sense: Reported luns data has 
changed
[  335.445922] sd 0:0:0:0: [sda] tag#16 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00
[  335.445925] sd 0:0:0:0: [sda] tag#16 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00
[  335.445929] sd 0:0:0:0: [sda] tag#17 Done: FAILED Result: hostbyte=DID_OK 
driverbyte=DRIVER_OK cmd_age=0s
[  335.445932] sd 0:0:0:0: [sda] tag#17 CDB: Write(10) 2a 00 00 db 4f c0 00 00 
20 00
[  335.445934] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00
[  335.445936] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00
[  335.445938] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00
[  335.445940] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00
[  335.445942] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00
[  335.445945] sd 0:0:0:0: [sda] tag#17 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00
[  335.451447] scsi host0: scsi_eh_0: waking up 0/2/2
[  335.451453] scsi host0: Total of 2 commands on 1 devices require eh work
[  335.451457] sd 0:0:0:0: [sda] tag#16 scsi_eh_0: requesting sense


I don't know the qemu scsi code well, but I scanned the code for my co-worker
and my guess was commit 8cc5583abe6419e7faaebc9fbd109f34f4c850f2 had a race in 
it.

How is locking done? when it is a bus level UA but there are multiple devices
on the bus?

Is it possible, devA is clearing the sense on devB. For example, thread1 for
devA  is doing scsi_clear_unit_attention but then thread2 for devB has seen that
bus->unit_attention so it set req ops to reqops_unit_attention. But when
we run reqops_unit_attention.send_command scsi_unit_attention does not see
req->bus->unit_attention set anymore so we get a CC with no sense.

If the linux kernel scsi layer sees a CC with no sense then we fire the SCSI
error handler like seen above in the logs.


On 7/11/23 12:06 PM, Stefano Garzarella wrote:
> CCing `./scripts/get_maintainer.pl -f drivers/scsi/virtio_scsi.c`,
> since I found a few things in the virtio-scsi driver...
> 
> FYI we have seen that Linux has problems with a QEMU patch for the
> virtio-scsi device (details at the bottom of this email in the revert
> commit message and BZ).
> 
> 
> This is what I found when I looked at the Linux code:
> 
> In scsi_report_sense() in linux/drivers/scsi/scsi_error.c linux calls
> scsi_report_lun_change() that set `sdev_target->expecting_lun_change =
> 1` when we receive a UNIT ATTENTION with REPORT LUNS CHANGED
> (sshdr->asc == 0x3f && sshdr->ascq == 0x0e).
> 
> When `sdev_target->expecting_lun_change = 1` is set and we call
> scsi_check_sense(), for example to check the next UNIT ATTENTION, it
> will return NEEDS_RETRY, that I think will cause the issues we are
> seeing.
> 
> `sdev_target->expecting_lun_change` is reset only in
> scsi_decide_disposition() when `REPORT_LUNS` command returns with
> SAM_STAT_GOOD.
> That command is issued in scsi_report_lun_scan() called by
> __scsi_scan_target(), called for example by scsi_scan_target(),
> scsi_scan_host(), etc.
> 
> So, checking QEMU, we send VIRTIO_SCSI_EVT_RESET_RESCAN during hotplug
> and VIRTIO_SCSI_EVT_RESET_REMOVED during hotunplug. In both cases now we
> send also the UNIT ATTENTION.
> 
> In the virtio-scsi driver, when we receive VIRTIO_SCSI_EVT_RESET_RESCAN
> (hotplug) we call scsi_scan_target() or scsi_add_device(). Both of them
> will call __scsi_scan_target() at some points, sending `REPORT_LUNS`
> command to the device. This does not happen for
> VIRTIO_SCSI_EVT_RESET_REMOVED (hotunplug). Indeed if I remove the
> UNIT ATTENTION from the hotunplug in QEMU, everything works well.
> 
> So, I tried to add a scan also for VIRTIO_SCSI_EVT_RESET_REMOVED:
> 
> diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
> index bd5633667d01..c57658a63097 100644
> --- a/drivers/scsi/virtio_scsi.c
> +++ b/drivers/scsi/virtio_scsi.c
> @@ -291,6 +291,7 @@ static void virtscsi_handle_transport_reset(struct 
> virtio_scsi *vscsi,
>     }
>     break;
>     case VIRTIO_SCSI_EVT_RESET_REMOVED:
> +   scsi_scan_host(shost);
>     sdev = scsi_device_lookup(shost, 0, target, lun);
>     if (sdev) {
>     scsi_remove_device(sdev);
> 
> This somehow helps, now linux only breaks if the plug/unplug frequency
> is really high. If I put a 5 second sleep between plug/unplug events, it
> doesn't break (at least for the duration of my test which has been
> running for about 30 minutes, before it used to break after about a
> minute).
> 
> Another thing I noticed is that in QEMU maybe we should set the UNIT
> ATTENTION fi

[PATCH v2 1/2] vhost-scsi: Fix alignment handling with windows

2023-07-09 Thread Mike Christie

The linux block layer requires bios/requests to have lengths with a 512
byte alignment. Some drivers/layers like dm-crypt and the directi IO code
will test for it and just fail. Other drivers like SCSI just assume the
requirement is met and will end up in infinte retry loops. The problem
for drivers like SCSI is that it uses functions like blk_rq_cur_sectors
and blk_rq_sectors which divide the request's length by 512. If there's
lefovers then it just gets dropped. But other code in the block/scsi
layer may use blk_rq_bytes/blk_rq_cur_bytes and end up thinking there is
still data left and try to retry the cmd. We can then end up getting
stuck in retry loops where part of the block/scsi thinks there is data
left, but other parts think we want to do IOs of zero length.

Linux will always check for alignment, but windows will not. When
vhost-scsi then translates the iovec it gets from a windows guest to a
scatterlist, we can end up with sg items where the sg->length is not
divisible by 512 due to the misaligned offset:

sg[0].offset = 255;
sg[0].length = 3841;
sg...
sg[N].offset = 0;
sg[N].length = 255;

When the lio backends then convert the SG to bios or other iovecs, we
end up sending them with the same misaligned values and can hit the
issues above.

This just has us drop down to allocating a temp page and copying the data
when we detect a misaligned buffer and the IO is large enough that it
will get split into multiple bad IOs.

Signed-off-by: Mike Christie 
---
 drivers/vhost/scsi.c | 186 +--
 1 file changed, 161 insertions(+), 25 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index c83f7f043470..324e4b3846fa 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -25,6 +25,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -75,6 +77,9 @@ struct vhost_scsi_cmd {
u32 tvc_prot_sgl_count;
/* Saved unpacked SCSI LUN for vhost_scsi_target_queue_cmd() */
u32 tvc_lun;
+   u32 copied_iov:1;
+   const void *saved_iter_addr;
+   struct iov_iter saved_iter;
/* Pointer to the SGL formatted memory from virtio-scsi */
struct scatterlist *tvc_sgl;
struct scatterlist *tvc_prot_sgl;
@@ -328,8 +333,13 @@ static void vhost_scsi_release_cmd_res(struct se_cmd 
*se_cmd)
int i;
 
if (tv_cmd->tvc_sgl_count) {
-   for (i = 0; i < tv_cmd->tvc_sgl_count; i++)
-   put_page(sg_page(&tv_cmd->tvc_sgl[i]));
+   for (i = 0; i < tv_cmd->tvc_sgl_count; i++) {
+   if (tv_cmd->copied_iov)
+   __free_page(sg_page(&tv_cmd->tvc_sgl[i]));
+   else
+   put_page(sg_page(&tv_cmd->tvc_sgl[i]));
+   }
+   kfree(tv_cmd->saved_iter_addr);
}
if (tv_cmd->tvc_prot_sgl_count) {
for (i = 0; i < tv_cmd->tvc_prot_sgl_count; i++)
@@ -504,6 +514,28 @@ static void vhost_scsi_evt_work(struct vhost_work *work)
mutex_unlock(&vq->mutex);
 }
 
+static int vhost_scsi_copy_sgl_to_iov(struct vhost_scsi_cmd *cmd)
+{
+   struct iov_iter *iter = &cmd->saved_iter;
+   struct scatterlist *sg = cmd->tvc_sgl;
+   struct page *page;
+   size_t len;
+   int i;
+
+   for (i = 0; i < cmd->tvc_sgl_count; i++) {
+   page = sg_page(&sg[i]);
+   len = sg[i].length;
+
+   if (copy_page_to_iter(page, 0, len, iter) != len) {
+   pr_err("Could not copy data while handling misaligned 
cmd. Error %zu\n",
+  len);
+   return -1;
+   }
+   }
+
+   return 0;
+}
+
 /* Fill in status and signal that we are done processing this command
  *
  * This is scheduled in the vhost work queue so we are called with the owner
@@ -527,15 +559,20 @@ static void vhost_scsi_complete_cmd_work(struct 
vhost_work *work)
 
pr_debug("%s tv_cmd %p resid %u status %#02x\n", __func__,
cmd, se_cmd->residual_count, se_cmd->scsi_status);
-
memset(&v_rsp, 0, sizeof(v_rsp));
-   v_rsp.resid = cpu_to_vhost32(cmd->tvc_vq, 
se_cmd->residual_count);
-   /* TODO is status_qualifier field needed? */
-   v_rsp.status = se_cmd->scsi_status;
-   v_rsp.sense_len = cpu_to_vhost32(cmd->tvc_vq,
-se_cmd->scsi_sense_length);
-   memcpy(v_rsp.sense, cmd->tvc_sense_buf,
-  se_cmd->scsi_sense_length);
+
+   if (cmd->saved_iter_addr && vhost_scsi_copy_sgl_to_iov(cmd)) {
+   v_rsp.response = VIRTIO_SCSI_S_BAD_TARGET;
+

[PATCH v2 2/2] vhost-scsi: Rename vhost_scsi_iov_to_sgl

2023-07-09 Thread Mike Christie

Rename vhost_scsi_iov_to_sgl to vhost_scsi_map_iov_to_sgl so it matches
matches the naming style used for vhost_scsi_copy_iov_to_sgl.

Signed-off-by: Mike Christie 
---
 drivers/vhost/scsi.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 324e4b3846fa..abef0619c790 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -780,8 +780,8 @@ vhost_scsi_copy_iov_to_sgl(struct vhost_scsi_cmd *cmd, 
struct iov_iter *iter,
 }
 
 static int
-vhost_scsi_iov_to_sgl(struct vhost_scsi_cmd *cmd, struct iov_iter *iter,
- struct scatterlist *sg, int sg_count, bool is_prot)
+vhost_scsi_map_iov_to_sgl(struct vhost_scsi_cmd *cmd, struct iov_iter *iter,
+ struct scatterlist *sg, int sg_count, bool is_prot)
 {
struct scatterlist *p = sg;
size_t revert_bytes;
@@ -829,8 +829,9 @@ vhost_scsi_mapal(struct vhost_scsi_cmd *cmd,
pr_debug("%s prot_sg %p prot_sgl_count %u\n", __func__,
 cmd->tvc_prot_sgl, cmd->tvc_prot_sgl_count);
 
-   ret = vhost_scsi_iov_to_sgl(cmd, prot_iter, cmd->tvc_prot_sgl,
-   cmd->tvc_prot_sgl_count, true);
+   ret = vhost_scsi_map_iov_to_sgl(cmd, prot_iter,
+   cmd->tvc_prot_sgl,
+   cmd->tvc_prot_sgl_count, true);
if (ret < 0) {
cmd->tvc_prot_sgl_count = 0;
return ret;
@@ -846,8 +847,8 @@ vhost_scsi_mapal(struct vhost_scsi_cmd *cmd,
pr_debug("%s data_sg %p data_sgl_count %u\n", __func__,
  cmd->tvc_sgl, cmd->tvc_sgl_count);
 
-   ret = vhost_scsi_iov_to_sgl(cmd, data_iter, cmd->tvc_sgl,
-   cmd->tvc_sgl_count, false);
+   ret = vhost_scsi_map_iov_to_sgl(cmd, data_iter, cmd->tvc_sgl,
+   cmd->tvc_sgl_count, false);
if (ret == -EINVAL) {
sg_init_table(cmd->tvc_sgl, cmd->tvc_sgl_count);
ret = vhost_scsi_copy_iov_to_sgl(cmd, data_iter, cmd->tvc_sgl,
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v2 0/2] vhost-scsi: Fix IO hangs when using windows

2023-07-09 Thread Mike Christie

The following patches were made over Linus's tree and fix an issue
where windows guests will send iovecs with offset/lengths that result
in IOs that are not aligned to 512. The LIO layer will then send them
to Linux's FS/block layer but it requires 512 byte alignment, so
depending on the FS/block driver being used we will get IO errors or
hung IO.

The following patches have vhost-scsi detect when windows sends these
IOs and copy them to a bounce buffer. It then does some cleanup in
the related code.


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v9 16/17] vhost_scsi: add support for worker ioctls

2023-06-26 Thread Mike Christie

This has vhost-scsi support the worker ioctls by calling the
vhost_worker_ioctl helper.

With a single worker, the single thread becomes a bottlneck when trying
to use 3 or more virtqueues like:

fio --filename=/dev/sdb  --direct=1 --rw=randrw --bs=4k \
--ioengine=libaio --iodepth=128  --numjobs=3

With the patches and doing a worker per vq, we can scale to at least
16 vCPUs/vqs (that's my system limit) with the same command fio command
above with numjobs=16:

fio --filename=/dev/sdb  --direct=1 --rw=randrw --bs=4k \
--ioengine=libaio --iodepth=64  --numjobs=16

which gives around 2002K IOPs.

Note that for testing I dropped depth to 64 above because the vhost/virt
layer supports only 1024 total commands per device. And the only tuning I
did was set LIO's emulate_pr to 0 to avoid LIO's PR lock in the main IO
path which becomes an issue at around 12 jobs/virtqueues.

Signed-off-by: Mike Christie 
---
 drivers/vhost/scsi.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 2c3cf487cc71..c83f7f043470 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1927,6 +1927,14 @@ vhost_scsi_ioctl(struct file *f,
if (copy_from_user(&features, featurep, sizeof features))
return -EFAULT;
return vhost_scsi_set_features(vs, features);
+   case VHOST_NEW_WORKER:
+   case VHOST_FREE_WORKER:
+   case VHOST_ATTACH_VRING_WORKER:
+   case VHOST_GET_VRING_WORKER:
+   mutex_lock(&vs->dev.mutex);
+   r = vhost_worker_ioctl(&vs->dev, ioctl, argp);
+   mutex_unlock(&vs->dev.mutex);
+   return r;
default:
mutex_lock(&vs->dev.mutex);
r = vhost_dev_ioctl(&vs->dev, ioctl, argp);
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v9 13/17] vhost: add helper to parse userspace vring state/file

2023-06-26 Thread Mike Christie

The next patches add new vhost worker ioctls which will need to get a
vhost_virtqueue from a userspace struct which specifies the vq's index.
This moves the vhost_vring_ioctl code to do this to a helper so it can
be shared.

Signed-off-by: Mike Christie 
---
 drivers/vhost/vhost.c | 29 ++---
 1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 6cadbc6e6d11..12203d3893c5 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -599,6 +599,27 @@ static struct vhost_worker *vhost_worker_create(struct 
vhost_dev *dev)
return NULL;
 }
 
+static int vhost_get_vq_from_user(struct vhost_dev *dev, void __user *argp,
+ struct vhost_virtqueue **vq, u32 *id)
+{
+   u32 __user *idxp = argp;
+   u32 idx;
+   long r;
+
+   r = get_user(idx, idxp);
+   if (r < 0)
+   return r;
+
+   if (idx >= dev->nvqs)
+   return -ENOBUFS;
+
+   idx = array_index_nospec(idx, dev->nvqs);
+
+   *vq = dev->vqs[idx];
+   *id = idx;
+   return 0;
+}
+
 /* Caller should have device mutex */
 long vhost_dev_set_owner(struct vhost_dev *dev)
 {
@@ -1618,21 +1639,15 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned 
int ioctl, void __user *arg
struct file *eventfp, *filep = NULL;
bool pollstart = false, pollstop = false;
struct eventfd_ctx *ctx = NULL;
-   u32 __user *idxp = argp;
struct vhost_virtqueue *vq;
struct vhost_vring_state s;
struct vhost_vring_file f;
u32 idx;
long r;
 
-   r = get_user(idx, idxp);
+   r = vhost_get_vq_from_user(d, argp, &vq, &idx);
if (r < 0)
return r;
-   if (idx >= d->nvqs)
-   return -ENOBUFS;
-
-   idx = array_index_nospec(idx, d->nvqs);
-   vq = d->vqs[idx];
 
if (ioctl == VHOST_SET_VRING_NUM ||
ioctl == VHOST_SET_VRING_ADDR) {
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v9 05/17] vhost: take worker or vq instead of dev for queueing

2023-06-26 Thread Mike Christie

This patch has the core work queueing function take a worker for when we
support multiple workers. It also adds a helper that takes a vq during
queueing so modules can control which vq/worker to queue work on.

This temp leaves vhost_work_queue. It will be removed when the drivers
are converted in the next patches.

Signed-off-by: Mike Christie 
---
 drivers/vhost/vhost.c | 44 +++
 drivers/vhost/vhost.h |  1 +
 2 files changed, 29 insertions(+), 16 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index aafb23e12477..611e495eeb3c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -231,21 +231,10 @@ void vhost_poll_stop(struct vhost_poll *poll)
 }
 EXPORT_SYMBOL_GPL(vhost_poll_stop);
 
-void vhost_dev_flush(struct vhost_dev *dev)
+static bool vhost_worker_queue(struct vhost_worker *worker,
+  struct vhost_work *work)
 {
-   struct vhost_flush_struct flush;
-
-   init_completion(&flush.wait_event);
-   vhost_work_init(&flush.work, vhost_flush_work);
-
-   if (vhost_work_queue(dev, &flush.work))
-   wait_for_completion(&flush.wait_event);
-}
-EXPORT_SYMBOL_GPL(vhost_dev_flush);
-
-bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
-{
-   if (!dev->worker)
+   if (!worker)
return false;
/*
 * vsock can queue while we do a VHOST_SET_OWNER, so we have a smp_wmb
@@ -257,14 +246,37 @@ bool vhost_work_queue(struct vhost_dev *dev, struct 
vhost_work *work)
 * sure it was not in the list.
 * test_and_set_bit() implies a memory barrier.
 */
-   llist_add(&work->node, &dev->worker->work_list);
-   vhost_task_wake(dev->worker->vtsk);
+   llist_add(&work->node, &worker->work_list);
+   vhost_task_wake(worker->vtsk);
}
 
return true;
 }
+
+bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
+{
+   return vhost_worker_queue(dev->worker, work);
+}
 EXPORT_SYMBOL_GPL(vhost_work_queue);
 
+bool vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work)
+{
+   return vhost_worker_queue(vq->worker, work);
+}
+EXPORT_SYMBOL_GPL(vhost_vq_work_queue);
+
+void vhost_dev_flush(struct vhost_dev *dev)
+{
+   struct vhost_flush_struct flush;
+
+   init_completion(&flush.wait_event);
+   vhost_work_init(&flush.work, vhost_flush_work);
+
+   if (vhost_work_queue(dev, &flush.work))
+   wait_for_completion(&flush.wait_event);
+}
+EXPORT_SYMBOL_GPL(vhost_dev_flush);
+
 /* A lockless hint for busy polling code to exit the loop */
 bool vhost_vq_has_work(struct vhost_virtqueue *vq)
 {
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index d5728f986744..e2dd4558a1f9 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -198,6 +198,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *,
  struct vhost_log *log, unsigned int *log_num);
 void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
 
+bool vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work);
 bool vhost_vq_has_work(struct vhost_virtqueue *vq);
 bool vhost_vq_is_setup(struct vhost_virtqueue *vq);
 int vhost_vq_init_access(struct vhost_virtqueue *);
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v9 10/17] vhost_scsi: convert to vhost_vq_work_queue

2023-06-26 Thread Mike Christie

Convert from vhost_work_queue to vhost_vq_work_queue so we can
remove vhost_work_queue.

Signed-off-by: Mike Christie 
---
 drivers/vhost/scsi.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index a77c53bb035a..1668009bd489 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -353,8 +353,9 @@ static void vhost_scsi_release_cmd(struct se_cmd *se_cmd)
if (se_cmd->se_cmd_flags & SCF_SCSI_TMR_CDB) {
struct vhost_scsi_tmf *tmf = container_of(se_cmd,
struct vhost_scsi_tmf, se_cmd);
+   struct vhost_virtqueue *vq = &tmf->svq->vq;
 
-   vhost_work_queue(&tmf->vhost->dev, &tmf->vwork);
+   vhost_vq_work_queue(vq, &tmf->vwork);
} else {
struct vhost_scsi_cmd *cmd = container_of(se_cmd,
struct vhost_scsi_cmd, tvc_se_cmd);
@@ -1332,11 +1333,9 @@ static void vhost_scsi_ctl_handle_kick(struct vhost_work 
*work)
 }
 
 static void
-vhost_scsi_send_evt(struct vhost_scsi *vs,
-  struct vhost_scsi_tpg *tpg,
-  struct se_lun *lun,
-  u32 event,
-  u32 reason)
+vhost_scsi_send_evt(struct vhost_scsi *vs, struct vhost_virtqueue *vq,
+   struct vhost_scsi_tpg *tpg, struct se_lun *lun,
+   u32 event, u32 reason)
 {
struct vhost_scsi_evt *evt;
 
@@ -1358,7 +1357,7 @@ vhost_scsi_send_evt(struct vhost_scsi *vs,
}
 
llist_add(&evt->list, &vs->vs_event_list);
-   vhost_work_queue(&vs->dev, &vs->vs_event_work);
+   vhost_vq_work_queue(vq, &vs->vs_event_work);
 }
 
 static void vhost_scsi_evt_handle_kick(struct vhost_work *work)
@@ -1372,7 +1371,8 @@ static void vhost_scsi_evt_handle_kick(struct vhost_work 
*work)
goto out;
 
if (vs->vs_events_missed)
-   vhost_scsi_send_evt(vs, NULL, NULL, VIRTIO_SCSI_T_NO_EVENT, 0);
+   vhost_scsi_send_evt(vs, vq, NULL, NULL, VIRTIO_SCSI_T_NO_EVENT,
+   0);
 out:
mutex_unlock(&vq->mutex);
 }
@@ -1991,7 +1991,7 @@ vhost_scsi_do_plug(struct vhost_scsi_tpg *tpg,
goto unlock;
 
if (vhost_has_feature(vq, VIRTIO_SCSI_F_HOTPLUG))
-   vhost_scsi_send_evt(vs, tpg, lun,
+   vhost_scsi_send_evt(vs, vq, tpg, lun,
   VIRTIO_SCSI_T_TRANSPORT_RESET, reason);
 unlock:
mutex_unlock(&vq->mutex);
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v9 12/17] vhost: remove vhost_work_queue

2023-06-26 Thread Mike Christie

vhost_work_queue is no longer used. Each driver is using the poll or vq
based queueing, so remove vhost_work_queue.

Signed-off-by: Mike Christie 
---
 drivers/vhost/vhost.c | 6 --
 drivers/vhost/vhost.h | 5 ++---
 2 files changed, 2 insertions(+), 9 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index b6149ed8acb4..6cadbc6e6d11 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -255,12 +255,6 @@ static bool vhost_worker_queue(struct vhost_worker *worker,
return true;
 }
 
-bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
-{
-   return vhost_worker_queue(dev->worker, work);
-}
-EXPORT_SYMBOL_GPL(vhost_work_queue);
-
 bool vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work)
 {
return vhost_worker_queue(vq->worker, work);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 3882266fbbf3..83f78b6f563b 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -44,15 +44,14 @@ struct vhost_poll {
struct vhost_virtqueue  *vq;
 };
 
-void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
-bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
-
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
 __poll_t mask, struct vhost_dev *dev,
 struct vhost_virtqueue *vq);
 int vhost_poll_start(struct vhost_poll *poll, struct file *file);
 void vhost_poll_stop(struct vhost_poll *poll);
 void vhost_poll_queue(struct vhost_poll *poll);
+
+void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
 void vhost_dev_flush(struct vhost_dev *dev);
 
 struct vhost_log {
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v9 11/17] vhost_scsi: flush IO vqs then send TMF rsp

2023-06-26 Thread Mike Christie

With one worker we will always send the scsi cmd responses then send the
TMF rsp, because LIO will always complete the scsi cmds first then call
into us to send the TMF response.

With multiple workers, the IO vq workers could be running while the
TMF/ctl vq worker is running so this has us do a flush before completing
the TMF to make sure cmds are completed when it's work is later queued
and run.

Signed-off-by: Mike Christie 
---
 drivers/vhost/scsi.c | 21 ++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 1668009bd489..2c3cf487cc71 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1133,12 +1133,27 @@ static void vhost_scsi_tmf_resp_work(struct vhost_work 
*work)
 {
struct vhost_scsi_tmf *tmf = container_of(work, struct vhost_scsi_tmf,
  vwork);
-   int resp_code;
+   struct vhost_virtqueue *ctl_vq, *vq;
+   int resp_code, i;
+
+   if (tmf->scsi_resp == TMR_FUNCTION_COMPLETE) {
+   /*
+* Flush IO vqs that don't share a worker with the ctl to make
+* sure they have sent their responses before us.
+*/
+   ctl_vq = &tmf->vhost->vqs[VHOST_SCSI_VQ_CTL].vq;
+   for (i = VHOST_SCSI_VQ_IO; i < tmf->vhost->dev.nvqs; i++) {
+   vq = &tmf->vhost->vqs[i].vq;
+
+   if (vhost_vq_is_setup(vq) &&
+   vq->worker != ctl_vq->worker)
+   vhost_vq_flush(vq);
+   }
 
-   if (tmf->scsi_resp == TMR_FUNCTION_COMPLETE)
resp_code = VIRTIO_SCSI_S_FUNCTION_SUCCEEDED;
-   else
+   } else {
resp_code = VIRTIO_SCSI_S_FUNCTION_REJECTED;
+   }
 
vhost_scsi_send_tmf_resp(tmf->vhost, &tmf->svq->vq, tmf->in_iovs,
 tmf->vq_desc, &tmf->resp_iov, resp_code);
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v9 08/17] vhost_sock: convert to vhost_vq_work_queue

2023-06-26 Thread Mike Christie

Convert from vhost_work_queue to vhost_vq_work_queue, so we can drop
vhost_work_queue.

Signed-off-by: Mike Christie 
---
 drivers/vhost/vsock.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 6578db78f0ae..817d377a3f36 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -285,7 +285,7 @@ vhost_transport_send_pkt(struct sk_buff *skb)
atomic_inc(&vsock->queued_replies);
 
virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
-   vhost_work_queue(&vsock->dev, &vsock->send_pkt_work);
+   vhost_vq_work_queue(&vsock->vqs[VSOCK_VQ_RX], &vsock->send_pkt_work);
 
rcu_read_unlock();
return len;
@@ -583,7 +583,7 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
/* Some packets may have been queued before the device was started,
 * let's kick the send worker to send them.
 */
-   vhost_work_queue(&vsock->dev, &vsock->send_pkt_work);
+   vhost_vq_work_queue(&vsock->vqs[VSOCK_VQ_RX], &vsock->send_pkt_work);
 
mutex_unlock(&vsock->dev.mutex);
return 0;
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v9 17/17] vhost: Allow worker switching while work is queueing

2023-06-26 Thread Mike Christie

This patch drops the requirement that we can only switch workers if work
has not been queued by using RCU for the vq based queueing paths and a
mutex for the device wide flush.

We can also use this to support SIGKILL properly in the future where we
should exit almost immediately after getting that signal. With this
patch, when get_signal returns true, we can set the vq->worker to NULL
and do a synchronize_rcu to prevent new work from being queued to the
vhost_task that has been killed.

Signed-off-by: Mike Christie 
---
 drivers/vhost/vhost.c  | 153 +++--
 drivers/vhost/vhost.h  |   4 +-
 include/uapi/linux/vhost.h |   4 +-
 3 files changed, 115 insertions(+), 46 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index bfba91ecbd0a..c71d573f1c94 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -233,16 +233,9 @@ void vhost_poll_stop(struct vhost_poll *poll)
 }
 EXPORT_SYMBOL_GPL(vhost_poll_stop);
 
-static bool vhost_worker_queue(struct vhost_worker *worker,
+static void vhost_worker_queue(struct vhost_worker *worker,
   struct vhost_work *work)
 {
-   if (!worker)
-   return false;
-   /*
-* vsock can queue while we do a VHOST_SET_OWNER, so we have a smp_wmb
-* when setting up the worker. We don't have a smp_rmb here because
-* test_and_set_bit gives us a mb already.
-*/
if (!test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) {
/* We can only add the work to the list after we're
 * sure it was not in the list.
@@ -251,47 +244,85 @@ static bool vhost_worker_queue(struct vhost_worker 
*worker,
llist_add(&work->node, &worker->work_list);
vhost_task_wake(worker->vtsk);
}
-
-   return true;
 }
 
 bool vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work)
 {
-   return vhost_worker_queue(vq->worker, work);
+   struct vhost_worker *worker;
+   bool queued = false;
+
+   rcu_read_lock();
+   worker = rcu_dereference(vq->worker);
+   if (worker) {
+   queued = true;
+   vhost_worker_queue(worker, work);
+   }
+   rcu_read_unlock();
+
+   return queued;
 }
 EXPORT_SYMBOL_GPL(vhost_vq_work_queue);
 
-static void vhost_worker_flush(struct vhost_worker *worker)
+void vhost_vq_flush(struct vhost_virtqueue *vq)
 {
struct vhost_flush_struct flush;
 
init_completion(&flush.wait_event);
vhost_work_init(&flush.work, vhost_flush_work);
 
-   if (vhost_worker_queue(worker, &flush.work))
+   if (vhost_vq_work_queue(vq, &flush.work))
wait_for_completion(&flush.wait_event);
 }
+EXPORT_SYMBOL_GPL(vhost_vq_flush);
 
-void vhost_vq_flush(struct vhost_virtqueue *vq)
+/**
+ * vhost_worker_flush - flush a worker
+ * @worker: worker to flush
+ *
+ * This does not use RCU to protect the worker, so the device or worker
+ * mutex must be held.
+ */
+static void vhost_worker_flush(struct vhost_worker *worker)
 {
-   vhost_worker_flush(vq->worker);
+   struct vhost_flush_struct flush;
+
+   init_completion(&flush.wait_event);
+   vhost_work_init(&flush.work, vhost_flush_work);
+
+   vhost_worker_queue(worker, &flush.work);
+   wait_for_completion(&flush.wait_event);
 }
-EXPORT_SYMBOL_GPL(vhost_vq_flush);
 
 void vhost_dev_flush(struct vhost_dev *dev)
 {
struct vhost_worker *worker;
unsigned long i;
 
-   xa_for_each(&dev->worker_xa, i, worker)
+   xa_for_each(&dev->worker_xa, i, worker) {
+   mutex_lock(&worker->mutex);
+   if (!worker->attachment_cnt) {
+   mutex_unlock(&worker->mutex);
+   continue;
+   }
vhost_worker_flush(worker);
+   mutex_unlock(&worker->mutex);
+   }
 }
 EXPORT_SYMBOL_GPL(vhost_dev_flush);
 
 /* A lockless hint for busy polling code to exit the loop */
 bool vhost_vq_has_work(struct vhost_virtqueue *vq)
 {
-   return !llist_empty(&vq->worker->work_list);
+   struct vhost_worker *worker;
+   bool has_work = false;
+
+   rcu_read_lock();
+   worker = rcu_dereference(vq->worker);
+   if (worker && !llist_empty(&worker->work_list))
+   has_work = true;
+   rcu_read_unlock();
+
+   return has_work;
 }
 EXPORT_SYMBOL_GPL(vhost_vq_has_work);
 
@@ -356,7 +387,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
vq->busyloop_timeout = 0;
vq->umem = NULL;
vq->iotlb = NULL;
-   vq->worker = NULL;
+   rcu_assign_pointer(vq->worker, NULL);
vhost_vring_call_reset(&vq->call_ctx);
__vhost_vq_meta_reset(vq);
 }
@@ -578,7 +6

[PATCH v9 14/17] vhost: replace single worker pointer with xarray

2023-06-26 Thread Mike Christie

The next patch allows userspace to create multiple workers per device,
so this patch replaces the vhost_worker pointer with an xarray so we
can store mupltiple workers and look them up.

Signed-off-by: Mike Christie 
---
 drivers/vhost/vhost.c | 64 ---
 drivers/vhost/vhost.h |  3 +-
 2 files changed, 50 insertions(+), 17 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 12203d3893c5..ffbaf7d32e2c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -280,7 +280,11 @@ EXPORT_SYMBOL_GPL(vhost_vq_flush);
 
 void vhost_dev_flush(struct vhost_dev *dev)
 {
-   vhost_worker_flush(dev->worker);
+   struct vhost_worker *worker;
+   unsigned long i;
+
+   xa_for_each(&dev->worker_xa, i, worker)
+   vhost_worker_flush(worker);
 }
 EXPORT_SYMBOL_GPL(vhost_dev_flush);
 
@@ -482,7 +486,6 @@ void vhost_dev_init(struct vhost_dev *dev,
dev->umem = NULL;
dev->iotlb = NULL;
dev->mm = NULL;
-   dev->worker = NULL;
dev->iov_limit = iov_limit;
dev->weight = weight;
dev->byte_weight = byte_weight;
@@ -492,7 +495,7 @@ void vhost_dev_init(struct vhost_dev *dev,
INIT_LIST_HEAD(&dev->read_list);
INIT_LIST_HEAD(&dev->pending_list);
spin_lock_init(&dev->iotlb_lock);
-
+   xa_init_flags(&dev->worker_xa, XA_FLAGS_ALLOC);
 
for (i = 0; i < dev->nvqs; ++i) {
vq = dev->vqs[i];
@@ -554,15 +557,35 @@ static void vhost_detach_mm(struct vhost_dev *dev)
dev->mm = NULL;
 }
 
-static void vhost_worker_free(struct vhost_dev *dev)
+static void vhost_worker_destroy(struct vhost_dev *dev,
+struct vhost_worker *worker)
+{
+   if (!worker)
+   return;
+
+   WARN_ON(!llist_empty(&worker->work_list));
+   xa_erase(&dev->worker_xa, worker->id);
+   vhost_task_stop(worker->vtsk);
+   kfree(worker);
+}
+
+static void vhost_workers_free(struct vhost_dev *dev)
 {
-   if (!dev->worker)
+   struct vhost_worker *worker;
+   unsigned long i;
+
+   if (!dev->use_worker)
return;
 
-   WARN_ON(!llist_empty(&dev->worker->work_list));
-   vhost_task_stop(dev->worker->vtsk);
-   kfree(dev->worker);
-   dev->worker = NULL;
+   for (i = 0; i < dev->nvqs; i++)
+   dev->vqs[i]->worker = NULL;
+   /*
+* Free the default worker we created and cleanup workers userspace
+* created but couldn't clean up (it forgot or crashed).
+*/
+   xa_for_each(&dev->worker_xa, i, worker)
+   vhost_worker_destroy(dev, worker);
+   xa_destroy(&dev->worker_xa);
 }
 
 static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
@@ -570,6 +593,8 @@ static struct vhost_worker *vhost_worker_create(struct 
vhost_dev *dev)
struct vhost_worker *worker;
struct vhost_task *vtsk;
char name[TASK_COMM_LEN];
+   int ret;
+   u32 id;
 
worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
if (!worker)
@@ -584,16 +609,18 @@ static struct vhost_worker *vhost_worker_create(struct 
vhost_dev *dev)
init_llist_head(&worker->work_list);
worker->kcov_handle = kcov_common_handle();
worker->vtsk = vtsk;
-   /*
-* vsock can already try to queue so make sure llist and vtsk are both
-* set before vhost_work_queue sees dev->worker is set.
-*/
-   smp_wmb();
-   dev->worker = worker;
 
vhost_task_start(vtsk);
+
+   ret = xa_alloc(&dev->worker_xa, &id, worker, xa_limit_32b, GFP_KERNEL);
+   if (ret < 0)
+   goto stop_worker;
+   worker->id = id;
+
return worker;
 
+stop_worker:
+   vhost_task_stop(vtsk);
 free_worker:
kfree(worker);
return NULL;
@@ -650,6 +677,11 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
err = -ENOMEM;
goto err_worker;
}
+   /*
+* vsock can already try to queue so make sure the worker
+* is setup before vhost_vq_work_queue sees vq->worker is set.
+*/
+   smp_wmb();
 
for (i = 0; i < dev->nvqs; i++)
dev->vqs[i]->worker = worker;
@@ -751,7 +783,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
dev->iotlb = NULL;
vhost_clear_msg(dev);
wake_up_interruptible_poll(&dev->wait, EPOLLIN | EPOLLRDNORM);
-   vhost_worker_free(dev);
+   vhost_workers_free(dev);
vhost_detach_mm(dev);
 }
 EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 83f78b6f563b

[PATCH v9 15/17] vhost: allow userspace to create workers

2023-06-26 Thread Mike Christie

For vhost-scsi with 3 vqs or more and a workload that tries to use
them in parallel like:

fio --filename=/dev/sdb  --direct=1 --rw=randrw --bs=4k \
--ioengine=libaio --iodepth=128  --numjobs=3

the single vhost worker thread will become a bottlneck and we are stuck
at around 500K IOPs no matter how many jobs, virtqueues, and CPUs are
used.

To better utilize virtqueues and available CPUs, this patch allows
userspace to create workers and bind them to vqs. You can have N workers
per dev and also share N workers with M vqs on that dev.

This patch adds the interface related code and the next patch will hook
vhost-scsi into it. The patches do not try to hook net and vsock into
the interface because:

1. multiple workers don't seem to help vsock. The problem is that with
only 2 virtqueues we never fully use the existing worker when doing
bidirectional tests. This seems to match vhost-scsi where we don't see
the worker as a bottleneck until 3 virtqueues are used.

2. net already has a way to use multiple workers.

Signed-off-by: Mike Christie 
---
 drivers/vhost/vhost.c| 142 ++-
 drivers/vhost/vhost.h|   3 +
 include/uapi/linux/vhost.h   |  33 +++
 include/uapi/linux/vhost_types.h |  16 
 4 files changed, 193 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index ffbaf7d32e2c..bfba91ecbd0a 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -626,6 +626,76 @@ static struct vhost_worker *vhost_worker_create(struct 
vhost_dev *dev)
return NULL;
 }
 
+/* Caller must have device mutex */
+static void __vhost_vq_attach_worker(struct vhost_virtqueue *vq,
+struct vhost_worker *worker)
+{
+   if (vq->worker)
+   vq->worker->attachment_cnt--;
+   worker->attachment_cnt++;
+   vq->worker = worker;
+}
+
+ /* Caller must have device and virtqueue mutex */
+static int vhost_vq_attach_worker(struct vhost_virtqueue *vq,
+ struct vhost_vring_worker *info)
+{
+   unsigned long index = info->worker_id;
+   struct vhost_dev *dev = vq->dev;
+   struct vhost_worker *worker;
+
+   if (!dev->use_worker)
+   return -EINVAL;
+
+   /*
+* We only allow userspace to set a virtqueue's worker if it's not
+* active and polling is not enabled. We also assume drivers
+* supporting this will not be internally queueing works directly or
+* via calls like vhost_dev_flush at this time.
+*/
+   if (vhost_vq_get_backend(vq) || vq->kick)
+   return -EBUSY;
+
+   worker = xa_find(&dev->worker_xa, &index, UINT_MAX, XA_PRESENT);
+   if (!worker || worker->id != info->worker_id)
+   return -ENODEV;
+
+   __vhost_vq_attach_worker(vq, worker);
+   return 0;
+}
+
+/* Caller must have device mutex */
+static int vhost_new_worker(struct vhost_dev *dev,
+   struct vhost_worker_state *info)
+{
+   struct vhost_worker *worker;
+
+   worker = vhost_worker_create(dev);
+   if (!worker)
+   return -ENOMEM;
+
+   info->worker_id = worker->id;
+   return 0;
+}
+
+/* Caller must have device mutex */
+static int vhost_free_worker(struct vhost_dev *dev,
+struct vhost_worker_state *info)
+{
+   unsigned long index = info->worker_id;
+   struct vhost_worker *worker;
+
+   worker = xa_find(&dev->worker_xa, &index, UINT_MAX, XA_PRESENT);
+   if (!worker || worker->id != info->worker_id)
+   return -ENODEV;
+
+   if (worker->attachment_cnt)
+   return -EBUSY;
+
+   vhost_worker_destroy(dev, worker);
+   return 0;
+}
+
 static int vhost_get_vq_from_user(struct vhost_dev *dev, void __user *argp,
  struct vhost_virtqueue **vq, u32 *id)
 {
@@ -647,6 +717,76 @@ static int vhost_get_vq_from_user(struct vhost_dev *dev, 
void __user *argp,
return 0;
 }
 
+/* Caller must have device mutex */
+long vhost_worker_ioctl(struct vhost_dev *dev, unsigned int ioctl,
+   void __user *argp)
+{
+   struct vhost_vring_worker ring_worker;
+   struct vhost_worker_state state;
+   struct vhost_virtqueue *vq;
+   long ret;
+   u32 idx;
+
+   if (!dev->use_worker)
+   return -EINVAL;
+
+   if (!vhost_dev_has_owner(dev))
+   return -EINVAL;
+
+   ret = vhost_dev_check_owner(dev);
+   if (ret)
+   return ret;
+
+   switch (ioctl) {
+   /* dev worker ioctls */
+   case VHOST_NEW_WORKER:
+   ret = vhost_new_worker(dev, &state);
+   if (!ret && copy_to_user(argp, &state, sizeof(state)))
+   ret = -EFAULT;
+   return ret;
+   case VH

[PATCH v9 03/17] vhost: add vhost_worker pointer to vhost_virtqueue

2023-06-26 Thread Mike Christie

This patchset allows userspace to map vqs to different workers. This
patch adds a worker pointer to the vq so in later patches in this set
we can queue/flush specific vqs and their workers.

Signed-off-by: Mike Christie 
---
 drivers/vhost/vhost.c | 21 ++---
 drivers/vhost/vhost.h |  1 +
 2 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index dfd96cf6a152..db88464c1691 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -333,6 +333,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
vq->busyloop_timeout = 0;
vq->umem = NULL;
vq->iotlb = NULL;
+   vq->worker = NULL;
vhost_vring_call_reset(&vq->call_ctx);
__vhost_vq_meta_reset(vq);
 }
@@ -545,7 +546,7 @@ static void vhost_worker_free(struct vhost_dev *dev)
dev->worker = NULL;
 }
 
-static int vhost_worker_create(struct vhost_dev *dev)
+static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
 {
struct vhost_worker *worker;
struct vhost_task *vtsk;
@@ -553,7 +554,7 @@ static int vhost_worker_create(struct vhost_dev *dev)
 
worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
if (!worker)
-   return -ENOMEM;
+   return NULL;
 
snprintf(name, sizeof(name), "vhost-%d", current->pid);
 
@@ -572,17 +573,18 @@ static int vhost_worker_create(struct vhost_dev *dev)
dev->worker = worker;
 
vhost_task_start(vtsk);
-   return 0;
+   return worker;
 
 free_worker:
kfree(worker);
-   return -ENOMEM;
+   return NULL;
 }
 
 /* Caller should have device mutex */
 long vhost_dev_set_owner(struct vhost_dev *dev)
 {
-   int err;
+   struct vhost_worker *worker;
+   int err, i;
 
/* Is there an owner already? */
if (vhost_dev_has_owner(dev)) {
@@ -603,9 +605,14 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 * below since we don't have to worry about vsock queueing
 * while we free the worker.
 */
-   err = vhost_worker_create(dev);
-   if (err)
+   worker = vhost_worker_create(dev);
+   if (!worker) {
+   err = -ENOMEM;
goto err_worker;
+   }
+
+   for (i = 0; i < dev->nvqs; i++)
+   dev->vqs[i]->worker = worker;
}
 
return 0;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 6cd72d0b5245..0baacc245934 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -74,6 +74,7 @@ struct vhost_vring_call {
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
struct vhost_dev *dev;
+   struct vhost_worker *worker;
 
/* The actual ring of buffers. */
struct mutex mutex;
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v9 02/17] vhost: dynamically allocate vhost_worker

2023-06-26 Thread Mike Christie

This patchset allows us to allocate multiple workers, so this has us
move from the vhost_worker that's embedded in the vhost_dev to
dynamically allocating it.

Signed-off-by: Mike Christie 
---
 drivers/vhost/vhost.c | 66 ---
 drivers/vhost/vhost.h |  4 +--
 2 files changed, 45 insertions(+), 25 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 82966ffb4a5c..dfd96cf6a152 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -235,36 +235,40 @@ void vhost_dev_flush(struct vhost_dev *dev)
 {
struct vhost_flush_struct flush;
 
-   if (dev->worker.vtsk) {
-   init_completion(&flush.wait_event);
-   vhost_work_init(&flush.work, vhost_flush_work);
+   init_completion(&flush.wait_event);
+   vhost_work_init(&flush.work, vhost_flush_work);
 
-   vhost_work_queue(dev, &flush.work);
+   if (vhost_work_queue(dev, &flush.work))
wait_for_completion(&flush.wait_event);
-   }
 }
 EXPORT_SYMBOL_GPL(vhost_dev_flush);
 
-void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
+bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 {
-   if (!dev->worker.vtsk)
-   return;
-
+   if (!dev->worker)
+   return false;
+   /*
+* vsock can queue while we do a VHOST_SET_OWNER, so we have a smp_wmb
+* when setting up the worker. We don't have a smp_rmb here because
+* test_and_set_bit gives us a mb already.
+*/
if (!test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) {
/* We can only add the work to the list after we're
 * sure it was not in the list.
 * test_and_set_bit() implies a memory barrier.
 */
-   llist_add(&work->node, &dev->worker.work_list);
-   vhost_task_wake(dev->worker.vtsk);
+   llist_add(&work->node, &dev->worker->work_list);
+   vhost_task_wake(dev->worker->vtsk);
}
+
+   return true;
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
 
 /* A lockless hint for busy polling code to exit the loop */
 bool vhost_has_work(struct vhost_dev *dev)
 {
-   return !llist_empty(&dev->worker.work_list);
+   return !llist_empty(&dev->worker->work_list);
 }
 EXPORT_SYMBOL_GPL(vhost_has_work);
 
@@ -458,8 +462,7 @@ void vhost_dev_init(struct vhost_dev *dev,
dev->umem = NULL;
dev->iotlb = NULL;
dev->mm = NULL;
-   memset(&dev->worker, 0, sizeof(dev->worker));
-   init_llist_head(&dev->worker.work_list);
+   dev->worker = NULL;
dev->iov_limit = iov_limit;
dev->weight = weight;
dev->byte_weight = byte_weight;
@@ -533,30 +536,47 @@ static void vhost_detach_mm(struct vhost_dev *dev)
 
 static void vhost_worker_free(struct vhost_dev *dev)
 {
-   if (!dev->worker.vtsk)
+   if (!dev->worker)
return;
 
-   WARN_ON(!llist_empty(&dev->worker.work_list));
-   vhost_task_stop(dev->worker.vtsk);
-   dev->worker.kcov_handle = 0;
-   dev->worker.vtsk = NULL;
+   WARN_ON(!llist_empty(&dev->worker->work_list));
+   vhost_task_stop(dev->worker->vtsk);
+   kfree(dev->worker);
+   dev->worker = NULL;
 }
 
 static int vhost_worker_create(struct vhost_dev *dev)
 {
+   struct vhost_worker *worker;
struct vhost_task *vtsk;
char name[TASK_COMM_LEN];
 
+   worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
+   if (!worker)
+   return -ENOMEM;
+
snprintf(name, sizeof(name), "vhost-%d", current->pid);
 
-   vtsk = vhost_task_create(vhost_worker, &dev->worker, name);
+   vtsk = vhost_task_create(vhost_worker, worker, name);
if (!vtsk)
-   return -ENOMEM;
+   goto free_worker;
+
+   init_llist_head(&worker->work_list);
+   worker->kcov_handle = kcov_common_handle();
+   worker->vtsk = vtsk;
+   /*
+* vsock can already try to queue so make sure llist and vtsk are both
+* set before vhost_work_queue sees dev->worker is set.
+*/
+   smp_wmb();
+   dev->worker = worker;
 
-   dev->worker.kcov_handle = kcov_common_handle();
-   dev->worker.vtsk = vtsk;
vhost_task_start(vtsk);
return 0;
+
+free_worker:
+   kfree(worker);
+   return -ENOMEM;
 }
 
 /* Caller should have device mutex */
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 37ce869f8a5c..6cd72d0b5245 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -44,7 +44,7 @@ struct vhost_poll {
 };
 
 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
-void vhost_w

[PATCH v9 09/17] vhost_scsi: make SCSI cmd completion per vq

2023-06-26 Thread Mike Christie

This patch separates the scsi cmd completion code paths so we can complete
cmds based on their vq instead of having all cmds complete on the same
worker/CPU. This will be useful with the next patches that allow us to
create mulitple worker threads and bind them to different vqs, and we can
have completions running on different threads/CPUs.

Signed-off-by: Mike Christie 
Reviewed-by: Stefan Hajnoczi 
---
 drivers/vhost/scsi.c | 56 
 1 file changed, 26 insertions(+), 30 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index bb10fa4bb4f6..a77c53bb035a 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -167,6 +167,7 @@ MODULE_PARM_DESC(max_io_vqs, "Set the max number of IO 
virtqueues a vhost scsi d
 
 struct vhost_scsi_virtqueue {
struct vhost_virtqueue vq;
+   struct vhost_scsi *vs;
/*
 * Reference counting for inflight reqs, used for flush operation. At
 * each time, one reference tracks new commands submitted, while we
@@ -181,6 +182,9 @@ struct vhost_scsi_virtqueue {
struct vhost_scsi_cmd *scsi_cmds;
struct sbitmap scsi_tags;
int max_cmds;
+
+   struct vhost_work completion_work;
+   struct llist_head completion_list;
 };
 
 struct vhost_scsi {
@@ -190,12 +194,8 @@ struct vhost_scsi {
 
struct vhost_dev dev;
struct vhost_scsi_virtqueue *vqs;
-   unsigned long *compl_bitmap;
struct vhost_scsi_inflight **old_inflight;
 
-   struct vhost_work vs_completion_work; /* cmd completion work item */
-   struct llist_head vs_completion_list; /* cmd completion queue */
-
struct vhost_work vs_event_work; /* evt injection work item */
struct llist_head vs_event_list; /* evt injection queue */
 
@@ -358,10 +358,11 @@ static void vhost_scsi_release_cmd(struct se_cmd *se_cmd)
} else {
struct vhost_scsi_cmd *cmd = container_of(se_cmd,
struct vhost_scsi_cmd, tvc_se_cmd);
-   struct vhost_scsi *vs = cmd->tvc_vhost;
+   struct vhost_scsi_virtqueue *svq =  container_of(cmd->tvc_vq,
+   struct vhost_scsi_virtqueue, vq);
 
-   llist_add(&cmd->tvc_completion_list, &vs->vs_completion_list);
-   vhost_work_queue(&vs->dev, &vs->vs_completion_work);
+   llist_add(&cmd->tvc_completion_list, &svq->completion_list);
+   vhost_vq_work_queue(&svq->vq, &svq->completion_work);
}
 }
 
@@ -509,17 +510,17 @@ static void vhost_scsi_evt_work(struct vhost_work *work)
  */
 static void vhost_scsi_complete_cmd_work(struct vhost_work *work)
 {
-   struct vhost_scsi *vs = container_of(work, struct vhost_scsi,
-   vs_completion_work);
+   struct vhost_scsi_virtqueue *svq = container_of(work,
+   struct vhost_scsi_virtqueue, completion_work);
struct virtio_scsi_cmd_resp v_rsp;
struct vhost_scsi_cmd *cmd, *t;
struct llist_node *llnode;
struct se_cmd *se_cmd;
struct iov_iter iov_iter;
-   int ret, vq;
+   bool signal = false;
+   int ret;
 
-   bitmap_zero(vs->compl_bitmap, vs->dev.nvqs);
-   llnode = llist_del_all(&vs->vs_completion_list);
+   llnode = llist_del_all(&svq->completion_list);
llist_for_each_entry_safe(cmd, t, llnode, tvc_completion_list) {
se_cmd = &cmd->tvc_se_cmd;
 
@@ -539,21 +540,17 @@ static void vhost_scsi_complete_cmd_work(struct 
vhost_work *work)
  cmd->tvc_in_iovs, sizeof(v_rsp));
ret = copy_to_iter(&v_rsp, sizeof(v_rsp), &iov_iter);
if (likely(ret == sizeof(v_rsp))) {
-   struct vhost_scsi_virtqueue *q;
+   signal = true;
+
vhost_add_used(cmd->tvc_vq, cmd->tvc_vq_desc, 0);
-   q = container_of(cmd->tvc_vq, struct 
vhost_scsi_virtqueue, vq);
-   vq = q - vs->vqs;
-   __set_bit(vq, vs->compl_bitmap);
} else
pr_err("Faulted on virtio_scsi_cmd_resp\n");
 
vhost_scsi_release_cmd_res(se_cmd);
}
 
-   vq = -1;
-   while ((vq = find_next_bit(vs->compl_bitmap, vs->dev.nvqs, vq + 1))
-   < vs->dev.nvqs)
-   vhost_signal(&vs->dev, &vs->vqs[vq].vq);
+   if (signal)
+   vhost_signal(&svq->vs->dev, &svq->vq);
 }
 
 static struct vhost_scsi_cmd *
@@ -1770,6 +1767,7 @@ static int vhost_scsi_set_features(struct vhost_scsi *vs, 
u64 features)
 
 static int vhost_scsi_open(struct inode *inode, struct file *f)
 {
+   struct vhost_scsi_virtqueue *svq;

[PATCH v9 04/17] vhost, vhost_net: add helper to check if vq has work

2023-06-26 Thread Mike Christie

In the next patches each vq might have different workers so one could
have work but others do not. For net, we only want to check specific vqs,
so this adds a helper to check if a vq has work pending and converts
vhost-net to use it.

Signed-off-by: Mike Christie 
Acked-by: Jason Wang 
---
 drivers/vhost/net.c   | 2 +-
 drivers/vhost/vhost.c | 6 +++---
 drivers/vhost/vhost.h | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index ae2273196b0c..98bb957eb3b9 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -546,7 +546,7 @@ static void vhost_net_busy_poll(struct vhost_net *net,
endtime = busy_clock() + busyloop_timeout;
 
while (vhost_can_busy_poll(endtime)) {
-   if (vhost_has_work(&net->dev)) {
+   if (vhost_vq_has_work(vq)) {
*busyloop_intr = true;
break;
}
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index db88464c1691..aafb23e12477 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -266,11 +266,11 @@ bool vhost_work_queue(struct vhost_dev *dev, struct 
vhost_work *work)
 EXPORT_SYMBOL_GPL(vhost_work_queue);
 
 /* A lockless hint for busy polling code to exit the loop */
-bool vhost_has_work(struct vhost_dev *dev)
+bool vhost_vq_has_work(struct vhost_virtqueue *vq)
 {
-   return !llist_empty(&dev->worker->work_list);
+   return !llist_empty(&vq->worker->work_list);
 }
-EXPORT_SYMBOL_GPL(vhost_has_work);
+EXPORT_SYMBOL_GPL(vhost_vq_has_work);
 
 void vhost_poll_queue(struct vhost_poll *poll)
 {
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 0baacc245934..d5728f986744 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -45,7 +45,6 @@ struct vhost_poll {
 
 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
 bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
-bool vhost_has_work(struct vhost_dev *dev);
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
 __poll_t mask, struct vhost_dev *dev);
@@ -199,6 +198,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *,
  struct vhost_log *log, unsigned int *log_num);
 void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
 
+bool vhost_vq_has_work(struct vhost_virtqueue *vq);
 bool vhost_vq_is_setup(struct vhost_virtqueue *vq);
 int vhost_vq_init_access(struct vhost_virtqueue *);
 int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v9 06/17] vhost: take worker or vq for flushing

2023-06-26 Thread Mike Christie

This patch has the core work flush function take a worker. When we
support multiple workers we can then flush each worker during device
removal, stoppage, etc. It also adds a helper to flush specific
virtqueues, so vhost-scsi can flush IO vqs from it's ctl vq.

Signed-off-by: Mike Christie 
---
 drivers/vhost/vhost.c | 15 +--
 drivers/vhost/vhost.h |  1 +
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 611e495eeb3c..2ea107e867a1 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -265,16 +265,27 @@ bool vhost_vq_work_queue(struct vhost_virtqueue *vq, 
struct vhost_work *work)
 }
 EXPORT_SYMBOL_GPL(vhost_vq_work_queue);
 
-void vhost_dev_flush(struct vhost_dev *dev)
+static void vhost_worker_flush(struct vhost_worker *worker)
 {
struct vhost_flush_struct flush;
 
init_completion(&flush.wait_event);
vhost_work_init(&flush.work, vhost_flush_work);
 
-   if (vhost_work_queue(dev, &flush.work))
+   if (vhost_worker_queue(worker, &flush.work))
wait_for_completion(&flush.wait_event);
 }
+
+void vhost_vq_flush(struct vhost_virtqueue *vq)
+{
+   vhost_worker_flush(vq->worker);
+}
+EXPORT_SYMBOL_GPL(vhost_vq_flush);
+
+void vhost_dev_flush(struct vhost_dev *dev)
+{
+   vhost_worker_flush(dev->worker);
+}
 EXPORT_SYMBOL_GPL(vhost_dev_flush);
 
 /* A lockless hint for busy polling code to exit the loop */
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index e2dd4558a1f9..f208f9923c88 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -198,6 +198,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *,
  struct vhost_log *log, unsigned int *log_num);
 void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
 
+void vhost_vq_flush(struct vhost_virtqueue *vq);
 bool vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work);
 bool vhost_vq_has_work(struct vhost_virtqueue *vq);
 bool vhost_vq_is_setup(struct vhost_virtqueue *vq);
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v9 01/17] vhost: create worker at end of vhost_dev_set_owner

2023-06-26 Thread Mike Christie

vsock can start queueing work after VHOST_VSOCK_SET_GUEST_CID, so
after we have called vhost_worker_create it can be calling
vhost_work_queue and trying to access the vhost worker/task. If
vhost_dev_alloc_iovecs fails, then vhost_worker_free could free
the worker/task from under vsock.

This moves vhost_worker_create to the end of vhost_dev_set_owner
where we know we can no longer fail in that path. If it fails
after the VHOST_SET_OWNER and userspace closes the device, then
the normal vsock release handling will do the right thing.

Signed-off-by: Mike Christie 
---
 drivers/vhost/vhost.c | 19 +--
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 60c9ebd629dd..82966ffb4a5c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -572,20 +572,27 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 
vhost_attach_mm(dev);
 
+   err = vhost_dev_alloc_iovecs(dev);
+   if (err)
+   goto err_iovecs;
+
if (dev->use_worker) {
+   /*
+* This should be done last, because vsock can queue work
+* before VHOST_SET_OWNER so it simplifies the failure path
+* below since we don't have to worry about vsock queueing
+* while we free the worker.
+*/
err = vhost_worker_create(dev);
if (err)
goto err_worker;
}
 
-   err = vhost_dev_alloc_iovecs(dev);
-   if (err)
-   goto err_iovecs;
-
return 0;
-err_iovecs:
-   vhost_worker_free(dev);
+
 err_worker:
+   vhost_dev_free_iovecs(dev);
+err_iovecs:
vhost_detach_mm(dev);
 err_mm:
return err;
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v9 00/17] vhost: multiple worker support

2023-06-26 Thread Mike Christie

The following patches were built over Linux's tree. The also apply over
the mst vhost branch if you revert the previous vhost worker patchset.

The patches allow us to support multiple vhost workers tasks per device.
The design is a modified version of Stefan's original idea where userspace
has the kernel create a worker and we pass back the pid. In this version
instead of passing the pid between user/kernel space we use a worker_id
which is just an integer managed by the vhost driver and we allow
userspace to create and free workers and then attach them to virtqueues
at setup time.

All review comments from the past reviews should be handled. If I didn't
reply to a review comment, I agreed with the comment and should have
handled it in this posting. Let me know if I missed one.

Results:


fio jobs1   2   4   8   12  16
--
1 worker160k   488k -   -   -   -
worker per vq   160k   310k620k1182K1564K   2002k

Notes:
0. This used a simple fio command:

fio --filename=/dev/sdb  --direct=1 --rw=randrw --bs=4k \
--ioengine=libaio --iodepth=128  --numjobs=$JOBS_ABOVE

and I used a VM with 16 vCPUs and 16 virtqueues.

1. The patches were tested with LIO's emulate_pr=0 which drops the
LIO PR lock use. This was a bottleneck at around 12 vqs/jobs.

2. Because we have a hard limit of 1024 cmds, if the num jobs * iodepth
was greater than 1024, I would decrease iodepth. So 12 jobs used 85 cmds,
and 16 used 64.

3. The perf issue above at 2 jobs is because when we only have 1 worker
we execute more cmds per vhost_work due to all vqs funneling to one worker.
I have 2 patches that fix this. One is just submit from the worker thread
instead of kikcing off to a workqueue like how the vhost block patches do.
I'll post this after the worker support is merged. I'm also working on
patches to add back batching during lio completion and do polling on the
submission side.

We will still want the threading patches, because if we batch at the fio
level plus use the vhost theading patches, we can see a big boost like
below. So hopefully doing it at the kernel will allow apps to just work
without having to be smart like fio.

fio using io_uring and batching with the iodepth_batch* settings:

fio jobs1   2   4   8   12  16
-
1 worker494k520k-   -   -   -
worker per vq   496k878k1542k   2436k   2304k   2590k

V9:
- Fix vhost_dev_reset_owner handling. Make sure a virtqueue's
worker is cleared so when the vhost_dev is reused the old worker is
not used.
- Remove old/unused attach ioctl copy to user code.

V8:
- Rebase.
- Move comments about assumptions so it's in the body of the code
instead of top of function so it's more clear.
- Added patch for RCU support and swapping workers while works are
running.

V7:
- Add more comments about assumptions.
- Drop refcounting and just use an attachment_cnt variable, so there
is no confusion about when a worker is freed.
- Do a opt-in model, because vsiock has an issue where it can queue works
before it's running and it doesn't even need multiple workers, so there
is no point in chaning the driver or core code.
- Add back get worker ioctl.
- Broke up last patches to make it easier to read.

V6:
- Rebase against vhost_task patchset.
- Used xa instead of idr.

V5:
- Rebase against user_worker patchset.
- Rebase against flush patchset.
- Redo vhost-scsi tmf flush handling so it doesn't access vq->worker.

V4:
- fix vhost-sock VSOCK_VQ_RX use.
- name functions called directly by ioctl cmd's to match the ioctl cmd.
- break up VHOST_SET_VRING_WORKER into a new, free and attach cmd.
- document worker lifetime, and cgroup, namespace, mm, rlimit
inheritance, make it clear we currently only support sharing within the
device.
- add support to attach workers while IO is running.
- instead of passing a pid_t of the kernel thread, pass a int allocated
by the vhost layer with an idr.

V3:
- fully convert vhost code to use vq based APIs instead of leaving it
half per dev and half per vq.
- rebase against kernel worker API.
- Drop delayed worker creation. We always create the default worker at
VHOST_SET_OWNER time. Userspace can create and bind workers after that.

V2:
- change loop that we take a refcount to the worker in
- replaced pid == -1 with define.
- fixed tabbing/spacing coding style issue
- use hash instead of list to lookup workers.
- I dropped the patch that added an ioctl cmd to get a vq's worker's
pid. I saw we might do a generic netlink interface instead.




___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v9 07/17] vhost: convert poll work to be vq based

2023-06-26 Thread Mike Christie

This has the drivers pass in their poll to vq mapping and then converts
the core poll code to use the vq based helpers. In the next patches we
will allow vqs to be handled by different workers, so to allow drivers
to execute operations like queue, stop, flush, etc on specific polls/vqs
we need to know the mappings.

Signed-off-by: Mike Christie 
---
 drivers/vhost/net.c   | 6 --
 drivers/vhost/vhost.c | 8 +---
 drivers/vhost/vhost.h | 4 +++-
 3 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 98bb957eb3b9..f2ed7167c848 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1347,8 +1347,10 @@ static int vhost_net_open(struct inode *inode, struct 
file *f)
   VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,
   NULL);
 
-   vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, 
dev);
-   vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev);
+   vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev,
+   vqs[VHOST_NET_VQ_TX]);
+   vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev,
+   vqs[VHOST_NET_VQ_RX]);
 
f->private_data = n;
n->page_frag.page = NULL;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2ea107e867a1..b6149ed8acb4 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -187,13 +187,15 @@ EXPORT_SYMBOL_GPL(vhost_work_init);
 
 /* Init poll structure */
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-__poll_t mask, struct vhost_dev *dev)
+__poll_t mask, struct vhost_dev *dev,
+struct vhost_virtqueue *vq)
 {
init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
init_poll_funcptr(&poll->table, vhost_poll_func);
poll->mask = mask;
poll->dev = dev;
poll->wqh = NULL;
+   poll->vq = vq;
 
vhost_work_init(&poll->work, fn);
 }
@@ -297,7 +299,7 @@ EXPORT_SYMBOL_GPL(vhost_vq_has_work);
 
 void vhost_poll_queue(struct vhost_poll *poll)
 {
-   vhost_work_queue(poll->dev, &poll->work);
+   vhost_vq_work_queue(poll->vq, &poll->work);
 }
 EXPORT_SYMBOL_GPL(vhost_poll_queue);
 
@@ -508,7 +510,7 @@ void vhost_dev_init(struct vhost_dev *dev,
vhost_vq_reset(dev, vq);
if (vq->handle_kick)
vhost_poll_init(&vq->poll, vq->handle_kick,
-   EPOLLIN, dev);
+   EPOLLIN, dev, vq);
}
 }
 EXPORT_SYMBOL_GPL(vhost_dev_init);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index f208f9923c88..3882266fbbf3 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -41,13 +41,15 @@ struct vhost_poll {
struct vhost_work   work;
__poll_tmask;
struct vhost_dev*dev;
+   struct vhost_virtqueue  *vq;
 };
 
 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
 bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-__poll_t mask, struct vhost_dev *dev);
+__poll_t mask, struct vhost_dev *dev,
+struct vhost_virtqueue *vq);
 int vhost_poll_start(struct vhost_poll *poll, struct file *file);
 void vhost_poll_stop(struct vhost_poll *poll);
 void vhost_poll_queue(struct vhost_poll *poll);
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [syzbot] [net?] [virt?] [kvm?] KASAN: slab-use-after-free Read in __vhost_vq_attach_worker

2023-06-26 Thread Mike Christie

On 6/26/23 2:15 AM, Michael S. Tsirkin wrote:
> On Mon, Jun 26, 2023 at 12:06:54AM -0700, syzbot wrote:
>> Hello,
>>
>> syzbot found the following issue on:
>>
>> HEAD commit:8d2be868b42c Add linux-next specific files for 20230623
>> git tree:   linux-next
>> console+strace: https://syzkaller.appspot.com/x/log.txt?x=12872950a8
>> kernel config:  https://syzkaller.appspot.com/x/.config?x=d8ac8dd33677e8e0
>> dashboard link: https://syzkaller.appspot.com/bug?extid=8540db210d403f1aa214
>> compiler:   gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils 
>> for Debian) 2.35.2
>> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=15c1b70f28
>> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=122ee4cb28
>>
>> Downloadable assets:
>> disk image: 
>> https://storage.googleapis.com/syzbot-assets/2a004483aca3/disk-8d2be868.raw.xz
>> vmlinux: 
>> https://storage.googleapis.com/syzbot-assets/5688cb13b277/vmlinux-8d2be868.xz
>> kernel image: 
>> https://storage.googleapis.com/syzbot-assets/76de0b63bc53/bzImage-8d2be868.xz
>>
>> The issue was bisected to:
>>
>> commit 21a18f4a51896fde11002165f0e7340f4131d6a0
>> Author: Mike Christie 
>> Date:   Tue Jun 13 01:32:46 2023 +
>>
>> vhost: allow userspace to create workers
>>
>> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=130850bf28
>> final oops: https://syzkaller.appspot.com/x/report.txt?x=108850bf28
>> console output: https://syzkaller.appspot.com/x/log.txt?x=170850bf28
>>
>> IMPORTANT: if you fix the issue, please add the following tag to the commit:
>> Reported-by: syzbot+8540db210d403f1aa...@syzkaller.appspotmail.com
>> Fixes: 21a18f4a5189 ("vhost: allow userspace to create workers")
> 
> Mike, would appreciate prompt attention to this as I am preparing
> a pull request for the merge window and need to make a
> decision on whether to include your userspace-controlled
> threading patchset.
> 

Do you want me to resubmit the patchset or submit a patch against your vhost
branch?

The bug is that vhost-net can call vhost_dev_reset_owner and that will
free the workers. However, I leave the virtqueue->worker pointer set so
we end up referencing the freed workers later on. When I handled a
review comment between v5 and v6, I deleted that code thinking it was
also not needed.

So the fix is:

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index ab79b064aade..5a07e220e46d 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -607,6 +607,10 @@ static void vhost_workers_free(struct vhost_dev *dev)
 
if (!dev->use_worker)
return;
+
+   for (i = 0; i < dev->nvqs; i++)
+   rcu_assign_pointer(dev->vqs[i]->worker, NULL);
+
/*
 * Free the default worker we created and cleanup workers userspace
 * created but couldn't clean up (it forgot or crashed).



___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: Can vhost translate to io_uring?

2023-06-14 Thread Mike Christie

On 6/14/23 1:25 AM, michael.chris...@oracle.com wrote:
> It would be nice if the vhost layer could use the io-wq code as sort of
> generic worker. I can look into what that would take if Jens is ok
> with that type of thing.

We could use the io-wq code, but we hit the same problems as before:

1. We still need to modify the vhost drivers like I mentioned below so when
the task gets SIGKILL the drivers fail instead of running the work like
normal.

2. We still need some code like the patch below so when the worker task
exits and is freed the vhost drivers stop calling io_wq_enqueue and
don't access the io_wq anymore.

3. There's some other small things which seem easy to change like we need
to create the worker thread/task_struct when io_wq_create is run instead of
io_wq_enqueue. The problem is that we can queue work from threads that
have different mms than we want to use.

I've done #2 in the patch below. I'm almost done with #1. Just testing it
now. When that's done, we can remove the signal hacks and then decide if we
want to go further and switch to io-wq.

> 
> For vhost, I just submitted a patch to the vhost layer that allow us to
> switch out the vhost worker thread when IO is running:
> 
> https://lists.linuxfoundation.org/pipermail/virtualization/2023-June/067246.html
> 
> After that patch, I just need to modify vhost_worker/vhost_task_fn so
> when get_signal returns true we set the worker to NULL and do a 
> synchronize_rcu.
> Then I just need to modify vhost-scsi so it detects when the worker is NULL
> and modify the flush code so it handles work that didn't get to run.
> 
> 
> 

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH v8 17/17] vhost: Allow worker switching while work is queueing

2023-06-12 Thread Mike Christie

On 6/12/23 8:32 PM, Mike Christie wrote:
> This patch drops the requirement that we can only switch workers if work
> has not been queued by using RCU for the vq based queueing paths and a
> mutex for the device wide flush.
> 
> We can also use this to support SIGKILL properly in the future where we
> should exit almost immediately after getting that signal. With this
> patch, when get_signal returns true, we can set the vq->worker to NULL
> and do a synchronize_rcu to prevent new work from being queued to the
> vhost_task that has been killed.
> 

Hey Jason and Stefano, just an update on why we now have this extra
patch.

Jason, in one of the last reviews you were asking about supporting
switching workers when works are queueing and I had said it's probably
not worth it, because it adds come complexity that might not be used.

Stefano, it sounded like you preferred RCU to handle when we are
adding the initial worker while vsock is possibly queueing works.

It turns out the signal/fork developers added some hacks to their code
for the vhost_task patches to support SIGKILL but I think they want us to
eventually remove them. To do that, we need a way to handle where the
vhost_worker/vhost_task exits while work is being queued. To do this we
need the same as the above (instead of a new worker it would be NULL
though).

So to handle all these requests, I ended up adding this last patch.
I wasn't sure if you wanted it to go in at the same time as the
initial multiple worker patches since it changes the behavior the
interface can support or separate when I fix up the SIGKILL code.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v8 14/17] vhost: replace single worker pointer with xarray

2023-06-12 Thread Mike Christie

The next patch allows userspace to create multiple workers per device,
so this patch replaces the vhost_worker pointer with an xarray so we
can store mupltiple workers and look them up.

Signed-off-by: Mike Christie 
---
 drivers/vhost/vhost.c | 61 +++
 drivers/vhost/vhost.h |  3 ++-
 2 files changed, 47 insertions(+), 17 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 0db093dfaa22..f07f92a67e59 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -280,7 +280,11 @@ EXPORT_SYMBOL_GPL(vhost_vq_flush);
 
 void vhost_dev_flush(struct vhost_dev *dev)
 {
-   vhost_worker_flush(dev->worker);
+   struct vhost_worker *worker;
+   unsigned long i;
+
+   xa_for_each(&dev->worker_xa, i, worker)
+   vhost_worker_flush(worker);
 }
 EXPORT_SYMBOL_GPL(vhost_dev_flush);
 
@@ -481,7 +485,6 @@ void vhost_dev_init(struct vhost_dev *dev,
dev->umem = NULL;
dev->iotlb = NULL;
dev->mm = NULL;
-   dev->worker = NULL;
dev->iov_limit = iov_limit;
dev->weight = weight;
dev->byte_weight = byte_weight;
@@ -491,7 +494,7 @@ void vhost_dev_init(struct vhost_dev *dev,
INIT_LIST_HEAD(&dev->read_list);
INIT_LIST_HEAD(&dev->pending_list);
spin_lock_init(&dev->iotlb_lock);
-
+   xa_init_flags(&dev->worker_xa, XA_FLAGS_ALLOC);
 
for (i = 0; i < dev->nvqs; ++i) {
vq = dev->vqs[i];
@@ -554,15 +557,32 @@ static void vhost_detach_mm(struct vhost_dev *dev)
dev->mm = NULL;
 }
 
-static void vhost_worker_free(struct vhost_dev *dev)
+static void vhost_worker_destroy(struct vhost_dev *dev,
+struct vhost_worker *worker)
 {
-   if (!dev->worker)
+   if (!worker)
return;
 
-   WARN_ON(!llist_empty(&dev->worker->work_list));
-   vhost_task_stop(dev->worker->vtsk);
-   kfree(dev->worker);
-   dev->worker = NULL;
+   WARN_ON(!llist_empty(&worker->work_list));
+   xa_erase(&dev->worker_xa, worker->id);
+   vhost_task_stop(worker->vtsk);
+   kfree(worker);
+}
+
+static void vhost_workers_free(struct vhost_dev *dev)
+{
+   struct vhost_worker *worker;
+   unsigned long i;
+
+   if (!dev->use_worker)
+   return;
+   /*
+* Free the default worker we created and cleanup workers userspace
+* created but couldn't clean up (it forgot or crashed).
+*/
+   xa_for_each(&dev->worker_xa, i, worker)
+   vhost_worker_destroy(dev, worker);
+   xa_destroy(&dev->worker_xa);
 }
 
 static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
@@ -570,6 +590,8 @@ static struct vhost_worker *vhost_worker_create(struct 
vhost_dev *dev)
struct vhost_worker *worker;
struct vhost_task *vtsk;
char name[TASK_COMM_LEN];
+   int ret;
+   u32 id;
 
worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
if (!worker)
@@ -584,16 +606,18 @@ static struct vhost_worker *vhost_worker_create(struct 
vhost_dev *dev)
init_llist_head(&worker->work_list);
worker->kcov_handle = kcov_common_handle();
worker->vtsk = vtsk;
-   /*
-* vsock can already try to queue so make sure llist and vtsk are both
-* set before vhost_work_queue sees dev->worker is set.
-*/
-   smp_wmb();
-   dev->worker = worker;
 
vhost_task_start(vtsk);
+
+   ret = xa_alloc(&dev->worker_xa, &id, worker, xa_limit_32b, GFP_KERNEL);
+   if (ret < 0)
+   goto stop_worker;
+   worker->id = id;
+
return worker;
 
+stop_worker:
+   vhost_task_stop(vtsk);
 free_worker:
kfree(worker);
return NULL;
@@ -650,6 +674,11 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
err = -ENOMEM;
goto err_worker;
}
+   /*
+* vsock can already try to queue so make sure the worker
+* is setup before vhost_vq_work_queue sees vq->worker is set.
+*/
+   smp_wmb();
 
for (i = 0; i < dev->nvqs; i++)
dev->vqs[i]->worker = worker;
@@ -751,7 +780,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
dev->iotlb = NULL;
vhost_clear_msg(dev);
wake_up_interruptible_poll(&dev->wait, EPOLLIN | EPOLLRDNORM);
-   vhost_worker_free(dev);
+   vhost_workers_free(dev);
vhost_detach_mm(dev);
 }
 EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index b850f534bc9a..31937e98c01b 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -30,6 +30,7 @@ struct vhost_worker {

[PATCH v8 17/17] vhost: Allow worker switching while work is queueing

2023-06-12 Thread Mike Christie

This patch drops the requirement that we can only switch workers if work
has not been queued by using RCU for the vq based queueing paths and a
mutex for the device wide flush.

We can also use this to support SIGKILL properly in the future where we
should exit almost immediately after getting that signal. With this
patch, when get_signal returns true, we can set the vq->worker to NULL
and do a synchronize_rcu to prevent new work from being queued to the
vhost_task that has been killed.

Signed-off-by: Mike Christie 
---
 drivers/vhost/vhost.c  | 151 +++--
 drivers/vhost/vhost.h  |   4 +-
 include/uapi/linux/vhost.h |   4 +-
 3 files changed, 114 insertions(+), 45 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a4f97b56f1b4..ab79b064aade 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -233,16 +233,9 @@ void vhost_poll_stop(struct vhost_poll *poll)
 }
 EXPORT_SYMBOL_GPL(vhost_poll_stop);
 
-static bool vhost_worker_queue(struct vhost_worker *worker,
+static void vhost_worker_queue(struct vhost_worker *worker,
   struct vhost_work *work)
 {
-   if (!worker)
-   return false;
-   /*
-* vsock can queue while we do a VHOST_SET_OWNER, so we have a smp_wmb
-* when setting up the worker. We don't have a smp_rmb here because
-* test_and_set_bit gives us a mb already.
-*/
if (!test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) {
/* We can only add the work to the list after we're
 * sure it was not in the list.
@@ -251,47 +244,85 @@ static bool vhost_worker_queue(struct vhost_worker 
*worker,
llist_add(&work->node, &worker->work_list);
vhost_task_wake(worker->vtsk);
}
-
-   return true;
 }
 
 bool vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work)
 {
-   return vhost_worker_queue(vq->worker, work);
+   struct vhost_worker *worker;
+   bool queued = false;
+
+   rcu_read_lock();
+   worker = rcu_dereference(vq->worker);
+   if (worker) {
+   queued = true;
+   vhost_worker_queue(worker, work);
+   }
+   rcu_read_unlock();
+
+   return queued;
 }
 EXPORT_SYMBOL_GPL(vhost_vq_work_queue);
 
-static void vhost_worker_flush(struct vhost_worker *worker)
+void vhost_vq_flush(struct vhost_virtqueue *vq)
 {
struct vhost_flush_struct flush;
 
init_completion(&flush.wait_event);
vhost_work_init(&flush.work, vhost_flush_work);
 
-   if (vhost_worker_queue(worker, &flush.work))
+   if (vhost_vq_work_queue(vq, &flush.work))
wait_for_completion(&flush.wait_event);
 }
+EXPORT_SYMBOL_GPL(vhost_vq_flush);
 
-void vhost_vq_flush(struct vhost_virtqueue *vq)
+/**
+ * vhost_worker_flush - flush a worker
+ * @worker: worker to flush
+ *
+ * This does not use RCU to protect the worker, so the device or worker
+ * mutex must be held.
+ */
+static void vhost_worker_flush(struct vhost_worker *worker)
 {
-   vhost_worker_flush(vq->worker);
+   struct vhost_flush_struct flush;
+
+   init_completion(&flush.wait_event);
+   vhost_work_init(&flush.work, vhost_flush_work);
+
+   vhost_worker_queue(worker, &flush.work);
+   wait_for_completion(&flush.wait_event);
 }
-EXPORT_SYMBOL_GPL(vhost_vq_flush);
 
 void vhost_dev_flush(struct vhost_dev *dev)
 {
struct vhost_worker *worker;
unsigned long i;
 
-   xa_for_each(&dev->worker_xa, i, worker)
+   xa_for_each(&dev->worker_xa, i, worker) {
+   mutex_lock(&worker->mutex);
+   if (!worker->attachment_cnt) {
+   mutex_unlock(&worker->mutex);
+   continue;
+   }
vhost_worker_flush(worker);
+   mutex_unlock(&worker->mutex);
+   }
 }
 EXPORT_SYMBOL_GPL(vhost_dev_flush);
 
 /* A lockless hint for busy polling code to exit the loop */
 bool vhost_vq_has_work(struct vhost_virtqueue *vq)
 {
-   return !llist_empty(&vq->worker->work_list);
+   struct vhost_worker *worker;
+   bool has_work = false;
+
+   rcu_read_lock();
+   worker = rcu_dereference(vq->worker);
+   if (worker && !llist_empty(&worker->work_list))
+   has_work = true;
+   rcu_read_unlock();
+
+   return has_work;
 }
 EXPORT_SYMBOL_GPL(vhost_vq_has_work);
 
@@ -501,7 +532,7 @@ void vhost_dev_init(struct vhost_dev *dev,
vq->log = NULL;
vq->indirect = NULL;
vq->heads = NULL;
-   vq->worker = NULL;
+   rcu_assign_pointer(vq->worker, NULL);
vq->dev = dev;
mutex_init(&vq->mutex);

[PATCH v8 16/17] vhost_scsi: add support for worker ioctls

2023-06-12 Thread Mike Christie

This has vhost-scsi support the worker ioctls by calling the
vhost_worker_ioctl helper.

With a single worker, the single thread becomes a bottlneck when trying
to use 3 or more virtqueues like:

fio --filename=/dev/sdb  --direct=1 --rw=randrw --bs=4k \
--ioengine=libaio --iodepth=128  --numjobs=3

With the patches and doing a worker per vq, we can scale to at least
16 vCPUs/vqs (that's my system limit) with the same command fio command
above with numjobs=16:

fio --filename=/dev/sdb  --direct=1 --rw=randrw --bs=4k \
--ioengine=libaio --iodepth=64  --numjobs=16

which gives around 2002K IOPs.

Note that for testing I dropped depth to 64 above because the vhost/virt
layer supports only 1024 total commands per device. And the only tuning I
did was set LIO's emulate_pr to 0 to avoid LIO's PR lock in the main IO
path which becomes an issue at around 12 jobs/virtqueues.

Signed-off-by: Mike Christie 
---
 drivers/vhost/scsi.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 2c3cf487cc71..c83f7f043470 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1927,6 +1927,14 @@ vhost_scsi_ioctl(struct file *f,
if (copy_from_user(&features, featurep, sizeof features))
return -EFAULT;
return vhost_scsi_set_features(vs, features);
+   case VHOST_NEW_WORKER:
+   case VHOST_FREE_WORKER:
+   case VHOST_ATTACH_VRING_WORKER:
+   case VHOST_GET_VRING_WORKER:
+   mutex_lock(&vs->dev.mutex);
+   r = vhost_worker_ioctl(&vs->dev, ioctl, argp);
+   mutex_unlock(&vs->dev.mutex);
+   return r;
default:
mutex_lock(&vs->dev.mutex);
r = vhost_dev_ioctl(&vs->dev, ioctl, argp);
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v8 10/17] vhost_scsi: convert to vhost_vq_work_queue

2023-06-12 Thread Mike Christie

Convert from vhost_work_queue to vhost_vq_work_queue so we can
remove vhost_work_queue.

Signed-off-by: Mike Christie 
---
 drivers/vhost/scsi.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index a77c53bb035a..1668009bd489 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -353,8 +353,9 @@ static void vhost_scsi_release_cmd(struct se_cmd *se_cmd)
if (se_cmd->se_cmd_flags & SCF_SCSI_TMR_CDB) {
struct vhost_scsi_tmf *tmf = container_of(se_cmd,
struct vhost_scsi_tmf, se_cmd);
+   struct vhost_virtqueue *vq = &tmf->svq->vq;
 
-   vhost_work_queue(&tmf->vhost->dev, &tmf->vwork);
+   vhost_vq_work_queue(vq, &tmf->vwork);
} else {
struct vhost_scsi_cmd *cmd = container_of(se_cmd,
struct vhost_scsi_cmd, tvc_se_cmd);
@@ -1332,11 +1333,9 @@ static void vhost_scsi_ctl_handle_kick(struct vhost_work 
*work)
 }
 
 static void
-vhost_scsi_send_evt(struct vhost_scsi *vs,
-  struct vhost_scsi_tpg *tpg,
-  struct se_lun *lun,
-  u32 event,
-  u32 reason)
+vhost_scsi_send_evt(struct vhost_scsi *vs, struct vhost_virtqueue *vq,
+   struct vhost_scsi_tpg *tpg, struct se_lun *lun,
+   u32 event, u32 reason)
 {
struct vhost_scsi_evt *evt;
 
@@ -1358,7 +1357,7 @@ vhost_scsi_send_evt(struct vhost_scsi *vs,
}
 
llist_add(&evt->list, &vs->vs_event_list);
-   vhost_work_queue(&vs->dev, &vs->vs_event_work);
+   vhost_vq_work_queue(vq, &vs->vs_event_work);
 }
 
 static void vhost_scsi_evt_handle_kick(struct vhost_work *work)
@@ -1372,7 +1371,8 @@ static void vhost_scsi_evt_handle_kick(struct vhost_work 
*work)
goto out;
 
if (vs->vs_events_missed)
-   vhost_scsi_send_evt(vs, NULL, NULL, VIRTIO_SCSI_T_NO_EVENT, 0);
+   vhost_scsi_send_evt(vs, vq, NULL, NULL, VIRTIO_SCSI_T_NO_EVENT,
+   0);
 out:
mutex_unlock(&vq->mutex);
 }
@@ -1991,7 +1991,7 @@ vhost_scsi_do_plug(struct vhost_scsi_tpg *tpg,
goto unlock;
 
if (vhost_has_feature(vq, VIRTIO_SCSI_F_HOTPLUG))
-   vhost_scsi_send_evt(vs, tpg, lun,
+   vhost_scsi_send_evt(vs, vq, tpg, lun,
   VIRTIO_SCSI_T_TRANSPORT_RESET, reason);
 unlock:
mutex_unlock(&vq->mutex);
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v8 02/17] vhost: dynamically allocate vhost_worker

2023-06-12 Thread Mike Christie

This patchset allows us to allocate multiple workers, so this has us
move from the vhost_worker that's embedded in the vhost_dev to
dynamically allocating it.

Signed-off-by: Mike Christie 
---
 drivers/vhost/vhost.c | 66 ---
 drivers/vhost/vhost.h |  4 +--
 2 files changed, 45 insertions(+), 25 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 82966ffb4a5c..dfd96cf6a152 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -235,36 +235,40 @@ void vhost_dev_flush(struct vhost_dev *dev)
 {
struct vhost_flush_struct flush;
 
-   if (dev->worker.vtsk) {
-   init_completion(&flush.wait_event);
-   vhost_work_init(&flush.work, vhost_flush_work);
+   init_completion(&flush.wait_event);
+   vhost_work_init(&flush.work, vhost_flush_work);
 
-   vhost_work_queue(dev, &flush.work);
+   if (vhost_work_queue(dev, &flush.work))
wait_for_completion(&flush.wait_event);
-   }
 }
 EXPORT_SYMBOL_GPL(vhost_dev_flush);
 
-void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
+bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 {
-   if (!dev->worker.vtsk)
-   return;
-
+   if (!dev->worker)
+   return false;
+   /*
+* vsock can queue while we do a VHOST_SET_OWNER, so we have a smp_wmb
+* when setting up the worker. We don't have a smp_rmb here because
+* test_and_set_bit gives us a mb already.
+*/
if (!test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) {
/* We can only add the work to the list after we're
 * sure it was not in the list.
 * test_and_set_bit() implies a memory barrier.
 */
-   llist_add(&work->node, &dev->worker.work_list);
-   vhost_task_wake(dev->worker.vtsk);
+   llist_add(&work->node, &dev->worker->work_list);
+   vhost_task_wake(dev->worker->vtsk);
}
+
+   return true;
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
 
 /* A lockless hint for busy polling code to exit the loop */
 bool vhost_has_work(struct vhost_dev *dev)
 {
-   return !llist_empty(&dev->worker.work_list);
+   return !llist_empty(&dev->worker->work_list);
 }
 EXPORT_SYMBOL_GPL(vhost_has_work);
 
@@ -458,8 +462,7 @@ void vhost_dev_init(struct vhost_dev *dev,
dev->umem = NULL;
dev->iotlb = NULL;
dev->mm = NULL;
-   memset(&dev->worker, 0, sizeof(dev->worker));
-   init_llist_head(&dev->worker.work_list);
+   dev->worker = NULL;
dev->iov_limit = iov_limit;
dev->weight = weight;
dev->byte_weight = byte_weight;
@@ -533,30 +536,47 @@ static void vhost_detach_mm(struct vhost_dev *dev)
 
 static void vhost_worker_free(struct vhost_dev *dev)
 {
-   if (!dev->worker.vtsk)
+   if (!dev->worker)
return;
 
-   WARN_ON(!llist_empty(&dev->worker.work_list));
-   vhost_task_stop(dev->worker.vtsk);
-   dev->worker.kcov_handle = 0;
-   dev->worker.vtsk = NULL;
+   WARN_ON(!llist_empty(&dev->worker->work_list));
+   vhost_task_stop(dev->worker->vtsk);
+   kfree(dev->worker);
+   dev->worker = NULL;
 }
 
 static int vhost_worker_create(struct vhost_dev *dev)
 {
+   struct vhost_worker *worker;
struct vhost_task *vtsk;
char name[TASK_COMM_LEN];
 
+   worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
+   if (!worker)
+   return -ENOMEM;
+
snprintf(name, sizeof(name), "vhost-%d", current->pid);
 
-   vtsk = vhost_task_create(vhost_worker, &dev->worker, name);
+   vtsk = vhost_task_create(vhost_worker, worker, name);
if (!vtsk)
-   return -ENOMEM;
+   goto free_worker;
+
+   init_llist_head(&worker->work_list);
+   worker->kcov_handle = kcov_common_handle();
+   worker->vtsk = vtsk;
+   /*
+* vsock can already try to queue so make sure llist and vtsk are both
+* set before vhost_work_queue sees dev->worker is set.
+*/
+   smp_wmb();
+   dev->worker = worker;
 
-   dev->worker.kcov_handle = kcov_common_handle();
-   dev->worker.vtsk = vtsk;
vhost_task_start(vtsk);
return 0;
+
+free_worker:
+   kfree(worker);
+   return -ENOMEM;
 }
 
 /* Caller should have device mutex */
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index fc900be504b3..cb872cc4157a 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -44,7 +44,7 @@ struct vhost_poll {
 };
 
 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
-void vhost_w

[PATCH v8 11/17] vhost_scsi: flush IO vqs then send TMF rsp

2023-06-12 Thread Mike Christie

With one worker we will always send the scsi cmd responses then send the
TMF rsp, because LIO will always complete the scsi cmds first then call
into us to send the TMF response.

With multiple workers, the IO vq workers could be running while the
TMF/ctl vq worker is running so this has us do a flush before completing
the TMF to make sure cmds are completed when it's work is later queued
and run.

Signed-off-by: Mike Christie 
---
 drivers/vhost/scsi.c | 21 ++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 1668009bd489..2c3cf487cc71 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1133,12 +1133,27 @@ static void vhost_scsi_tmf_resp_work(struct vhost_work 
*work)
 {
struct vhost_scsi_tmf *tmf = container_of(work, struct vhost_scsi_tmf,
  vwork);
-   int resp_code;
+   struct vhost_virtqueue *ctl_vq, *vq;
+   int resp_code, i;
+
+   if (tmf->scsi_resp == TMR_FUNCTION_COMPLETE) {
+   /*
+* Flush IO vqs that don't share a worker with the ctl to make
+* sure they have sent their responses before us.
+*/
+   ctl_vq = &tmf->vhost->vqs[VHOST_SCSI_VQ_CTL].vq;
+   for (i = VHOST_SCSI_VQ_IO; i < tmf->vhost->dev.nvqs; i++) {
+   vq = &tmf->vhost->vqs[i].vq;
+
+   if (vhost_vq_is_setup(vq) &&
+   vq->worker != ctl_vq->worker)
+   vhost_vq_flush(vq);
+   }
 
-   if (tmf->scsi_resp == TMR_FUNCTION_COMPLETE)
resp_code = VIRTIO_SCSI_S_FUNCTION_SUCCEEDED;
-   else
+   } else {
resp_code = VIRTIO_SCSI_S_FUNCTION_REJECTED;
+   }
 
vhost_scsi_send_tmf_resp(tmf->vhost, &tmf->svq->vq, tmf->in_iovs,
 tmf->vq_desc, &tmf->resp_iov, resp_code);
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v8 07/17] vhost: convert poll work to be vq based

2023-06-12 Thread Mike Christie

This has the drivers pass in their poll to vq mapping and then converts
the core poll code to use the vq based helpers. In the next patches we
will allow vqs to be handled by different workers, so to allow drivers
to execute operations like queue, stop, flush, etc on specific polls/vqs
we need to know the mappings.

Signed-off-by: Mike Christie 
---
 drivers/vhost/net.c   | 6 --
 drivers/vhost/vhost.c | 8 +---
 drivers/vhost/vhost.h | 4 +++-
 3 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 98bb957eb3b9..f2ed7167c848 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1347,8 +1347,10 @@ static int vhost_net_open(struct inode *inode, struct 
file *f)
   VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,
   NULL);
 
-   vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, 
dev);
-   vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev);
+   vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, dev,
+   vqs[VHOST_NET_VQ_TX]);
+   vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev,
+   vqs[VHOST_NET_VQ_RX]);
 
f->private_data = n;
n->page_frag.page = NULL;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index d212cee064d6..9623264bb3a7 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -187,13 +187,15 @@ EXPORT_SYMBOL_GPL(vhost_work_init);
 
 /* Init poll structure */
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-__poll_t mask, struct vhost_dev *dev)
+__poll_t mask, struct vhost_dev *dev,
+struct vhost_virtqueue *vq)
 {
init_waitqueue_func_entry(&poll->wait, vhost_poll_wakeup);
init_poll_funcptr(&poll->table, vhost_poll_func);
poll->mask = mask;
poll->dev = dev;
poll->wqh = NULL;
+   poll->vq = vq;
 
vhost_work_init(&poll->work, fn);
 }
@@ -297,7 +299,7 @@ EXPORT_SYMBOL_GPL(vhost_vq_has_work);
 
 void vhost_poll_queue(struct vhost_poll *poll)
 {
-   vhost_work_queue(poll->dev, &poll->work);
+   vhost_vq_work_queue(poll->vq, &poll->work);
 }
 EXPORT_SYMBOL_GPL(vhost_poll_queue);
 
@@ -508,7 +510,7 @@ void vhost_dev_init(struct vhost_dev *dev,
vhost_vq_reset(dev, vq);
if (vq->handle_kick)
vhost_poll_init(&vq->poll, vq->handle_kick,
-   EPOLLIN, dev);
+   EPOLLIN, dev, vq);
}
 }
 EXPORT_SYMBOL_GPL(vhost_dev_init);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index ac1f4924e548..f44eecd4fcf9 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -41,13 +41,15 @@ struct vhost_poll {
struct vhost_work   work;
__poll_tmask;
struct vhost_dev*dev;
+   struct vhost_virtqueue  *vq;
 };
 
 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
 bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
-__poll_t mask, struct vhost_dev *dev);
+__poll_t mask, struct vhost_dev *dev,
+struct vhost_virtqueue *vq);
 int vhost_poll_start(struct vhost_poll *poll, struct file *file);
 void vhost_poll_stop(struct vhost_poll *poll);
 void vhost_poll_queue(struct vhost_poll *poll);
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v8 13/17] vhost: add helper to parse userspace vring state/file

2023-06-12 Thread Mike Christie

The next patches add new vhost worker ioctls which will need to get a
vhost_virtqueue from a userspace struct which specifies the vq's index.
This moves the vhost_vring_ioctl code to do this to a helper so it can
be shared.

Signed-off-by: Mike Christie 
---
 drivers/vhost/vhost.c | 29 ++---
 1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 176d5fcd4b58..0db093dfaa22 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -599,6 +599,27 @@ static struct vhost_worker *vhost_worker_create(struct 
vhost_dev *dev)
return NULL;
 }
 
+static int vhost_get_vq_from_user(struct vhost_dev *dev, void __user *argp,
+ struct vhost_virtqueue **vq, u32 *id)
+{
+   u32 __user *idxp = argp;
+   u32 idx;
+   long r;
+
+   r = get_user(idx, idxp);
+   if (r < 0)
+   return r;
+
+   if (idx >= dev->nvqs)
+   return -ENOBUFS;
+
+   idx = array_index_nospec(idx, dev->nvqs);
+
+   *vq = dev->vqs[idx];
+   *id = idx;
+   return 0;
+}
+
 /* Caller should have device mutex */
 long vhost_dev_set_owner(struct vhost_dev *dev)
 {
@@ -1618,21 +1639,15 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned 
int ioctl, void __user *arg
struct file *eventfp, *filep = NULL;
bool pollstart = false, pollstop = false;
struct eventfd_ctx *ctx = NULL;
-   u32 __user *idxp = argp;
struct vhost_virtqueue *vq;
struct vhost_vring_state s;
struct vhost_vring_file f;
u32 idx;
long r;
 
-   r = get_user(idx, idxp);
+   r = vhost_get_vq_from_user(d, argp, &vq, &idx);
if (r < 0)
return r;
-   if (idx >= d->nvqs)
-   return -ENOBUFS;
-
-   idx = array_index_nospec(idx, d->nvqs);
-   vq = d->vqs[idx];
 
if (ioctl == VHOST_SET_VRING_NUM ||
ioctl == VHOST_SET_VRING_ADDR) {
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v8 04/17] vhost, vhost_net: add helper to check if vq has work

2023-06-12 Thread Mike Christie

In the next patches each vq might have different workers so one could
have work but others do not. For net, we only want to check specific vqs,
so this adds a helper to check if a vq has work pending and converts
vhost-net to use it.

Signed-off-by: Mike Christie 
Acked-by: Jason Wang 
---
 drivers/vhost/net.c   | 2 +-
 drivers/vhost/vhost.c | 6 +++---
 drivers/vhost/vhost.h | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index ae2273196b0c..98bb957eb3b9 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -546,7 +546,7 @@ static void vhost_net_busy_poll(struct vhost_net *net,
endtime = busy_clock() + busyloop_timeout;
 
while (vhost_can_busy_poll(endtime)) {
-   if (vhost_has_work(&net->dev)) {
+   if (vhost_vq_has_work(vq)) {
*busyloop_intr = true;
break;
}
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index c78c15af97d3..a832ca692eb1 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -266,11 +266,11 @@ bool vhost_work_queue(struct vhost_dev *dev, struct 
vhost_work *work)
 EXPORT_SYMBOL_GPL(vhost_work_queue);
 
 /* A lockless hint for busy polling code to exit the loop */
-bool vhost_has_work(struct vhost_dev *dev)
+bool vhost_vq_has_work(struct vhost_virtqueue *vq)
 {
-   return !llist_empty(&dev->worker->work_list);
+   return !llist_empty(&vq->worker->work_list);
 }
-EXPORT_SYMBOL_GPL(vhost_has_work);
+EXPORT_SYMBOL_GPL(vhost_vq_has_work);
 
 void vhost_poll_queue(struct vhost_poll *poll)
 {
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 206617edb2a9..37c183b37c42 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -45,7 +45,6 @@ struct vhost_poll {
 
 void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
 bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
-bool vhost_has_work(struct vhost_dev *dev);
 
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
 __poll_t mask, struct vhost_dev *dev);
@@ -199,6 +198,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *,
  struct vhost_log *log, unsigned int *log_num);
 void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
 
+bool vhost_vq_has_work(struct vhost_virtqueue *vq);
 bool vhost_vq_is_setup(struct vhost_virtqueue *vq);
 int vhost_vq_init_access(struct vhost_virtqueue *);
 int vhost_add_used(struct vhost_virtqueue *, unsigned int head, int len);
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v8 08/17] vhost_sock: convert to vhost_vq_work_queue

2023-06-12 Thread Mike Christie

Convert from vhost_work_queue to vhost_vq_work_queue, so we can drop
vhost_work_queue.

Signed-off-by: Mike Christie 
---
 drivers/vhost/vsock.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 6578db78f0ae..817d377a3f36 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -285,7 +285,7 @@ vhost_transport_send_pkt(struct sk_buff *skb)
atomic_inc(&vsock->queued_replies);
 
virtio_vsock_skb_queue_tail(&vsock->send_pkt_queue, skb);
-   vhost_work_queue(&vsock->dev, &vsock->send_pkt_work);
+   vhost_vq_work_queue(&vsock->vqs[VSOCK_VQ_RX], &vsock->send_pkt_work);
 
rcu_read_unlock();
return len;
@@ -583,7 +583,7 @@ static int vhost_vsock_start(struct vhost_vsock *vsock)
/* Some packets may have been queued before the device was started,
 * let's kick the send worker to send them.
 */
-   vhost_work_queue(&vsock->dev, &vsock->send_pkt_work);
+   vhost_vq_work_queue(&vsock->vqs[VSOCK_VQ_RX], &vsock->send_pkt_work);
 
mutex_unlock(&vsock->dev.mutex);
return 0;
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v8 15/17] vhost: allow userspace to create workers

2023-06-12 Thread Mike Christie

For vhost-scsi with 3 vqs or more and a workload that tries to use
them in parallel like:

fio --filename=/dev/sdb  --direct=1 --rw=randrw --bs=4k \
--ioengine=libaio --iodepth=128  --numjobs=3

the single vhost worker thread will become a bottlneck and we are stuck
at around 500K IOPs no matter how many jobs, virtqueues, and CPUs are
used.

To better utilize virtqueues and available CPUs, this patch allows
userspace to create workers and bind them to vqs. You can have N workers
per dev and also share N workers with M vqs on that dev.

This patch adds the interface related code and the next patch will hook
vhost-scsi into it. The patches do not try to hook net and vsock into
the interface because:

1. multiple workers don't seem to help vsock. The problem is that with
only 2 virtqueues we never fully use the existing worker when doing
bidirectional tests. This seems to match vhost-scsi where we don't see
the worker as a bottleneck until 3 virtqueues are used.

2. net already has a way to use multiple workers.

Signed-off-by: Mike Christie 
---
 drivers/vhost/vhost.c| 141 ++-
 drivers/vhost/vhost.h|   3 +
 include/uapi/linux/vhost.h   |  33 
 include/uapi/linux/vhost_types.h |  16 
 4 files changed, 192 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index f07f92a67e59..a4f97b56f1b4 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -623,6 +623,76 @@ static struct vhost_worker *vhost_worker_create(struct 
vhost_dev *dev)
return NULL;
 }
 
+/* Caller must have device mutex */
+static void __vhost_vq_attach_worker(struct vhost_virtqueue *vq,
+struct vhost_worker *worker)
+{
+   if (vq->worker)
+   vq->worker->attachment_cnt--;
+   worker->attachment_cnt++;
+   vq->worker = worker;
+}
+
+ /* Caller must have device and virtqueue mutex */
+static int vhost_vq_attach_worker(struct vhost_virtqueue *vq,
+ struct vhost_vring_worker *info)
+{
+   unsigned long index = info->worker_id;
+   struct vhost_dev *dev = vq->dev;
+   struct vhost_worker *worker;
+
+   if (!dev->use_worker)
+   return -EINVAL;
+
+   /*
+* We only allow userspace to set a virtqueue's worker if it's not
+* active and polling is not enabled. We also assume drivers
+* supporting this will not be internally queueing works directly or
+* via calls like vhost_dev_flush at this time.
+*/
+   if (vhost_vq_get_backend(vq) || vq->kick)
+   return -EBUSY;
+
+   worker = xa_find(&dev->worker_xa, &index, UINT_MAX, XA_PRESENT);
+   if (!worker || worker->id != info->worker_id)
+   return -ENODEV;
+
+   __vhost_vq_attach_worker(vq, worker);
+   return 0;
+}
+
+/* Caller must have device mutex */
+static int vhost_new_worker(struct vhost_dev *dev,
+   struct vhost_worker_state *info)
+{
+   struct vhost_worker *worker;
+
+   worker = vhost_worker_create(dev);
+   if (!worker)
+   return -ENOMEM;
+
+   info->worker_id = worker->id;
+   return 0;
+}
+
+/* Caller must have device mutex */
+static int vhost_free_worker(struct vhost_dev *dev,
+struct vhost_worker_state *info)
+{
+   unsigned long index = info->worker_id;
+   struct vhost_worker *worker;
+
+   worker = xa_find(&dev->worker_xa, &index, UINT_MAX, XA_PRESENT);
+   if (!worker || worker->id != info->worker_id)
+   return -ENODEV;
+
+   if (worker->attachment_cnt)
+   return -EBUSY;
+
+   vhost_worker_destroy(dev, worker);
+   return 0;
+}
+
 static int vhost_get_vq_from_user(struct vhost_dev *dev, void __user *argp,
  struct vhost_virtqueue **vq, u32 *id)
 {
@@ -644,6 +714,75 @@ static int vhost_get_vq_from_user(struct vhost_dev *dev, 
void __user *argp,
return 0;
 }
 
+/* Caller must have device mutex */
+long vhost_worker_ioctl(struct vhost_dev *dev, unsigned int ioctl,
+   void __user *argp)
+{
+   struct vhost_vring_worker ring_worker;
+   struct vhost_worker_state state;
+   struct vhost_virtqueue *vq;
+   long ret;
+   u32 idx;
+
+   if (!dev->use_worker)
+   return -EINVAL;
+
+   if (!vhost_dev_has_owner(dev))
+   return -EINVAL;
+
+   switch (ioctl) {
+   /* dev worker ioctls */
+   case VHOST_NEW_WORKER:
+   ret = vhost_new_worker(dev, &state);
+   if (!ret && copy_to_user(argp, &state, sizeof(state)))
+   ret = -EFAULT;
+   return ret;
+   case VHOST_FREE_WORKER:
+   if (copy_from_user(&state, argp, sizeof(state)))
+

[PATCH v8 12/17] vhost: remove vhost_work_queue

2023-06-12 Thread Mike Christie

vhost_work_queue is no longer used. Each driver is using the poll or vq
based queueing, so remove vhost_work_queue.

Signed-off-by: Mike Christie 
---
 drivers/vhost/vhost.c | 6 --
 drivers/vhost/vhost.h | 5 ++---
 2 files changed, 2 insertions(+), 9 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 9623264bb3a7..176d5fcd4b58 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -255,12 +255,6 @@ static bool vhost_worker_queue(struct vhost_worker *worker,
return true;
 }
 
-bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
-{
-   return vhost_worker_queue(dev->worker, work);
-}
-EXPORT_SYMBOL_GPL(vhost_work_queue);
-
 bool vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work)
 {
return vhost_worker_queue(vq->worker, work);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index f44eecd4fcf9..b850f534bc9a 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -44,15 +44,14 @@ struct vhost_poll {
struct vhost_virtqueue  *vq;
 };
 
-void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
-bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work);
-
 void vhost_poll_init(struct vhost_poll *poll, vhost_work_fn_t fn,
 __poll_t mask, struct vhost_dev *dev,
 struct vhost_virtqueue *vq);
 int vhost_poll_start(struct vhost_poll *poll, struct file *file);
 void vhost_poll_stop(struct vhost_poll *poll);
 void vhost_poll_queue(struct vhost_poll *poll);
+
+void vhost_work_init(struct vhost_work *work, vhost_work_fn_t fn);
 void vhost_dev_flush(struct vhost_dev *dev);
 
 struct vhost_log {
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v8 00/17] vhost: multiple worker support

2023-06-12 Thread Mike Christie

The following patches were built over Linux's tree. The also apply over
the mst vhost branch as well.

The patcjes allow us to support multiple vhost workers tasks per device.
The design is a modified version of Stefan's original idea where userspace
has the kernel create a worker and we pass back the pid. In this version
instead of passing the pid between user/kernel space we use a worker_id
which is just an integer managed by the vhost driver and we allow
userspace to create and free workers and then attach them to virtqueues
at setup time.

All review comments from the past reviews should be handled. If I didn't
reply to a review comment, I agreed with the comment and should have
handled it in this posting. Let me know if I missed one.

Results:


fio jobs1   2   4   8   12  16
--
1 worker160k   488k -   -   -   -
worker per vq   160k   310k620k1182K1564K   2002k

Notes:
0. This used a simple fio command:

fio --filename=/dev/sdb  --direct=1 --rw=randrw --bs=4k \
--ioengine=libaio --iodepth=128  --numjobs=$JOBS_ABOVE

and I used a VM with 16 vCPUs and 16 virtqueues.

1. The patches were tested with LIO's emulate_pr=0 which drops the
LIO PR lock use. This was a bottleneck at around 12 vqs/jobs.

2. Because we have a hard limit of 1024 cmds, if the num jobs * iodepth
was greater than 1024, I would decrease iodepth. So 12 jobs used 85 cmds,
and 16 used 64.

3. The perf issue above at 2 jobs is because when we only have 1 worker
we execute more cmds per vhost_work due to all vqs funneling to one worker.
I have 2 patches that fix this. One is just submit from the worker thread
instead of kikcing off to a workqueue like how the vhost block patches do.
I'll post this after the worker support is merged. I'm also working on
patches to add back batching during lio completion and do polling on the
submission side.

We will still want the threading patches, because if we batch at the fio
level plus use the vhost theading patches, we can see a big boost like
below. So hopefully doing it at the kernel will allow apps to just work
without having to be smart like fio.

fio using io_uring and batching with the iodepth_batch* settings:

fio jobs1   2   4   8   12  16
-
1 worker494k520k-   -   -   -
worker per vq   496k878k1542k   2436k   2304k   2590k

V8:
- Rebase.
- Move comments about assumptions so it's in the body of the code
instead of top of function so it's more clear.
- Added patch for RCU support and swapping workers while works are
running.

V7:
- Add more comments about assumptions.
- Drop refcounting and just use an attachment_cnt variable, so there
is no confusion about when a worker is freed.
- Do a opt-in model, because vsiock has an issue where it can queue works
before it's running and it doesn't even need multiple workers, so there
is no point in chaning the driver or core code.
- Add back get worker ioctl.
- Broke up last patches to make it easier to read.

V6:
- Rebase against vhost_task patchset.
- Used xa instead of idr.
V5:
- Rebase against user_worker patchset.
- Rebase against flush patchset.
- Redo vhost-scsi tmf flush handling so it doesn't access vq->worker.
V4:
- fix vhost-sock VSOCK_VQ_RX use.
- name functions called directly by ioctl cmd's to match the ioctl cmd.



___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v8 09/17] vhost_scsi: make SCSI cmd completion per vq

2023-06-12 Thread Mike Christie

This patch separates the scsi cmd completion code paths so we can complete
cmds based on their vq instead of having all cmds complete on the same
worker/CPU. This will be useful with the next patches that allow us to
create mulitple worker threads and bind them to different vqs, and we can
have completions running on different threads/CPUs.

Signed-off-by: Mike Christie 
Reviewed-by: Stefan Hajnoczi 
---
 drivers/vhost/scsi.c | 56 
 1 file changed, 26 insertions(+), 30 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index bb10fa4bb4f6..a77c53bb035a 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -167,6 +167,7 @@ MODULE_PARM_DESC(max_io_vqs, "Set the max number of IO 
virtqueues a vhost scsi d
 
 struct vhost_scsi_virtqueue {
struct vhost_virtqueue vq;
+   struct vhost_scsi *vs;
/*
 * Reference counting for inflight reqs, used for flush operation. At
 * each time, one reference tracks new commands submitted, while we
@@ -181,6 +182,9 @@ struct vhost_scsi_virtqueue {
struct vhost_scsi_cmd *scsi_cmds;
struct sbitmap scsi_tags;
int max_cmds;
+
+   struct vhost_work completion_work;
+   struct llist_head completion_list;
 };
 
 struct vhost_scsi {
@@ -190,12 +194,8 @@ struct vhost_scsi {
 
struct vhost_dev dev;
struct vhost_scsi_virtqueue *vqs;
-   unsigned long *compl_bitmap;
struct vhost_scsi_inflight **old_inflight;
 
-   struct vhost_work vs_completion_work; /* cmd completion work item */
-   struct llist_head vs_completion_list; /* cmd completion queue */
-
struct vhost_work vs_event_work; /* evt injection work item */
struct llist_head vs_event_list; /* evt injection queue */
 
@@ -358,10 +358,11 @@ static void vhost_scsi_release_cmd(struct se_cmd *se_cmd)
} else {
struct vhost_scsi_cmd *cmd = container_of(se_cmd,
struct vhost_scsi_cmd, tvc_se_cmd);
-   struct vhost_scsi *vs = cmd->tvc_vhost;
+   struct vhost_scsi_virtqueue *svq =  container_of(cmd->tvc_vq,
+   struct vhost_scsi_virtqueue, vq);
 
-   llist_add(&cmd->tvc_completion_list, &vs->vs_completion_list);
-   vhost_work_queue(&vs->dev, &vs->vs_completion_work);
+   llist_add(&cmd->tvc_completion_list, &svq->completion_list);
+   vhost_vq_work_queue(&svq->vq, &svq->completion_work);
}
 }
 
@@ -509,17 +510,17 @@ static void vhost_scsi_evt_work(struct vhost_work *work)
  */
 static void vhost_scsi_complete_cmd_work(struct vhost_work *work)
 {
-   struct vhost_scsi *vs = container_of(work, struct vhost_scsi,
-   vs_completion_work);
+   struct vhost_scsi_virtqueue *svq = container_of(work,
+   struct vhost_scsi_virtqueue, completion_work);
struct virtio_scsi_cmd_resp v_rsp;
struct vhost_scsi_cmd *cmd, *t;
struct llist_node *llnode;
struct se_cmd *se_cmd;
struct iov_iter iov_iter;
-   int ret, vq;
+   bool signal = false;
+   int ret;
 
-   bitmap_zero(vs->compl_bitmap, vs->dev.nvqs);
-   llnode = llist_del_all(&vs->vs_completion_list);
+   llnode = llist_del_all(&svq->completion_list);
llist_for_each_entry_safe(cmd, t, llnode, tvc_completion_list) {
se_cmd = &cmd->tvc_se_cmd;
 
@@ -539,21 +540,17 @@ static void vhost_scsi_complete_cmd_work(struct 
vhost_work *work)
  cmd->tvc_in_iovs, sizeof(v_rsp));
ret = copy_to_iter(&v_rsp, sizeof(v_rsp), &iov_iter);
if (likely(ret == sizeof(v_rsp))) {
-   struct vhost_scsi_virtqueue *q;
+   signal = true;
+
vhost_add_used(cmd->tvc_vq, cmd->tvc_vq_desc, 0);
-   q = container_of(cmd->tvc_vq, struct 
vhost_scsi_virtqueue, vq);
-   vq = q - vs->vqs;
-   __set_bit(vq, vs->compl_bitmap);
} else
pr_err("Faulted on virtio_scsi_cmd_resp\n");
 
vhost_scsi_release_cmd_res(se_cmd);
}
 
-   vq = -1;
-   while ((vq = find_next_bit(vs->compl_bitmap, vs->dev.nvqs, vq + 1))
-   < vs->dev.nvqs)
-   vhost_signal(&vs->dev, &vs->vqs[vq].vq);
+   if (signal)
+   vhost_signal(&svq->vs->dev, &svq->vq);
 }
 
 static struct vhost_scsi_cmd *
@@ -1770,6 +1767,7 @@ static int vhost_scsi_set_features(struct vhost_scsi *vs, 
u64 features)
 
 static int vhost_scsi_open(struct inode *inode, struct file *f)
 {
+   struct vhost_scsi_virtqueue *svq;

[PATCH v8 03/17] vhost: add vhost_worker pointer to vhost_virtqueue

2023-06-12 Thread Mike Christie

This patchset allows userspace to map vqs to different workers. This
patch adds a worker pointer to the vq so in later patches in this set
we can queue/flush specific vqs and their workers.

Signed-off-by: Mike Christie 
---
 drivers/vhost/vhost.c | 21 ++---
 drivers/vhost/vhost.h |  1 +
 2 files changed, 15 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index dfd96cf6a152..c78c15af97d3 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -479,6 +479,7 @@ void vhost_dev_init(struct vhost_dev *dev,
vq->log = NULL;
vq->indirect = NULL;
vq->heads = NULL;
+   vq->worker = NULL;
vq->dev = dev;
mutex_init(&vq->mutex);
vhost_vq_reset(dev, vq);
@@ -545,7 +546,7 @@ static void vhost_worker_free(struct vhost_dev *dev)
dev->worker = NULL;
 }
 
-static int vhost_worker_create(struct vhost_dev *dev)
+static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
 {
struct vhost_worker *worker;
struct vhost_task *vtsk;
@@ -553,7 +554,7 @@ static int vhost_worker_create(struct vhost_dev *dev)
 
worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
if (!worker)
-   return -ENOMEM;
+   return NULL;
 
snprintf(name, sizeof(name), "vhost-%d", current->pid);
 
@@ -572,17 +573,18 @@ static int vhost_worker_create(struct vhost_dev *dev)
dev->worker = worker;
 
vhost_task_start(vtsk);
-   return 0;
+   return worker;
 
 free_worker:
kfree(worker);
-   return -ENOMEM;
+   return NULL;
 }
 
 /* Caller should have device mutex */
 long vhost_dev_set_owner(struct vhost_dev *dev)
 {
-   int err;
+   struct vhost_worker *worker;
+   int err, i;
 
/* Is there an owner already? */
if (vhost_dev_has_owner(dev)) {
@@ -603,9 +605,14 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 * below since we don't have to worry about vsock queueing
 * while we free the worker.
 */
-   err = vhost_worker_create(dev);
-   if (err)
+   worker = vhost_worker_create(dev);
+   if (!worker) {
+   err = -ENOMEM;
goto err_worker;
+   }
+
+   for (i = 0; i < dev->nvqs; i++)
+   dev->vqs[i]->worker = worker;
}
 
return 0;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index cb872cc4157a..206617edb2a9 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -74,6 +74,7 @@ struct vhost_vring_call {
 /* The virtqueue structure describes a queue attached to a device. */
 struct vhost_virtqueue {
struct vhost_dev *dev;
+   struct vhost_worker *worker;
 
/* The actual ring of buffers. */
struct mutex mutex;
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v8 06/17] vhost: take worker or vq for flushing

2023-06-12 Thread Mike Christie

This patch has the core work flush function take a worker. When we
support multiple workers we can then flush each worker during device
removal, stoppage, etc. It also adds a helper to flush specific
virtqueues, so vhost-scsi can flush IO vqs from it's ctl vq.

Signed-off-by: Mike Christie 
---
 drivers/vhost/vhost.c | 15 +--
 drivers/vhost/vhost.h |  1 +
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 16630c19bcc2..d212cee064d6 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -265,16 +265,27 @@ bool vhost_vq_work_queue(struct vhost_virtqueue *vq, 
struct vhost_work *work)
 }
 EXPORT_SYMBOL_GPL(vhost_vq_work_queue);
 
-void vhost_dev_flush(struct vhost_dev *dev)
+static void vhost_worker_flush(struct vhost_worker *worker)
 {
struct vhost_flush_struct flush;
 
init_completion(&flush.wait_event);
vhost_work_init(&flush.work, vhost_flush_work);
 
-   if (vhost_work_queue(dev, &flush.work))
+   if (vhost_worker_queue(worker, &flush.work))
wait_for_completion(&flush.wait_event);
 }
+
+void vhost_vq_flush(struct vhost_virtqueue *vq)
+{
+   vhost_worker_flush(vq->worker);
+}
+EXPORT_SYMBOL_GPL(vhost_vq_flush);
+
+void vhost_dev_flush(struct vhost_dev *dev)
+{
+   vhost_worker_flush(dev->worker);
+}
 EXPORT_SYMBOL_GPL(vhost_dev_flush);
 
 /* A lockless hint for busy polling code to exit the loop */
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 6a1ae8ae9c7d..ac1f4924e548 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -198,6 +198,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *,
  struct vhost_log *log, unsigned int *log_num);
 void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
 
+void vhost_vq_flush(struct vhost_virtqueue *vq);
 bool vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work);
 bool vhost_vq_has_work(struct vhost_virtqueue *vq);
 bool vhost_vq_is_setup(struct vhost_virtqueue *vq);
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v8 05/17] vhost: take worker or vq instead of dev for queueing

2023-06-12 Thread Mike Christie

This patch has the core work queueing function take a worker for when we
support multiple workers. It also adds a helper that takes a vq during
queueing so modules can control which vq/worker to queue work on.

This temp leaves vhost_work_queue. It will be removed when the drivers
are converted in the next patches.

Signed-off-by: Mike Christie 
---
 drivers/vhost/vhost.c | 44 +++
 drivers/vhost/vhost.h |  1 +
 2 files changed, 29 insertions(+), 16 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a832ca692eb1..16630c19bcc2 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -231,21 +231,10 @@ void vhost_poll_stop(struct vhost_poll *poll)
 }
 EXPORT_SYMBOL_GPL(vhost_poll_stop);
 
-void vhost_dev_flush(struct vhost_dev *dev)
+static bool vhost_worker_queue(struct vhost_worker *worker,
+  struct vhost_work *work)
 {
-   struct vhost_flush_struct flush;
-
-   init_completion(&flush.wait_event);
-   vhost_work_init(&flush.work, vhost_flush_work);
-
-   if (vhost_work_queue(dev, &flush.work))
-   wait_for_completion(&flush.wait_event);
-}
-EXPORT_SYMBOL_GPL(vhost_dev_flush);
-
-bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
-{
-   if (!dev->worker)
+   if (!worker)
return false;
/*
 * vsock can queue while we do a VHOST_SET_OWNER, so we have a smp_wmb
@@ -257,14 +246,37 @@ bool vhost_work_queue(struct vhost_dev *dev, struct 
vhost_work *work)
 * sure it was not in the list.
 * test_and_set_bit() implies a memory barrier.
 */
-   llist_add(&work->node, &dev->worker->work_list);
-   vhost_task_wake(dev->worker->vtsk);
+   llist_add(&work->node, &worker->work_list);
+   vhost_task_wake(worker->vtsk);
}
 
return true;
 }
+
+bool vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
+{
+   return vhost_worker_queue(dev->worker, work);
+}
 EXPORT_SYMBOL_GPL(vhost_work_queue);
 
+bool vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work)
+{
+   return vhost_worker_queue(vq->worker, work);
+}
+EXPORT_SYMBOL_GPL(vhost_vq_work_queue);
+
+void vhost_dev_flush(struct vhost_dev *dev)
+{
+   struct vhost_flush_struct flush;
+
+   init_completion(&flush.wait_event);
+   vhost_work_init(&flush.work, vhost_flush_work);
+
+   if (vhost_work_queue(dev, &flush.work))
+   wait_for_completion(&flush.wait_event);
+}
+EXPORT_SYMBOL_GPL(vhost_dev_flush);
+
 /* A lockless hint for busy polling code to exit the loop */
 bool vhost_vq_has_work(struct vhost_virtqueue *vq)
 {
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 37c183b37c42..6a1ae8ae9c7d 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -198,6 +198,7 @@ int vhost_get_vq_desc(struct vhost_virtqueue *,
  struct vhost_log *log, unsigned int *log_num);
 void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
 
+bool vhost_vq_work_queue(struct vhost_virtqueue *vq, struct vhost_work *work);
 bool vhost_vq_has_work(struct vhost_virtqueue *vq);
 bool vhost_vq_is_setup(struct vhost_virtqueue *vq);
 int vhost_vq_init_access(struct vhost_virtqueue *);
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v8 01/17] vhost: create worker at end of vhost_dev_set_owner

2023-06-12 Thread Mike Christie

vsock can start queueing work after VHOST_VSOCK_SET_GUEST_CID, so
after we have called vhost_worker_create it can be calling
vhost_work_queue and trying to access the vhost worker/task. If
vhost_dev_alloc_iovecs fails, then vhost_worker_free could free
the worker/task from under vsock.

This moves vhost_worker_create to the end of vhost_dev_set_owner
where we know we can no longer fail in that path. If it fails
after the VHOST_SET_OWNER and userspace closes the device, then
the normal vsock release handling will do the right thing.

Signed-off-by: Mike Christie 
---
 drivers/vhost/vhost.c | 19 +--
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 60c9ebd629dd..82966ffb4a5c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -572,20 +572,27 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 
vhost_attach_mm(dev);
 
+   err = vhost_dev_alloc_iovecs(dev);
+   if (err)
+   goto err_iovecs;
+
if (dev->use_worker) {
+   /*
+* This should be done last, because vsock can queue work
+* before VHOST_SET_OWNER so it simplifies the failure path
+* below since we don't have to worry about vsock queueing
+* while we free the worker.
+*/
err = vhost_worker_create(dev);
if (err)
goto err_worker;
}
 
-   err = vhost_dev_alloc_iovecs(dev);
-   if (err)
-   goto err_iovecs;
-
return 0;
-err_iovecs:
-   vhost_worker_free(dev);
+
 err_worker:
+   vhost_dev_free_iovecs(dev);
+err_iovecs:
vhost_detach_mm(dev);
 err_mm:
return err;
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: vhost: Fix vhost_task regressions in 6.4-rc

2023-06-07 Thread Mike Christie

On 6/7/23 3:22 PM, Michael S. Tsirkin wrote:
> On Wed, Jun 07, 2023 at 02:23:36PM -0500, Mike Christie wrote:
>> The following patches were made over Linus's tree which contains a
>> vhost change missing in mst's vhost branch. These patches fix two
>> issues caused by the vhost_task patches:
>> 1. I was setting dev->worker too early and this caused crashes when
>> vsock would queue work before VHOST_SET_OWNER.
>>
>> 2. The patch that Linus's tree contains which vhost does not yet
>> have converted vhost_tasks to use CLONE_THREAD. That required a
>> change to how we set the task state, but I completely messed up
>> and moved when we set ourself to interruptible too late.
>>
> 
> Right. and that's in rc5 yes?
> 

Yes.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

vhost: Fix vhost_task regressions in 6.4-rc

2023-06-07 Thread Mike Christie

The following patches were made over Linus's tree which contains a
vhost change missing in mst's vhost branch. These patches fix two
issues caused by the vhost_task patches:
1. I was setting dev->worker too early and this caused crashes when
vsock would queue work before VHOST_SET_OWNER.

2. The patch that Linus's tree contains which vhost does not yet
have converted vhost_tasks to use CLONE_THREAD. That required a
change to how we set the task state, but I completely messed up
and moved when we set ourself to interruptible too late.



___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH 2/2] vhost: Fix worker hangs due to missed wake up calls

2023-06-07 Thread Mike Christie

We can race where we have added work to the work_list, but
vhost_task_fn has passed that check but not yet set us into
TASK_INTERRUPTIBLE. wake_up_process will see us in TASK_RUNNING and
just return.

This bug was intoduced in commit f9010dbdce91 ("fork, vhost: Use
CLONE_THREAD to fix freezer/ps regression") when I moved the setting
of TASK_INTERRUPTIBLE to simplfy the code and avoid get_signal from
logging warnings about being in the wrong state. This moves the setting
of TASK_INTERRUPTIBLE back to before we test if we need to stop the
task to avoid a possible race there as well. We then have vhost_worker
set TASK_RUNNING if it finds work similar to before.

Fixes: f9010dbdce91 ("fork, vhost: Use CLONE_THREAD to fix freezer/ps 
regression")
Signed-off-by: Mike Christie 
---
 drivers/vhost/vhost.c |  2 ++
 kernel/vhost_task.c   | 16 +---
 2 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 7a9f93eae225..b2722d29e069 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -341,6 +341,8 @@ static bool vhost_worker(void *data)
 
node = llist_del_all(&worker->work_list);
if (node) {
+   __set_current_state(TASK_RUNNING);
+
node = llist_reverse_order(node);
/* make sure flag is seen after deletion */
smp_wmb();
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index f80d5c51ae67..da35e5b7f047 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -28,10 +28,6 @@ static int vhost_task_fn(void *data)
for (;;) {
bool did_work;
 
-   /* mb paired w/ vhost_task_stop */
-   if (test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags))
-   break;
-
if (!dead && signal_pending(current)) {
struct ksignal ksig;
/*
@@ -48,11 +44,17 @@ static int vhost_task_fn(void *data)
clear_thread_flag(TIF_SIGPENDING);
}
 
+   /* mb paired w/ vhost_task_stop */
+   set_current_state(TASK_INTERRUPTIBLE);
+
+   if (test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags)) {
+   __set_current_state(TASK_RUNNING);
+   break;
+   }
+
did_work = vtsk->fn(vtsk->data);
-   if (!did_work) {
-   set_current_state(TASK_INTERRUPTIBLE);
+   if (!did_work)
schedule();
-   }
}
 
complete(&vtsk->exited);
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH 1/2] vhost: Fix crash during early vhost_transport_send_pkt calls

2023-06-07 Thread Mike Christie

If userspace does VHOST_VSOCK_SET_GUEST_CID before VHOST_SET_OWNER we
can race where:
1. thread0 calls vhost_transport_send_pkt -> vhost_work_queue
2. thread1 does VHOST_SET_OWNER which calls vhost_worker_create.
3. vhost_worker_create will set the dev->worker pointer before setting
the worker->vtsk pointer.
4. thread0's vhost_work_queue will see the dev->worker pointer is
set and try to call vhost_task_wake using not yet set worker->vtsk
pointer.
5. We then crash since vtsk is NULL.

Before commit 6e890c5d5021 ("vhost: use vhost_tasks for worker
threads"), we only had the worker pointer so we could just check it to
see if VHOST_SET_OWNER has been done. After that commit we have the
vhost_worker and vhost_task pointer, so we can now hit the bug above.

This patch embeds the vhost_worker in the vhost_dev and moves the work
list initialization back to vhost_dev_init, so we can just check the
worker.vtsk pointer to check if VHOST_SET_OWNER has been done like
before.

Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads")
Signed-off-by: Mike Christie 
---
 drivers/vhost/vhost.c | 50 +++
 drivers/vhost/vhost.h |  2 +-
 2 files changed, 18 insertions(+), 34 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 074273020849..7a9f93eae225 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -235,7 +235,7 @@ void vhost_dev_flush(struct vhost_dev *dev)
 {
struct vhost_flush_struct flush;
 
-   if (dev->worker) {
+   if (dev->worker.vtsk) {
init_completion(&flush.wait_event);
vhost_work_init(&flush.work, vhost_flush_work);
 
@@ -247,7 +247,7 @@ EXPORT_SYMBOL_GPL(vhost_dev_flush);
 
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 {
-   if (!dev->worker)
+   if (!dev->worker.vtsk)
return;
 
if (!test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) {
@@ -255,8 +255,8 @@ void vhost_work_queue(struct vhost_dev *dev, struct 
vhost_work *work)
 * sure it was not in the list.
 * test_and_set_bit() implies a memory barrier.
 */
-   llist_add(&work->node, &dev->worker->work_list);
-   vhost_task_wake(dev->worker->vtsk);
+   llist_add(&work->node, &dev->worker.work_list);
+   vhost_task_wake(dev->worker.vtsk);
}
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
@@ -264,7 +264,7 @@ EXPORT_SYMBOL_GPL(vhost_work_queue);
 /* A lockless hint for busy polling code to exit the loop */
 bool vhost_has_work(struct vhost_dev *dev)
 {
-   return dev->worker && !llist_empty(&dev->worker->work_list);
+   return !llist_empty(&dev->worker.work_list);
 }
 EXPORT_SYMBOL_GPL(vhost_has_work);
 
@@ -456,7 +456,8 @@ void vhost_dev_init(struct vhost_dev *dev,
dev->umem = NULL;
dev->iotlb = NULL;
dev->mm = NULL;
-   dev->worker = NULL;
+   memset(&dev->worker, 0, sizeof(dev->worker));
+   init_llist_head(&dev->worker.work_list);
dev->iov_limit = iov_limit;
dev->weight = weight;
dev->byte_weight = byte_weight;
@@ -530,47 +531,30 @@ static void vhost_detach_mm(struct vhost_dev *dev)
 
 static void vhost_worker_free(struct vhost_dev *dev)
 {
-   struct vhost_worker *worker = dev->worker;
-
-   if (!worker)
+   if (!dev->worker.vtsk)
return;
 
-   dev->worker = NULL;
-   WARN_ON(!llist_empty(&worker->work_list));
-   vhost_task_stop(worker->vtsk);
-   kfree(worker);
+   WARN_ON(!llist_empty(&dev->worker.work_list));
+   vhost_task_stop(dev->worker.vtsk);
+   dev->worker.kcov_handle = 0;
+   dev->worker.vtsk = NULL;
 }
 
 static int vhost_worker_create(struct vhost_dev *dev)
 {
-   struct vhost_worker *worker;
struct vhost_task *vtsk;
char name[TASK_COMM_LEN];
-   int ret;
-
-   worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
-   if (!worker)
-   return -ENOMEM;
 
-   dev->worker = worker;
-   worker->kcov_handle = kcov_common_handle();
-   init_llist_head(&worker->work_list);
snprintf(name, sizeof(name), "vhost-%d", current->pid);
 
-   vtsk = vhost_task_create(vhost_worker, worker, name);
-   if (!vtsk) {
-   ret = -ENOMEM;
-   goto free_worker;
-   }
+   vtsk = vhost_task_create(vhost_worker, &dev->worker, name);
+   if (!vtsk)
+   return -ENOMEM;
 
-   worker->vtsk = vtsk;
+   dev->worker.kcov_handle = kcov_common_handle();
+   dev->worker.vtsk = vtsk;
vhost_task_start(vtsk);
return 0;
-
-free_worker:
-   kfree(worker);
-   dev->worker

Re: [PATCH 1/1] vhost: Fix crash during early vhost_transport_send_pkt calls

2023-06-06 Thread Mike Christie

On 6/6/23 2:22 PM, Michael S. Tsirkin wrote:
> On Tue, Jun 06, 2023 at 12:19:10PM -0500, Mike Christie wrote:
>> On 6/6/23 4:49 AM, Stefano Garzarella wrote:
>>> On Mon, Jun 05, 2023 at 01:57:30PM -0500, Mike Christie wrote:
>>>> If userspace does VHOST_VSOCK_SET_GUEST_CID before VHOST_SET_OWNER we
>>>> can race where:
>>>> 1. thread0 calls vhost_transport_send_pkt -> vhost_work_queue
>>>> 2. thread1 does VHOST_SET_OWNER which calls vhost_worker_create.
>>>> 3. vhost_worker_create will set the dev->worker pointer before setting
>>>> the worker->vtsk pointer.
>>>> 4. thread0's vhost_work_queue will see the dev->worker pointer is
>>>> set and try to call vhost_task_wake using not yet set worker->vtsk
>>>> pointer.
>>>> 5. We then crash since vtsk is NULL.
>>>>
>>>> Before commit 6e890c5d5021 ("vhost: use vhost_tasks for worker
>>>> threads"), we only had the worker pointer so we could just check it to
>>>> see if VHOST_SET_OWNER has been done. After that commit we have the
>>>> vhost_worker and vhost_task pointers, so we can now hit the bug above.
>>>>
>>>> This patch embeds the vhost_worker in the vhost_dev, so we can just
>>>> check the worker.vtsk pointer to check if VHOST_SET_OWNER has been done
>>>> like before.
>>>>
>>>> Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads")
>>>
>>> We should add:
>>>
>>> Reported-by: syzbot+d0d442c22fa8db45f...@syzkaller.appspotmail.com
>>
>>
>> Ok. Will do.
>>
>>
>>>> -    }
>>>> +    vtsk = vhost_task_create(vhost_worker, &dev->worker, name);
>>>> +    if (!vtsk)
>>>> +    return -ENOMEM;
>>>>
>>>> -    worker->vtsk = vtsk;
>>>> +    dev->worker.kcov_handle = kcov_common_handle();
>>>> +    dev->worker.vtsk = vtsk;
>>>
>>> vhost_work_queue() is called by vhost_transport_send_pkt() without
>>> holding vhost_dev.mutex (like vhost_poll_queue() in several places).
>>>
>>> If vhost_work_queue() finds dev->worker.vtsk not NULL, how can we
>>> be sure that for example `work_list` has been initialized?
>>>
>>> Maybe I'm overthinking since we didn't have this problem before or the
>>> race is really short that it never happened.
>>
>> Yeah, I dropped the READ/WRITE_ONCE use because I didn't think we needed
>> it for the vhost_vsock_start case, and for the case you mentioned above, I
>> wondered the same thing as you but was not sure so I added the same
>> behavior as before. When I read memory-barriers.txt, it sounds like we've
>> been getting lucky.
> 
> Yea READ/WRITE_ONCE is one of these things. Once you start adding
> them it's hard to stop, you begin to think it's needed everywhere.
> To actually know you need a language lawyer (READ/WRITE_ONCE
> is a compiler thing not a CPU thing).

I am worried about the compiler reordering when init_llist_head sets
the llist->first pointer to NULL and when we set the vtsk pointer.
If they are in reverse order, then vhost_work_queue could detect
there is a vtsk, but then after we add work to the work_list we could
set llist->first to NULL.

However, to avoid this type of thing we need to have init_llist_head
use WRITE_ONCE right as well as vhost_worker_create when it sets vtsk
in the patch.

One other thing is that we could just move the init_llist_head to
vhost_dev_init like it was before. We should be safe like before.





___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-06-06 Thread Mike Christie

On 6/6/23 2:39 PM, Oleg Nesterov wrote:
> On 06/06, Mike Christie wrote:
>>
>> On 6/6/23 7:16 AM, Oleg Nesterov wrote:
>>> On 06/05, Mike Christie wrote:
>>>
>>>> So it works like if we were using a kthread still:
>>>>
>>>> 1. Userapce thread0 opens /dev/vhost-$something.
>>>> 2. thread0 does VHOST_SET_OWNER ioctl. This calls vhost_task_create() to
>>>> create the task_struct which runs the vhost_worker() function which handles
>>>> the work->fns.
>>>> 3. If userspace now does a SIGKILL or just exits without doing a close() on
>>>> /dev/vhost-$something, then when thread0 does exit_files() that will do the
>>>> fput that does vhost-$something's file_operations->release.
>>>
>>> So, at least in this simple case vhost_worker() can just exit after SIGKILL,
>>> and thread0 can flush the outstanding commands when it calls 
>>> vhost_dev_flush()
>>> rather than wait for vhost_worker().
>>>
>>> Right?
>>
>> With the current code, the answer is no. We would hang like I mentioned here:
>>
>> https://lore.kernel.org/lkml/ae250076-7d55-c407-1066-86b37014c...@oracle.com/
> 
> If only I could fully understand this email ;)
> 
> Could you spell to explain why this can't work (again, in this simple case) ?
> 
> My current (and I know, very poor) understanding is that .release() should
> roughly do the following:
> 
>   1. Ensure that vhost_work_queue() can't add the new callbacks
> 
>   2. Call vhost_dev_flush() to ensure that worker->work_list is empty
> 

The problem is what do we do in the work->fn.

What you wrote is correct for cleaning up the work_list. However, the lower 
level
vhost drivers, like vhost-scsi, will do something like:

async_submit_request_to_storage/net_layer()

from their work->fn. The submission is async so when the request completes it
calls some callbacks that call into the vhost driver and vhost layer. For
vhost-scsi the call back will run vhost_queue_work so we can complete the 
request
from the vhost_task.

So if we've already run the work->fn then we need to add code to handle the
completion of the request we submitted. We need:

1. vhost_queue_work needs some code to detect when the vhost_task has exited
so we don't do vhost_task_wake on a freed task.

I was saying for this, we can sprinkle some RCU in there and in the code paths
we cleanup the vhost_task.

2. The next problem is that if the vhost_task is going to just loop over the
work_list and kill those works before it exits (or if we do it from the 
vhost_dev_flush
function), then we still have handle those async requests that got kicked off to
some other layer that are going to eventually complete and try to call
vhost_work_queue.

With #1, we can detect when the vhost_task is no longer usable, so we then need
to modify the drivers to detect that and instead of trying to execute like 
normal
where they queue the work, they just take their failure paths and free 
resources.

So the release cabllback was doing 2 things:
1. Flushing the work_list
2. Waiting on the those request completions

And so I was saying before I'm trying to finish up handling #2. I hit some
hiccups though because it turns out there is at least one case where we
don't have a vhost_task but we don't want to fail. It's just a matter of
coding it though.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 1/1] vhost: Fix crash during early vhost_transport_send_pkt calls

2023-06-06 Thread Mike Christie

On 6/6/23 4:49 AM, Stefano Garzarella wrote:
> On Mon, Jun 05, 2023 at 01:57:30PM -0500, Mike Christie wrote:
>> If userspace does VHOST_VSOCK_SET_GUEST_CID before VHOST_SET_OWNER we
>> can race where:
>> 1. thread0 calls vhost_transport_send_pkt -> vhost_work_queue
>> 2. thread1 does VHOST_SET_OWNER which calls vhost_worker_create.
>> 3. vhost_worker_create will set the dev->worker pointer before setting
>> the worker->vtsk pointer.
>> 4. thread0's vhost_work_queue will see the dev->worker pointer is
>> set and try to call vhost_task_wake using not yet set worker->vtsk
>> pointer.
>> 5. We then crash since vtsk is NULL.
>>
>> Before commit 6e890c5d5021 ("vhost: use vhost_tasks for worker
>> threads"), we only had the worker pointer so we could just check it to
>> see if VHOST_SET_OWNER has been done. After that commit we have the
>> vhost_worker and vhost_task pointers, so we can now hit the bug above.
>>
>> This patch embeds the vhost_worker in the vhost_dev, so we can just
>> check the worker.vtsk pointer to check if VHOST_SET_OWNER has been done
>> like before.
>>
>> Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads")
> 
> We should add:
> 
> Reported-by: syzbot+d0d442c22fa8db45f...@syzkaller.appspotmail.com


Ok. Will do.


>> -    }
>> +    vtsk = vhost_task_create(vhost_worker, &dev->worker, name);
>> +    if (!vtsk)
>> +    return -ENOMEM;
>>
>> -    worker->vtsk = vtsk;
>> +    dev->worker.kcov_handle = kcov_common_handle();
>> +    dev->worker.vtsk = vtsk;
> 
> vhost_work_queue() is called by vhost_transport_send_pkt() without
> holding vhost_dev.mutex (like vhost_poll_queue() in several places).
> 
> If vhost_work_queue() finds dev->worker.vtsk not NULL, how can we
> be sure that for example `work_list` has been initialized?
> 
> Maybe I'm overthinking since we didn't have this problem before or the
> race is really short that it never happened.

Yeah, I dropped the READ/WRITE_ONCE use because I didn't think we needed
it for the vhost_vsock_start case, and for the case you mentioned above, I
wondered the same thing as you but was not sure so I added the same
behavior as before. When I read memory-barriers.txt, it sounds like we've
been getting lucky.
 
I'll add back the READ/WRITE_ONCE for vtsk access since that's what we are
keying off as signaling that the worker is ready to be used. I didn't see
any type of perf hit when using it, and from the memory-barriers.txt doc
it sounds like that's what we should be doing.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-06-06 Thread Mike Christie

On 6/6/23 7:16 AM, Oleg Nesterov wrote:
> On 06/05, Mike Christie wrote:
>>
>> On 6/5/23 10:10 AM, Oleg Nesterov wrote:
>>> On 06/03, michael.chris...@oracle.com wrote:
>>>>
>>>> On 6/2/23 11:15 PM, Eric W. Biederman wrote:
>>>> The problem is that as part of the flush the drivers/vhost/scsi.c code
>>>> will wait for outstanding commands, because we can't free the device and
>>>> it's resources before the commands complete or we will hit the accessing
>>>> freed memory bug.
>>>
>>> ignoring send-fd/clone issues, can we assume that the final fput/release
>>> should always come from vhost_worker's sub-thread (which shares mm/etc) ?
>>
>> I think I'm misunderstanding the sub-thread term.
>>
>> - Is it the task_struct's context that we did the
>> kernel/vhost_taskc.c:vhost_task_create() from? Below it would be the
>> thread we did VHOST_SET_OWNER from.
> 
> Yes,
> 
>> So it works like if we were using a kthread still:
>>
>> 1. Userapce thread0 opens /dev/vhost-$something.
>> 2. thread0 does VHOST_SET_OWNER ioctl. This calls vhost_task_create() to
>> create the task_struct which runs the vhost_worker() function which handles
>> the work->fns.
>> 3. If userspace now does a SIGKILL or just exits without doing a close() on
>> /dev/vhost-$something, then when thread0 does exit_files() that will do the
>> fput that does vhost-$something's file_operations->release.
> 
> So, at least in this simple case vhost_worker() can just exit after SIGKILL,
> and thread0 can flush the outstanding commands when it calls vhost_dev_flush()
> rather than wait for vhost_worker().
> 
> Right?

With the current code, the answer is no. We would hang like I mentioned here:

https://lore.kernel.org/lkml/ae250076-7d55-c407-1066-86b37014c...@oracle.com/

We need to add code like I mentioned in that reply because we don't have a
way to call into the layers below us to flush those commands. We need more
like an abort and don't call back into us type of operation. Or, I'm just trying
to add a check where we detect what happened then instead of trying to use
the vhost_task we try to complete in the context the lower level completes us
in.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH 1/1] vhost: Fix crash during early vhost_transport_send_pkt calls

2023-06-05 Thread Mike Christie

If userspace does VHOST_VSOCK_SET_GUEST_CID before VHOST_SET_OWNER we
can race where:
1. thread0 calls vhost_transport_send_pkt -> vhost_work_queue
2. thread1 does VHOST_SET_OWNER which calls vhost_worker_create.
3. vhost_worker_create will set the dev->worker pointer before setting
the worker->vtsk pointer.
4. thread0's vhost_work_queue will see the dev->worker pointer is
set and try to call vhost_task_wake using not yet set worker->vtsk
pointer.
5. We then crash since vtsk is NULL.

Before commit 6e890c5d5021 ("vhost: use vhost_tasks for worker
threads"), we only had the worker pointer so we could just check it to
see if VHOST_SET_OWNER has been done. After that commit we have the
vhost_worker and vhost_task pointers, so we can now hit the bug above.

This patch embeds the vhost_worker in the vhost_dev, so we can just
check the worker.vtsk pointer to check if VHOST_SET_OWNER has been done
like before.

Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads")
Signed-off-by: Mike Christie 
---
 drivers/vhost/vhost.c | 50 +++
 drivers/vhost/vhost.h |  2 +-
 2 files changed, 18 insertions(+), 34 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 074273020849..0ad9fea7c170 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -235,7 +235,7 @@ void vhost_dev_flush(struct vhost_dev *dev)
 {
struct vhost_flush_struct flush;
 
-   if (dev->worker) {
+   if (dev->worker.vtsk) {
init_completion(&flush.wait_event);
vhost_work_init(&flush.work, vhost_flush_work);
 
@@ -247,7 +247,7 @@ EXPORT_SYMBOL_GPL(vhost_dev_flush);
 
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 {
-   if (!dev->worker)
+   if (!dev->worker.vtsk)
return;
 
if (!test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) {
@@ -255,8 +255,8 @@ void vhost_work_queue(struct vhost_dev *dev, struct 
vhost_work *work)
 * sure it was not in the list.
 * test_and_set_bit() implies a memory barrier.
 */
-   llist_add(&work->node, &dev->worker->work_list);
-   vhost_task_wake(dev->worker->vtsk);
+   llist_add(&work->node, &dev->worker.work_list);
+   vhost_task_wake(dev->worker.vtsk);
}
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
@@ -264,7 +264,7 @@ EXPORT_SYMBOL_GPL(vhost_work_queue);
 /* A lockless hint for busy polling code to exit the loop */
 bool vhost_has_work(struct vhost_dev *dev)
 {
-   return dev->worker && !llist_empty(&dev->worker->work_list);
+   return !llist_empty(&dev->worker.work_list);
 }
 EXPORT_SYMBOL_GPL(vhost_has_work);
 
@@ -456,7 +456,7 @@ void vhost_dev_init(struct vhost_dev *dev,
dev->umem = NULL;
dev->iotlb = NULL;
dev->mm = NULL;
-   dev->worker = NULL;
+   memset(&dev->worker, 0, sizeof(dev->worker));
dev->iov_limit = iov_limit;
dev->weight = weight;
dev->byte_weight = byte_weight;
@@ -530,47 +530,31 @@ static void vhost_detach_mm(struct vhost_dev *dev)
 
 static void vhost_worker_free(struct vhost_dev *dev)
 {
-   struct vhost_worker *worker = dev->worker;
-
-   if (!worker)
+   if (!dev->worker.vtsk)
return;
 
-   dev->worker = NULL;
-   WARN_ON(!llist_empty(&worker->work_list));
-   vhost_task_stop(worker->vtsk);
-   kfree(worker);
+   WARN_ON(!llist_empty(&dev->worker.work_list));
+   vhost_task_stop(dev->worker.vtsk);
+   dev->worker.kcov_handle = 0;
+   dev->worker.vtsk = NULL;
 }
 
 static int vhost_worker_create(struct vhost_dev *dev)
 {
-   struct vhost_worker *worker;
struct vhost_task *vtsk;
char name[TASK_COMM_LEN];
-   int ret;
-
-   worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
-   if (!worker)
-   return -ENOMEM;
 
-   dev->worker = worker;
-   worker->kcov_handle = kcov_common_handle();
-   init_llist_head(&worker->work_list);
+   init_llist_head(&dev->worker.work_list);
snprintf(name, sizeof(name), "vhost-%d", current->pid);
 
-   vtsk = vhost_task_create(vhost_worker, worker, name);
-   if (!vtsk) {
-   ret = -ENOMEM;
-   goto free_worker;
-   }
+   vtsk = vhost_task_create(vhost_worker, &dev->worker, name);
+   if (!vtsk)
+   return -ENOMEM;
 
-   worker->vtsk = vtsk;
+   dev->worker.kcov_handle = kcov_common_handle();
+   dev->worker.vtsk = vtsk;
vhost_task_start(vtsk);
return 0;
-
-free_worker:
-   kfree(worker);
-   dev->worker = NULL;
-   return ret;
 }
 
 /* Caller shoul

Re: [CFT][PATCH v3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-06-05 Thread Mike Christie

On 6/5/23 10:10 AM, Oleg Nesterov wrote:
> On 06/03, michael.chris...@oracle.com wrote:
>>
>> On 6/2/23 11:15 PM, Eric W. Biederman wrote:
>> The problem is that as part of the flush the drivers/vhost/scsi.c code
>> will wait for outstanding commands, because we can't free the device and
>> it's resources before the commands complete or we will hit the accessing
>> freed memory bug.
> 
> ignoring send-fd/clone issues, can we assume that the final fput/release
> should always come from vhost_worker's sub-thread (which shares mm/etc) ?

I think I'm misunderstanding the sub-thread term.

- Is it the task_struct's context that we did the
kernel/vhost_taskc.c:vhost_task_create() from? Below it would be the
thread we did VHOST_SET_OWNER from.

If so, then yes.

- Is it the task_struct that gets created by
kernel/vhost_taskc.c:vhost_task_create()?

If so, then the answer is no. vhost_task_create has set the no_files
arg on kernel_clone_args, so copy_files() sets task_struct->files to NULL
and we don't clone or dup the files.

So it works like if we were using a kthread still:

1. Userapce thread0 opens /dev/vhost-$something.
2. thread0 does VHOST_SET_OWNER ioctl. This calls vhost_task_create() to
create the task_struct which runs the vhost_worker() function which handles
the work->fns.
3. If userspace now does a SIGKILL or just exits without doing a close() on
/dev/vhost-$something, then when thread0 does exit_files() that will do the
fput that does vhost-$something's file_operations->release.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH 1/1] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-06-01 Thread Mike Christie

When switching from kthreads to vhost_tasks two bugs were added:
1. The vhost worker tasks's now show up as processes so scripts doing
ps or ps a would not incorrectly detect the vhost task as another
process.  2. kthreads disabled freeze by setting PF_NOFREEZE, but
vhost tasks's didn't disable or add support for them.

To fix both bugs, this switches the vhost task to be thread in the
process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call
get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that
SIGKILL/STOP support is required because CLONE_THREAD requires
CLONE_SIGHAND which requires those 2 signals to be supported.

This is a modified version of the patch written by Mike Christie
 which was a modified version of patch
originally written by Linus.

Much of what depended upon PF_IO_WORKER now depends on PF_USER_WORKER.
Including ignoring signals, setting up the register state, and having
get_signal return instead of calling do_group_exit.

Tidied up the vhost_task abstraction so that the definition of
vhost_task only needs to be visible inside of vhost_task.c.  Making
it easier to review the code and tell what needs to be done where.
As part of this the main loop has been moved from vhost_worker into
vhost_task_fn.  vhost_worker now returns true if work was done.

The main loop has been updated to call get_signal which handles
SIGSTOP, freezing, and collects the message that tells the thread to
exit as part of process exit.  This collection clears
__fatal_signal_pending.  This collection is not guaranteed to
clear signal_pending() so clear that explicitly so the schedule()
sleeps.

For now the vhost thread continues to exist and run work until the
last file descriptor is closed and the release function is called as
part of freeing struct file.  To avoid hangs in the coredump
rendezvous and when killing threads in a multi-threaded exec.  The
coredump code and de_thread have been modified to ignore vhost threads.

Remvoing the special case for exec appears to require teaching
vhost_dev_flush how to directly complete transactions in case
the vhost thread is no longer running.

Removing the special case for coredump rendezvous requires either the
above fix needed for exec or moving the coredump rendezvous into
get_signal.

Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads")
Signed-off-by: Eric W. Biederman 
Co-developed-by: Mike Christie 
Signed-off-by: Mike Christie 
---
 arch/x86/include/asm/fpu/sched.h |  2 +-
 arch/x86/kernel/fpu/context.h|  2 +-
 arch/x86/kernel/fpu/core.c   |  2 +-
 drivers/vhost/vhost.c| 22 ++--
 fs/coredump.c|  4 +-
 include/linux/sched/task.h   |  1 -
 include/linux/sched/vhost_task.h | 15 ++
 kernel/exit.c|  5 +-
 kernel/fork.c| 13 ++---
 kernel/signal.c  |  8 +--
 kernel/vhost_task.c  | 92 +---
 11 files changed, 89 insertions(+), 77 deletions(-)

diff --git a/arch/x86/include/asm/fpu/sched.h b/arch/x86/include/asm/fpu/sched.h
index c2d6cd78ed0c..78fcde7b1f07 100644
--- a/arch/x86/include/asm/fpu/sched.h
+++ b/arch/x86/include/asm/fpu/sched.h
@@ -39,7 +39,7 @@ extern void fpu_flush_thread(void);
 static inline void switch_fpu_prepare(struct fpu *old_fpu, int cpu)
 {
if (cpu_feature_enabled(X86_FEATURE_FPU) &&
-   !(current->flags & (PF_KTHREAD | PF_IO_WORKER))) {
+   !(current->flags & (PF_KTHREAD | PF_USER_WORKER))) {
save_fpregs_to_fpstate(old_fpu);
/*
 * The save operation preserved register state, so the
diff --git a/arch/x86/kernel/fpu/context.h b/arch/x86/kernel/fpu/context.h
index 9fcfa5c4dad7..af5cbdd9bd29 100644
--- a/arch/x86/kernel/fpu/context.h
+++ b/arch/x86/kernel/fpu/context.h
@@ -57,7 +57,7 @@ static inline void fpregs_restore_userregs(void)
struct fpu *fpu = ¤t->thread.fpu;
int cpu = smp_processor_id();
 
-   if (WARN_ON_ONCE(current->flags & (PF_KTHREAD | PF_IO_WORKER)))
+   if (WARN_ON_ONCE(current->flags & (PF_KTHREAD | PF_USER_WORKER)))
return;
 
if (!fpregs_state_valid(fpu, cpu)) {
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index caf33486dc5e..1015af1ae562 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -426,7 +426,7 @@ void kernel_fpu_begin_mask(unsigned int kfpu_mask)
 
this_cpu_write(in_kernel_fpu, true);
 
-   if (!(current->flags & (PF_KTHREAD | PF_IO_WORKER)) &&
+   if (!(current->flags & (PF_KTHREAD | PF_USER_WORKER)) &&
!test_thread_flag(TIF_NEED_FPU_LOAD)) {
set_thread_flag(TIF_NEED_FPU_LOAD);
save_fpregs_to_fpstate(¤t->thread.fpu);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a92af08e7864..074273020849 100644
--- a/d

Re: [syzbot] [kvm?] [net?] [virt?] general protection fault in vhost_work_queue

2023-06-01 Thread Mike Christie

On 6/1/23 2:47 AM, Stefano Garzarella wrote:
>>
>> static void vhost_worker_free(struct vhost_dev *dev)
>> {
>> -    struct vhost_worker *worker = dev->worker;
>> +    struct vhost_task *vtsk = READ_ONCE(dev->worker.vtsk);
>>
>> -    if (!worker)
>> +    if (!vtsk)
>>     return;
>>
>> -    dev->worker = NULL;
>> -    WARN_ON(!llist_empty(&worker->work_list));
>> -    vhost_task_stop(worker->vtsk);
>> -    kfree(worker);
>> +    vhost_task_stop(vtsk);
>> +    WARN_ON(!llist_empty(&dev->worker.work_list));
>> +    WRITE_ONCE(dev->worker.vtsk, NULL);
> 
> The patch LGTM, I just wonder if we should set dev->worker to zero here,

We might want to just set kcov_handle to zero for now.

In 6.3 and older, I think we could do:

1. vhost_dev_set_owner could successfully set dev->worker.
2. vhost_transport_send_pkt runs vhost_work_queue and sees worker
is set and adds the vhost_work to the work_list.
3. vhost_dev_set_owner fails in vhost_attach_cgroups, so we stop
the worker before the work can be run and set worker to NULL.
4. We clear kcov_handle and return.

We leave the work on the work_list.

5. Userspace can then retry vhost_dev_set_owner. If that works, then the
work gets executed ok eventually.

OR

Userspace can just close the device. vhost_vsock_dev_release would
eventually call vhost_dev_cleanup (vhost_dev_flush won't see a worker
so will just return), and that will hit the WARN_ON but we would
proceed ok.

If I do a memset of the worker, then if userspace were to retry
VHOST_SET_OWNER, we would lose the queued work since the work_list would
get zero'd. I think it's unlikely this ever happens, but you know best
so let me know if this a real issue.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [RFC PATCH 0/8] vhost_tasks: Use CLONE_THREAD/SIGHAND

2023-06-01 Thread Mike Christie

On 6/1/23 5:47 AM, Christian Brauner wrote:
> On Thu, Jun 01, 2023 at 09:58:38AM +0200, Thorsten Leemhuis wrote:
>> On 19.05.23 14:15, Christian Brauner wrote:
>>> On Thu, May 18, 2023 at 10:25:11AM +0200, Christian Brauner wrote:
>>>> On Wed, May 17, 2023 at 07:09:12PM -0500, Mike Christie wrote:
>>>>> This patch allows the vhost and vhost_task code to use CLONE_THREAD,
>>>>> CLONE_SIGHAND and CLONE_FILES. It's a RFC because I didn't do all the
>>>>> normal testing, haven't coverted vsock and vdpa, and I know you guys
>>>>> will not like the first patch. However, I think it better shows what
>>>> Just to summarize the core idea behind my proposal is that no signal
>>>> handling changes are needed unless there's a bug in the current way
>>>> io_uring workers already work. All that should be needed is
>>>> s/PF_IO_WORKER/PF_USER_WORKER/ in signal.c.
>> [...]
>>>> So it feels like this should be achievable by adding a callback to
>>>> struct vhost_worker that get's called when vhost_worker() gets SIGKILL
>>>> and that all the users of vhost workers are forced to implement.
>>>>
>>>> Yes, it is more work but I think that's the right thing to do and not to
>>>> complicate our signal handling.
>>>>
>>>> Worst case if this can't be done fast enough we'll have to revert the
>>>> vhost parts. I think the user worker parts are mostly sane and are
>>> As mentioned, if we can't settle this cleanly before -rc4 we should
>>> revert the vhost parts unless Linus wants to have it earlier.
>> Meanwhile -rc5 is just a few days away and there are still a lot of
>> discussions in the patch-set proposed to address the issues[1]. Which is
>> kinda great (albeit also why I haven't given it a spin yet), but on the
>> other hand makes we wonder:
> You might've missed it in the thread but it seems everyone is currently
> operating under the assumption that the preferred way is to fix this is
> rather than revert. See the mail in [1]:
> 
> "So I'd really like to finish this. Even if we end up with a hack or
> two in signal handling that we can hopefully fix up later by having
> vhost fix up some of its current assumptions."
> 
> which is why no revert was send for -rc4. And there's a temporary fix we
> seem to have converged on.
> 
> @Mike, do you want to prepare an updated version of the temporary fix.
> If @Linus prefers to just apply it directly he can just grab it from the
> list rather than delaying it. Make sure to grab a Co-developed-by line
> on this, @Mike.

Yes, I'll send it within a couple hours.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [syzbot] [kvm?] [net?] [virt?] general protection fault in vhost_work_queue

2023-05-31 Thread Mike Christie

On 5/31/23 10:15 AM, Mike Christie wrote:
>>> rcu would work for your case and for what Jason had requested.
>> Yeah, so you already have some patches?
>>
>> Do you want to send it to solve this problem?
>>
> Yeah, I'll break them out and send them later today when I can retest
> rebased patches.
> 

Just one question. Do you core vhost developers consider RCU more complex
or switching to READ_ONCE/WRITE_ONCE? I am asking because for this immediate
regression fix we could just switch to the latter like below to just fix
the crash if we think that is more simple.

I think RCU is just a little more complex/invasive because it will have the
extra synchronize_rcu calls.


diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a92af08e7864..03fd47a22a73 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -235,7 +235,7 @@ void vhost_dev_flush(struct vhost_dev *dev)
 {
struct vhost_flush_struct flush;
 
-   if (dev->worker) {
+   if (READ_ONCE(dev->worker.vtsk)) {
init_completion(&flush.wait_event);
vhost_work_init(&flush.work, vhost_flush_work);
 
@@ -247,7 +247,9 @@ EXPORT_SYMBOL_GPL(vhost_dev_flush);
 
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 {
-   if (!dev->worker)
+   struct vhost_task *vtsk = READ_ONCE(dev->worker.vtsk);
+
+   if (!vtsk)
return;
 
if (!test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) {
@@ -255,8 +257,8 @@ void vhost_work_queue(struct vhost_dev *dev, struct 
vhost_work *work)
 * sure it was not in the list.
 * test_and_set_bit() implies a memory barrier.
 */
-   llist_add(&work->node, &dev->worker->work_list);
-   wake_up_process(dev->worker->vtsk->task);
+   llist_add(&work->node, &dev->worker.work_list);
+   wake_up_process(vtsk->task);
}
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
@@ -264,7 +266,7 @@ EXPORT_SYMBOL_GPL(vhost_work_queue);
 /* A lockless hint for busy polling code to exit the loop */
 bool vhost_has_work(struct vhost_dev *dev)
 {
-   return dev->worker && !llist_empty(&dev->worker->work_list);
+   return !llist_empty(&dev->worker.work_list);
 }
 EXPORT_SYMBOL_GPL(vhost_has_work);
 
@@ -468,7 +470,7 @@ void vhost_dev_init(struct vhost_dev *dev,
dev->umem = NULL;
dev->iotlb = NULL;
dev->mm = NULL;
-   dev->worker = NULL;
+   memset(&dev->worker, 0, sizeof(dev->worker));
dev->iov_limit = iov_limit;
dev->weight = weight;
dev->byte_weight = byte_weight;
@@ -542,46 +544,38 @@ static void vhost_detach_mm(struct vhost_dev *dev)
 
 static void vhost_worker_free(struct vhost_dev *dev)
 {
-   struct vhost_worker *worker = dev->worker;
+   struct vhost_task *vtsk = READ_ONCE(dev->worker.vtsk);
 
-   if (!worker)
+   if (!vtsk)
return;
 
-   dev->worker = NULL;
-   WARN_ON(!llist_empty(&worker->work_list));
-   vhost_task_stop(worker->vtsk);
-   kfree(worker);
+   vhost_task_stop(vtsk);
+   WARN_ON(!llist_empty(&dev->worker.work_list));
+   WRITE_ONCE(dev->worker.vtsk, NULL);
 }
 
 static int vhost_worker_create(struct vhost_dev *dev)
 {
-   struct vhost_worker *worker;
struct vhost_task *vtsk;
char name[TASK_COMM_LEN];
int ret;
 
-   worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
-   if (!worker)
-   return -ENOMEM;
-
-   dev->worker = worker;
-   worker->kcov_handle = kcov_common_handle();
-   init_llist_head(&worker->work_list);
+   dev->worker.kcov_handle = kcov_common_handle();
+   init_llist_head(&dev->worker.work_list);
snprintf(name, sizeof(name), "vhost-%d", current->pid);
 
-   vtsk = vhost_task_create(vhost_worker, worker, name);
+   vtsk = vhost_task_create(vhost_worker, &dev->worker, name);
if (!vtsk) {
ret = -ENOMEM;
goto free_worker;
}
 
-   worker->vtsk = vtsk;
+   WRITE_ONCE(dev->worker.vtsk, vtsk);
vhost_task_start(vtsk);
return 0;
 
 free_worker:
-   kfree(worker);
-   dev->worker = NULL;
+   WRITE_ONCE(dev->worker.vtsk, NULL);
return ret;
 }
 
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 0308638cdeee..305ec8593d46 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -154,7 +154,7 @@ struct vhost_dev {
struct vhost_virtqueue **vqs;
int nvqs;
struct eventfd_ctx *log_ctx;
-   struct vhost_worker *worker;
+   struct vhost_worker worker;
struct vhost_iotlb *umem;
struct vhost_iotlb *iotlb;
spinlock_t iotlb_lock;

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [syzbot] [kvm?] [net?] [virt?] general protection fault in vhost_work_queue

2023-05-31 Thread Mike Christie

On 5/31/23 2:27 AM, Stefano Garzarella wrote:
> On Tue, May 30, 2023 at 6:30 PM  wrote:
>>
>> On 5/30/23 11:17 AM, Stefano Garzarella wrote:
>>> On Tue, May 30, 2023 at 11:09:09AM -0500, Mike Christie wrote:
>>>> On 5/30/23 11:00 AM, Stefano Garzarella wrote:
>>>>> I think it is partially related to commit 6e890c5d5021 ("vhost: use
>>>>> vhost_tasks for worker threads") and commit 1a5f8090c6de ("vhost: move
>>>>> worker thread fields to new struct"). Maybe that commits just
>>>>> highlighted the issue and it was already existing.
>>>>
>>>> See my mail about the crash. Agree with your analysis about worker->vtsk
>>>> not being set yet. It's a bug from my commit where I should have not set
>>>> it so early or I should be checking for
>>>>
>>>> if (dev->worker && worker->vtsk)
>>>>
>>>> instead of
>>>>
>>>> if (dev->worker)
>>>
>>> Yes, though, in my opinion the problem may persist depending on how the
>>> instructions are reordered.
>>
>> Ah ok.
>>
>>>
>>> Should we protect dev->worker() with an RCU to be safe?
>>
>> For those multiple worker patchsets Jason had asked me about supporting
>> where we don't have a worker while we are swapping workers around. To do
>> that I had added rcu around the dev->worker. I removed it in later patchsets
>> because I didn't think anyone would use it.
>>
>> rcu would work for your case and for what Jason had requested.
> 
> Yeah, so you already have some patches?
> 
> Do you want to send it to solve this problem?
> 

Yeah, I'll break them out and send them later today when I can retest
rebased patches.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-30 Thread Mike Christie

On 5/29/23 9:38 PM, Eric W. Biederman wrote:
> diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
> index b7cbd66f889e..f3ce0fa6edd7 100644
> --- a/kernel/vhost_task.c
> +++ b/kernel/vhost_task.c

...

>  static int vhost_task_fn(void *data)
>  {
>   struct vhost_task *vtsk = data;
> - int ret;
> + bool dead = false;
> +
> + for (;;) {
> + bool did_work;
> +
> + /* mb paired w/ kthread_stop */
> + set_current_state(TASK_INTERRUPTIBLE);
> +
> + if (test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags)) {
> + __set_current_state(TASK_RUNNING);
> + break;
> + }
> +
> + if (!dead && signal_pending(current)) {
> + struct ksignal ksig;
> + /*
> +  * Calling get_signal will block in SIGSTOP,
> +  * or clear fatal_signal_pending, but remember
> +  * what was set.
> +  *
> +  * This thread won't actually exit until all
> +  * of the file descriptors are closed, and
> +  * the release function is called.
> +  */
> + dead = get_signal(&ksig);

Hey Eric, the patch works well. Thanks! There was just one small issue.

get_signal() does try_to_freeze() -> ... __might_sleep() which wants the
state to be TASK_RUNNING.

We just need the patch below on top of yours which I think also cleans up
some of the state setting weirdness with the code like where vhost.c calls
__set_current_state(TASK_RUNNING) for each work. It looks like that was
not needed for any reason like a work->fn changing the state and the old
code could have done:

node = llist_del_all(&worker->work_list);
if (!node) {
schedule();
continue;
} else {
__set_current_state(TASK_RUNNING);
}

So I think we can do the following on top of your patch:

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 221d1b6c1be5..074273020849 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -346,7 +346,6 @@ static bool vhost_worker(void *data)
smp_wmb();
llist_for_each_entry_safe(work, work_next, node, node) {
clear_bit(VHOST_WORK_QUEUED, &work->flags);
-   __set_current_state(TASK_RUNNING);
kcov_remote_start_common(worker->kcov_handle);
work->fn(work);
kcov_remote_stop();
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index f3ce0fa6edd7..fead2ed29561 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -29,12 +29,8 @@ static int vhost_task_fn(void *data)
bool did_work;
 
/* mb paired w/ kthread_stop */
-   set_current_state(TASK_INTERRUPTIBLE);
-
-   if (test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags)) {
-   __set_current_state(TASK_RUNNING);
+   if (test_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags))
break;
-   }
 
if (!dead && signal_pending(current)) {
struct ksignal ksig;
@@ -53,8 +49,10 @@ static int vhost_task_fn(void *data)
}
 
did_work = vtsk->fn(vtsk->data);
-   if (!did_work)
+   if (!did_work) {
+   set_current_state(TASK_INTERRUPTIBLE);
schedule();
+   }
}
 
complete(&vtsk->exited);



___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [syzbot] [kvm?] [net?] [virt?] general protection fault in vhost_work_queue

2023-05-30 Thread Mike Christie

On 5/30/23 11:00 AM, Stefano Garzarella wrote:
> I think it is partially related to commit 6e890c5d5021 ("vhost: use
> vhost_tasks for worker threads") and commit 1a5f8090c6de ("vhost: move
> worker thread fields to new struct"). Maybe that commits just
> highlighted the issue and it was already existing.

See my mail about the crash. Agree with your analysis about worker->vtsk
not being set yet. It's a bug from my commit where I should have not set
it so early or I should be checking for

if (dev->worker && worker->vtsk)

instead of 

if (dev->worker)

One question about the behavior before my commit though and what we want in
the end going forward. Before that patch we would just drop work if
vhost_work_queue was called before VHOST_SET_OWNER. Was that correct/expected?

The call to vhost_work_queue in vhost_vsock_start was only seeing the
works queued after VHOST_SET_OWNER. Did you want works queued before that?

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [syzbot] [kvm?] [net?] [virt?] general protection fault in vhost_work_queue

2023-05-30 Thread Mike Christie

On 5/30/23 10:58 AM, Mike Christie wrote:
> On 5/30/23 8:44 AM, Stefano Garzarella wrote:
>>
>> From a first glance, it looks like an issue when we call vhost_work_queue().
>> @Mike, does that ring any bells since you recently looked at that code?
> 
> I see the bug. needed to have set the dev->worker after setting worker->vtsk
> like below:
> 
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index a92af08e7864..7bd95984a501 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -564,7 +564,6 @@ static int vhost_worker_create(struct vhost_dev *dev)
>   if (!worker)
>   return -ENOMEM;
>  
> - dev->worker = worker;
>   worker->kcov_handle = kcov_common_handle();
>   init_llist_head(&worker->work_list);
>   snprintf(name, sizeof(name), "vhost-%d", current->pid);
> @@ -576,6 +575,7 @@ static int vhost_worker_create(struct vhost_dev *dev)
>   }
>  
>   worker->vtsk = vtsk;

Shoot, oh wait, I think I needed a smp_wmb to always make sure worker->vtask
is set before dev->worker or vhost_work_queue could still end up seeing
dev->worker set before worker->vtsk right?

> + dev->worker = worker;>  vhost_task_start(vtsk);
>   return 0;
>  

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [syzbot] [kvm?] [net?] [virt?] general protection fault in vhost_work_queue

2023-05-30 Thread Mike Christie

On 5/30/23 8:44 AM, Stefano Garzarella wrote:
> 
> From a first glance, it looks like an issue when we call vhost_work_queue().
> @Mike, does that ring any bells since you recently looked at that code?

I see the bug. needed to have set the dev->worker after setting worker->vtsk
like below:


diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a92af08e7864..7bd95984a501 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -564,7 +564,6 @@ static int vhost_worker_create(struct vhost_dev *dev)
if (!worker)
return -ENOMEM;
 
-   dev->worker = worker;
worker->kcov_handle = kcov_common_handle();
init_llist_head(&worker->work_list);
snprintf(name, sizeof(name), "vhost-%d", current->pid);
@@ -576,6 +575,7 @@ static int vhost_worker_create(struct vhost_dev *dev)
}
 
worker->vtsk = vtsk;
+   dev->worker = worker;
vhost_task_start(vtsk);
return 0;
 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-30 Thread Mike Christie

On 5/29/23 9:38 PM, Eric W. Biederman wrote:
> Mike is there any chance you can test the change below?

Yeah, I'm firing up tests now.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-29 Thread Mike Christie

On 5/29/23 12:46 PM, Oleg Nesterov wrote:
> Mike, sorry, I don't understand your email.
> 
> Just in case, let me remind I know nothing about drivers/vhost/
> 

No problem. I get it. I don't know the signal/thread code so it's
one of those things where I'm also learning as I go.

> On 05/29, michael.chris...@oracle.com wrote:
>>
>> On 5/29/23 6:19 AM, Oleg Nesterov wrote:
>>> On 05/27, Eric W. Biederman wrote:

 Looking forward I don't see not asking the worker threads to stop
 for the coredump right now causing any problems in the future.
 So I think we can use this to resolve the coredump issue I spotted.
>>>
>>> But we have almost the same problem with exec.
>>>
>>> Execing thread will wait for vhost_worker() while vhost_worker will wait for
>>> .release -> vhost_task_stop().
>>
>> For this type of case, what is the goal or correct behavior in the end?
>>
>> When get_signal returns true we can code things like you mention below
> 
> and you have mentioned in the next email that you have already coded something
> like this, so perhaps we can delay the further discussions until you send the
> new code?
> 

Ok. Let me post that. You guys and the vhost devs can argue about if it's
too ugly to merge :)

>> and
>> clean up the task_struct.
> 
> Hmm... If we you CLONE_THREAD the exiting vhost_worker() will auto-reap 
> itself,

Oh wait, are you saying that when we get auto-reaped then we would do the last
fput and call the file_operations->release function right? We actually set
task_struct->files = NULL for the vhost_task task_struct, so I think we call
release a little sooner than you think.

vhost_task_create() sets kernel_clone_args->no_files, so the vhost_task 
task_struc
that gets created works like kthreads where it doesn't do a CLONE_FILES and it
doesn't do a dup_fd.

So when we do de_thread() -> zap_other_threads(), that will kill all the threads
in the group right? So when they exit, it will call our release function since
we don't have refcount on ourself.

> 
>> However, we now have a non-functioning vhost device
>> open and just sitting around taking up memory and it can't do any IO.
> 
> can't comment, see above.
> 
>> For this type of case, do we expect just not to crash/hang, or was this new
>> exec'd thread suppose to be able to use the vhost device?
> 
> I just tried to point out that (unless I missed something) there are more 
> corner
> cases, not just coredump.

Ok. I see.

> 
>>> Or suppose that vhost_worker's sub-thread forks a child with CLONE_FILES...
>>
>> You mean the vhost_task's task/thread doing a function that does a 
>> copy_process
>> right?
> 
> I meant that the vhost_task's sub-thread can do sys_clone(CLONE_FILES) from
> userspace.

I think we were talking about different things. I was saying that when the vhost
layer does vhost_task_create() -> copy_process(), the kernel_clone_args.fn is
vhost_task_fn() -> vhost_worker(). I thought you were concerned about 
vhost_worker()
or some function it calls, calling copy_process() with CLONE_FILES.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-29 Thread Mike Christie

On 5/29/23 12:54 PM, Oleg Nesterov wrote:
> On 05/29, Oleg Nesterov wrote:
>>
>> Mike, sorry, I don't understand your email.
>>
>> Just in case, let me remind I know nothing about drivers/vhost/
> 
> And... this is slightly off-topic but you didn't answer my previous
> question and I am just curious. Do you agree that, even if we forget
> about CLONE_THREAD/signals, vhost_worker() needs fixes anyway because
> it can race with vhost_work_queue() ?

You saw the reply where I wrote I was waiting for the vhost devs to
reply because it's their code, right? Just wanted to make sure you know
I'm not ignoring your mails. The info is very valuable to me since I don't
know the signal code.

- I think you are right about using a continue after schedule.
- The work fn is stable. It's setup once and never changes.
- I have no idea why we do the __set_current_state(TASK_RUNNING). As
far as I can tell the work functions do not change the task state and
that might have been a left over from something else. However, the
vhost devs might know of some case.
- For the barrier use, I think you are right, but I couldn't trigger
an issue even if I hack up different timings. So I was hoping the
vhost devs know of something else in there.


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-28 Thread Mike Christie

On 5/27/23 8:41 PM, Eric W. Biederman wrote:
> Mike Christie  writes:
> 
>> On 5/23/23 7:15 AM, Oleg Nesterov wrote:
>>>
>>> Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right
>>> before we call work->fn(). Is it "safe" to run this callback with
>>> signal_pending() or fatal_signal_pending() ?
>>
>> The questions before this one I'll leave for the core vhost devs since
>> they know best.
> 
> Let me ask a clarifying question:
> 
> Is it only the call to schedule() in vhost_worker that you are worried
> about not sleeping if signal_pending() or fatal_signal_pending()?

It will only be the vhost_worker call to schedule().

When we do the file_operation's release call, we normally set things up
so the work->fn just fails and cleans up. I can pretty easily move that
code into a helper and do:

if (get_signal(ksig))
new_function_to_tell_drivers_that_all_work_fns_should_fail()


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-25 Thread Mike Christie

On 5/23/23 7:15 AM, Oleg Nesterov wrote:
> 
> Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right
> before we call work->fn(). Is it "safe" to run this callback with
> signal_pending() or fatal_signal_pending() ?

The questions before this one I'll leave for the core vhost devs since
they know best.

For this question and the one below, when we get SIGKILL we think it's ok
to fail the operation as in it's ok to not execute it like normal (send
it down to the net/target/scsi/block layers for execution). However, we
need to do some processing of the work to release mem, refcounts, etc.

For example we use the vhost_task to handle completions from the layers
we interact with. So when we get a SIGKILL, we could have N commands in
the target/block/scsi/nvme layers below the vhost layer. When those
complete, the current code assumes we have the vhost_task to handle the
responses. So for pending works on that work_list, we can pretty easily
kill them. However, we don't have a way to kill those outstanding
requests to some other layer, so we have to wait and handle them
somewhere.

If you are saying that once we get SIGKILL then we just can't use the
task anymore and we have to drop down to workqueue/kthread or change up
the completion paths so they can execute in the context they are completed
from by lower levels, then I can code this. On the vhost side, it's just
really ugly vs the code we have now that used to use kthreads (or in
6.4-rc looks like a process) so we couldn't get signals and just had the
one code path that removes us.

>From the vhost point of view signals are not useful to us and it's just
adding complexity for something that I don't think is useful to users.
However after discussing this with guys for a week and going over the
signal code, I can see your point of views where you guys are thinking its
ugly for the signal code to try and support what we want :)

> 
> 
> Finally. I never looked into drivers/vhost/ before so I don't understand
> this code at all, but let me ask anyway... Can we change vhost_dev_flush()
> to run the pending callbacks rather than wait for vhost_worker() ?
> I guess we can't, ->mm won't be correct, but can you confirm?
> 
> Oleg.
> 

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 1/3] vhost-scsi: Fix alignment handling with windows

2023-05-25 Thread Mike Christie

On 5/25/23 2:57 AM, Michael S. Tsirkin wrote:
> On Wed, May 24, 2023 at 06:34:05PM -0500, Mike Christie wrote:
>> The linux block layer requires bios/requests to have lengths with a 512
>> byte alignment. Some drivers/layers like dm-crypt and the directi IO code
>> will test for it and just fail. Other drivers like SCSI just assume the
>> requirement is met and will end up in infinte retry loops. The problem
>> for drivers like SCSI is that it uses functions like blk_rq_cur_sectors
>> and blk_rq_sectors which divide the request's length by 512. If there's
>> lefovers then it just gets dropped. But other code in the block/scsi
>> layer may use blk_rq_bytes/blk_rq_cur_bytes and end up thinking there is
>> still data left and try to retry the cmd. We can then end up getting
>> stuck in retry loops where part of the block/scsi thinks there is data
>> left, but other parts think we want to do IOs of zero length.
>>
>> Linux will always check for alignment, but windows will not. When
>> vhost-scsi then translates the iovec it gets from a windows guest to a
>> scatterlist, we can end up with sg items where the sg->length is not
>> divisible by 512 due to the misaligned offset:
>>
>> sg[0].offset = 255;
>> sg[0].length = 3841;
>> sg...
>> sg[N].offset = 0;
>> sg[N].length = 255;
>>
>> When the lio backends then convert the SG to bios or other iovecs, we
>> end up sending them with the same misaligned values and can hit the
>> issues above.
>>
>> This just has us drop down to allocating a temp page and copying the data
>> when this happens. Because this adds a check that needs to loop over the
>> iovec in the main IO path, this patch adds an attribute that can be turned
>> on for just windows.
>>
>> Signed-off-by: Mike Christie 
> 
> I am assuming this triggers rarely, yes?
> 
> If so I would like to find a way to avoid setting an attribute.
> 
> I am guessing the main cost is an extra scan of iov.

The scan and a memory allocation to dup the iter. However, I see
iov_iter_revert and I think that might be what I needed before to
avoid the mem alloc so will try it out.


> vhost_scsi_iov_to_sgl already does a scan, how about changing
> it to fail if misaligned?
> We can then detect the relevant error code and try with a copy.
> 
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH 3/3] vhost-scsi: Rename vhost_scsi_iov_to_sgl

2023-05-24 Thread Mike Christie

Rename vhost_scsi_iov_to_sgl to vhost_scsi_map_iov_to_sgl so it matches
matches the naming style used for vhost_scsi_copy_iov_to_sgl.

Signed-off-by: Mike Christie 
---
 drivers/vhost/scsi.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 382158b39740..a4d32b96e66a 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -694,8 +694,8 @@ vhost_scsi_calc_sgls(struct iov_iter *iter, size_t bytes, 
int max_sgls)
 }
 
 static int
-vhost_scsi_iov_to_sgl(struct vhost_scsi_cmd *cmd, struct iov_iter *iter,
- struct scatterlist *sg, int sg_count)
+vhost_scsi_map_iov_to_sgl(struct vhost_scsi_cmd *cmd, struct iov_iter *iter,
+ struct scatterlist *sg, int sg_count)
 {
struct scatterlist *p = sg;
int ret;
@@ -778,8 +778,9 @@ vhost_scsi_mapal(struct vhost_scsi_tpg *tpg, struct 
vhost_scsi_cmd *cmd,
pr_debug("%s prot_sg %p prot_sgl_count %u\n", __func__,
 cmd->tvc_prot_sgl, cmd->tvc_prot_sgl_count);
 
-   ret = vhost_scsi_iov_to_sgl(cmd, prot_iter, cmd->tvc_prot_sgl,
-   cmd->tvc_prot_sgl_count);
+   ret = vhost_scsi_map_iov_to_sgl(cmd, prot_iter,
+   cmd->tvc_prot_sgl,
+   cmd->tvc_prot_sgl_count);
if (ret < 0)
goto map_fail;
}
@@ -808,8 +809,8 @@ vhost_scsi_mapal(struct vhost_scsi_tpg *tpg, struct 
vhost_scsi_cmd *cmd,
ret = vhost_scsi_copy_iov_to_sgl(cmd, data_iter, cmd->tvc_sgl,
 cmd->tvc_sgl_count);
else
-   ret = vhost_scsi_iov_to_sgl(cmd, data_iter, cmd->tvc_sgl,
-   cmd->tvc_sgl_count);
+   ret = vhost_scsi_map_iov_to_sgl(cmd, data_iter, cmd->tvc_sgl,
+   cmd->tvc_sgl_count);
if (ret)
goto map_fail;
 
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH 1/3] vhost-scsi: Fix alignment handling with windows

2023-05-24 Thread Mike Christie

The linux block layer requires bios/requests to have lengths with a 512
byte alignment. Some drivers/layers like dm-crypt and the directi IO code
will test for it and just fail. Other drivers like SCSI just assume the
requirement is met and will end up in infinte retry loops. The problem
for drivers like SCSI is that it uses functions like blk_rq_cur_sectors
and blk_rq_sectors which divide the request's length by 512. If there's
lefovers then it just gets dropped. But other code in the block/scsi
layer may use blk_rq_bytes/blk_rq_cur_bytes and end up thinking there is
still data left and try to retry the cmd. We can then end up getting
stuck in retry loops where part of the block/scsi thinks there is data
left, but other parts think we want to do IOs of zero length.

Linux will always check for alignment, but windows will not. When
vhost-scsi then translates the iovec it gets from a windows guest to a
scatterlist, we can end up with sg items where the sg->length is not
divisible by 512 due to the misaligned offset:

sg[0].offset = 255;
sg[0].length = 3841;
sg...
sg[N].offset = 0;
sg[N].length = 255;

When the lio backends then convert the SG to bios or other iovecs, we
end up sending them with the same misaligned values and can hit the
issues above.

This just has us drop down to allocating a temp page and copying the data
when this happens. Because this adds a check that needs to loop over the
iovec in the main IO path, this patch adds an attribute that can be turned
on for just windows.

Signed-off-by: Mike Christie 
---
 drivers/vhost/scsi.c | 174 +--
 1 file changed, 151 insertions(+), 23 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index bb10fa4bb4f6..dbad8fb579eb 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -25,6 +25,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -75,6 +77,9 @@ struct vhost_scsi_cmd {
u32 tvc_prot_sgl_count;
/* Saved unpacked SCSI LUN for vhost_scsi_target_queue_cmd() */
u32 tvc_lun;
+   u32 copied_iov:1;
+   const void *saved_iter_addr;
+   struct iov_iter saved_iter;
/* Pointer to the SGL formatted memory from virtio-scsi */
struct scatterlist *tvc_sgl;
struct scatterlist *tvc_prot_sgl;
@@ -107,6 +112,7 @@ struct vhost_scsi_nexus {
 struct vhost_scsi_tpg {
/* Vhost port target portal group tag for TCM */
u16 tport_tpgt;
+   bool check_io_alignment;
/* Used to track number of TPG Port/Lun Links wrt to explict I_T Nexus 
shutdown */
int tv_tpg_port_count;
/* Used for vhost_scsi device reference to tpg_nexus, protected by 
tv_tpg_mutex */
@@ -328,8 +334,13 @@ static void vhost_scsi_release_cmd_res(struct se_cmd 
*se_cmd)
int i;
 
if (tv_cmd->tvc_sgl_count) {
-   for (i = 0; i < tv_cmd->tvc_sgl_count; i++)
-   put_page(sg_page(&tv_cmd->tvc_sgl[i]));
+   for (i = 0; i < tv_cmd->tvc_sgl_count; i++) {
+   if (tv_cmd->copied_iov)
+   __free_page(sg_page(&tv_cmd->tvc_sgl[i]));
+   else
+   put_page(sg_page(&tv_cmd->tvc_sgl[i]));
+   }
+   kfree(tv_cmd->saved_iter_addr);
}
if (tv_cmd->tvc_prot_sgl_count) {
for (i = 0; i < tv_cmd->tvc_prot_sgl_count; i++)
@@ -502,6 +513,27 @@ static void vhost_scsi_evt_work(struct vhost_work *work)
mutex_unlock(&vq->mutex);
 }
 
+static int vhost_scsi_copy_sgl_to_iov(struct vhost_scsi_cmd *cmd)
+{
+   struct iov_iter *iter = &cmd->saved_iter;
+   struct scatterlist *sg = cmd->tvc_sgl;
+   struct page *page;
+   size_t len;
+   int i;
+
+   for (i = 0; i < cmd->tvc_sgl_count; i++) {
+   page = sg_page(&sg[i]);
+   len = sg[i].length;
+
+   if (copy_page_to_iter(page, 0, len, iter) != len) {
+   pr_err("Could not copy data. Error %lu\n", len);
+   return -1;
+   }
+   }
+
+   return 0;
+}
+
 /* Fill in status and signal that we are done processing this command
  *
  * This is scheduled in the vhost work queue so we are called with the owner
@@ -525,15 +557,20 @@ static void vhost_scsi_complete_cmd_work(struct 
vhost_work *work)
 
pr_debug("%s tv_cmd %p resid %u status %#02x\n", __func__,
cmd, se_cmd->residual_count, se_cmd->scsi_status);
-
memset(&v_rsp, 0, sizeof(v_rsp));
-   v_rsp.resid = cpu_to_vhost32(cmd->tvc_vq, 
se_cmd->residual_count);
-   /* TODO is status_qualifier field needed? */
-   v_rsp.status = se_cmd->scsi_status;
-   v_rsp.sense_len = cpu_to_vh

[PATCH 2/3] vhost-scsi: Remove unused write argument

2023-05-24 Thread Mike Christie

The write arg that's passed to the mapping functions is not used so
remove it.

Signed-off-by: Mike Christie 
---
 drivers/vhost/scsi.c | 19 +++
 1 file changed, 7 insertions(+), 12 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index dbad8fb579eb..382158b39740 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -649,10 +649,8 @@ vhost_scsi_get_cmd(struct vhost_virtqueue *vq, struct 
vhost_scsi_tpg *tpg,
  * Returns the number of scatterlist entries used or -errno on error.
  */
 static int
-vhost_scsi_map_to_sgl(struct vhost_scsi_cmd *cmd,
- struct iov_iter *iter,
- struct scatterlist *sgl,
- bool write)
+vhost_scsi_map_to_sgl(struct vhost_scsi_cmd *cmd, struct iov_iter *iter,
+ struct scatterlist *sgl)
 {
struct page **pages = cmd->tvc_upages;
struct scatterlist *sg = sgl;
@@ -696,15 +694,14 @@ vhost_scsi_calc_sgls(struct iov_iter *iter, size_t bytes, 
int max_sgls)
 }
 
 static int
-vhost_scsi_iov_to_sgl(struct vhost_scsi_cmd *cmd, bool write,
- struct iov_iter *iter,
+vhost_scsi_iov_to_sgl(struct vhost_scsi_cmd *cmd, struct iov_iter *iter,
  struct scatterlist *sg, int sg_count)
 {
struct scatterlist *p = sg;
int ret;
 
while (iov_iter_count(iter)) {
-   ret = vhost_scsi_map_to_sgl(cmd, iter, sg, write);
+   ret = vhost_scsi_map_to_sgl(cmd, iter, sg);
if (ret < 0) {
while (p < sg) {
struct page *page = sg_page(p++);
@@ -769,7 +766,6 @@ vhost_scsi_mapal(struct vhost_scsi_tpg *tpg, struct 
vhost_scsi_cmd *cmd,
 size_t data_bytes, struct iov_iter *data_iter)
 {
int sgl_count, ret;
-   bool write = (cmd->tvc_data_direction == DMA_FROM_DEVICE);
 
if (prot_bytes) {
sgl_count = vhost_scsi_calc_sgls(prot_iter, prot_bytes,
@@ -782,8 +778,7 @@ vhost_scsi_mapal(struct vhost_scsi_tpg *tpg, struct 
vhost_scsi_cmd *cmd,
pr_debug("%s prot_sg %p prot_sgl_count %u\n", __func__,
 cmd->tvc_prot_sgl, cmd->tvc_prot_sgl_count);
 
-   ret = vhost_scsi_iov_to_sgl(cmd, write, prot_iter,
-   cmd->tvc_prot_sgl,
+   ret = vhost_scsi_iov_to_sgl(cmd, prot_iter, cmd->tvc_prot_sgl,
cmd->tvc_prot_sgl_count);
if (ret < 0)
goto map_fail;
@@ -813,8 +808,8 @@ vhost_scsi_mapal(struct vhost_scsi_tpg *tpg, struct 
vhost_scsi_cmd *cmd,
ret = vhost_scsi_copy_iov_to_sgl(cmd, data_iter, cmd->tvc_sgl,
 cmd->tvc_sgl_count);
else
-   ret = vhost_scsi_iov_to_sgl(cmd, write, data_iter,
-   cmd->tvc_sgl, cmd->tvc_sgl_count);
+   ret = vhost_scsi_iov_to_sgl(cmd, data_iter, cmd->tvc_sgl,
+   cmd->tvc_sgl_count);
if (ret)
goto map_fail;
 
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH 0/3] vhost-scsi: Fix IO hangs when using windows

2023-05-24 Thread Mike Christie

The following patches were made over Linus's tree and fix an issue
where windows guests will send iovecs with offset/lengths that result
in IOs that are not aligned to 512. The LIO layer will then send them
to Linux's block layer but it requires 512 byte alignment, so depending
on the block driver being used we will get IO errors or hung IO.

The following patches have vhost-scsi detect when windows sends these
IOs and copy them to a bounce buffer. It then does some cleanup in
the related code.


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-23 Thread Mike Christie

Hey Oleg,

For all these questions below let me get back to you by tomorrow.
I need to confirm if something would be considered a regression by
the core vhost developers.

On 5/23/23 7:15 AM, Oleg Nesterov wrote:
> On 05/22, Oleg Nesterov wrote:
>>
>> Right now I think that "int dead" should die,
> 
> No, probably we shouldn't call get_signal() if we have already dequeued 
> SIGKILL.
> 
>> but let me think tomorrow.
> 
> May be something like this... I don't like it but I can't suggest anything 
> better
> right now.
> 
>   bool killed = false;
> 
>   for (;;) {
>   ...
>   
>   node = llist_del_all(&worker->work_list);
>   if (!node) {
>   schedule();
>   /*
>* When we get a SIGKILL our release function will
>* be called. That will stop new IOs from being queued
>* and check for outstanding cmd responses. It will then
>* call vhost_task_stop to tell us to return and exit.
>*/
>   if (signal_pending(current)) {
>   struct ksignal ksig;
> 
>   if (!killed)
>   killed = get_signal(&ksig);
> 
>   clear_thread_flag(TIF_SIGPENDING);
>   }
> 
>   continue;
>   }
> 
> ---
> But let me ask a couple of questions. Let's forget this patch, let's look at 
> the
> current code:
> 
>   node = llist_del_all(&worker->work_list);
>   if (!node)
>   schedule();
> 
>   node = llist_reverse_order(node);
>   ... process works ...
> 
> To me this looks a bit confusing. Shouldn't we do
> 
>   if (!node) {
>   schedule();
>   continue;
>   }
> 
> just to make the code a bit more clear? If node == NULL then
> llist_reverse_order() and llist_for_each_entry_safe() will do nothing.
> But this is minor.
> 
> 
> 
>   /* make sure flag is seen after deletion */
>   smp_wmb();
>   llist_for_each_entry_safe(work, work_next, node, node) {
>   clear_bit(VHOST_WORK_QUEUED, &work->flags);
> 
> I am not sure about smp_wmb + clear_bit. Once we clear VHOST_WORK_QUEUED,
> vhost_work_queue() can add this work again and change work->node->next.
> 
> That is why we use _safe, but we need to ensure that llist_for_each_safe()
> completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared.
> 
> So it seems that smp_wmb() can't help and should be removed, instead we need
> 
>   llist_for_each_entry_safe(...) {
>   smp_mb__before_atomic();
>   clear_bit(VHOST_WORK_QUEUED, &work->flags);
> 
> Also, if the work->fn pointer is not stable, we should read it before
> smp_mb__before_atomic() as well.
> 
> No?
> 
> 
>   __set_current_state(TASK_RUNNING);
> 
> Why do we set TASK_RUNNING inside the loop? Does this mean that work->fn()
> can return with current->state != RUNNING ?
> 
> 
>   work->fn(work);
> 
> Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right
> before we call work->fn(). Is it "safe" to run this callback with
> signal_pending() or fatal_signal_pending() ?
> 
> 
> Finally. I never looked into drivers/vhost/ before so I don't understand
> this code at all, but let me ask anyway... Can we change vhost_dev_flush()
> to run the pending callbacks rather than wait for vhost_worker() ?
> I guess we can't, ->mm won't be correct, but can you confirm?
> 
> Oleg.
> 

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-23 Thread Mike Christie

On 5/22/23 2:40 PM, Michael S. Tsirkin wrote:
>>  return copy_process(NULL, 0, node, &args);
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index 8050fe23c732..a0f00a078cbb 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -2891,6 +2891,7 @@ bool get_signal(struct ksignal *ksig)
>>  
>>  return ksig->sig > 0;
>>  }
>> +EXPORT_SYMBOL_GPL(get_signal);
> 
> If you are exporting this, could you add documentation please?
> 

Ok.


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-22 Thread Mike Christie

On 5/22/23 7:30 AM, Oleg Nesterov wrote:
> Confused, please help...
> 
> On 05/21, Mike Christie wrote:
>>
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -338,6 +338,7 @@ static int vhost_worker(void *data)
>>  struct vhost_worker *worker = data;
>>  struct vhost_work *work, *work_next;
>>  struct llist_node *node;
>> +bool dead = false;
>>
>>  for (;;) {
>>  /* mb paired w/ kthread_stop */
>> @@ -349,8 +350,22 @@ static int vhost_worker(void *data)
>>  }
>>
>>  node = llist_del_all(&worker->work_list);
>> -if (!node)
>> +if (!node) {
>>  schedule();
>> +/*
>> + * When we get a SIGKILL our release function will
>> + * be called. That will stop new IOs from being queued
>> + * and check for outstanding cmd responses. It will then
>> + * call vhost_task_stop to tell us to return and exit.
>> + */
> 
> But who will call the release function / vhost_task_stop() and when this
> will happen after this thread gets SIGKILL ?

When we get a SIGKILL, the thread that owns the device/vhost_task will
also exit since it's the same thread group and it does:

do_exit -> exit_files -> put_files_struct -> close_files -> fput

which eventually (the actual put is done via the delayed work path
in there) runs the file_operation->release.

So the release function is being called in parallel more or less as the
code above.

> 
>> +if (!dead && signal_pending(current)) {
>> +struct ksignal ksig;
>> +
>> +dead = get_signal(&ksig);
>> +if (dead)
>> +clear_thread_flag(TIF_SIGPENDING);
> 
> If you do clear_thread_flag(TIF_SIGPENDING), then why do we need 1/3 ?

You're right. I don't need it. I thought __fatal_signal_pending checked
the shared pending as well but it only checks pending.

> Also. Suppose that vhost_worker() dequeues SIGKILL and clears TIF_SIGPENDING.
> 
> SIGSTOP, PTRACE_INTERRUPT, freezer can come and set TIF_SIGPENDING again.
> In this case the main for (;;) loop will spin without sleeping until
> vhost_task_should_stop() becomes true?

I see. So I either have to be able to call get_signal after SIGKILL or
at this time work like a kthread and ignore signals like a

if (dead && signal_pending())
flush_signals()
?

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH 0/3] vhost: Fix freezer/ps regressions

2023-05-21 Thread Mike Christie

The following patches made over Linus's tree fix the 2 bugs:

1. vhost worker task shows up as a process forked from the parent
that did VHOST_SET_OWNER ioctl instead of a process under root/kthreadd.
This was causing breaking scripts.
2. vhost_tasks didn't disable or add support for freeze requests.

The following patches fix these issues by making the vhost_task task
a thread under the process that did the VHOST_SET_OWNER and uses
get_signal() to handle freeze and SIGSTOP/KILL signals which is required
when using CLONE_THREAD (really CLONE_THREAD requires CLONE_SIGHAND
which requires SIGKILL/STOP to be supported).


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-21 Thread Mike Christie

When switching from kthreads to vhost_tasks two bugs were added:
1. The vhost worker tasks's now show up as processes so scripts doing ps
or ps a would not incorrectly detect the vhost task as another process.
2. kthreads disabled freeze by setting PF_NOFREEZE, but vhost tasks's
didn't disable or add support for them.

To fix both bugs, this switches the vhost task to be thread in the
process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call
get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that
SIGKILL/STOP support is required because CLONE_THREAD requires
CLONE_SIGHAND which requires those 2 signals to be suppported.

This a modified version of patch originally written by Linus which
handles his review comment to himself to rename ignore_signals to
block_signals to better represent what it now does. And it includes a
change to vhost_worker() to support SIGSTOP/KILL and freeze, and it
drops the wait use per Oleg's review comment that it's no longer needed
when using CLONE_THREAD.

Fixes: 6e890c5d5021 ("vhost: use vhost_tasks for worker threads")
Signed-off-by: Mike Christie 
---
 drivers/vhost/vhost.c  | 17 -
 include/linux/sched/task.h |  2 +-
 kernel/fork.c  | 12 +++-
 kernel/signal.c|  1 +
 kernel/vhost_task.c| 16 
 5 files changed, 25 insertions(+), 23 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a92af08e7864..bf83e9340e40 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -338,6 +338,7 @@ static int vhost_worker(void *data)
struct vhost_worker *worker = data;
struct vhost_work *work, *work_next;
struct llist_node *node;
+   bool dead = false;
 
for (;;) {
/* mb paired w/ kthread_stop */
@@ -349,8 +350,22 @@ static int vhost_worker(void *data)
}
 
node = llist_del_all(&worker->work_list);
-   if (!node)
+   if (!node) {
schedule();
+   /*
+* When we get a SIGKILL our release function will
+* be called. That will stop new IOs from being queued
+* and check for outstanding cmd responses. It will then
+* call vhost_task_stop to tell us to return and exit.
+*/
+   if (!dead && signal_pending(current)) {
+   struct ksignal ksig;
+
+   dead = get_signal(&ksig);
+   if (dead)
+   clear_thread_flag(TIF_SIGPENDING);
+   }
+   }
 
node = llist_reverse_order(node);
/* make sure flag is seen after deletion */
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 537cbf9a2ade..249a5ece9def 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -29,7 +29,7 @@ struct kernel_clone_args {
u32 io_thread:1;
u32 user_worker:1;
u32 no_files:1;
-   u32 ignore_signals:1;
+   u32 block_signals:1;
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
diff --git a/kernel/fork.c b/kernel/fork.c
index ed4e01daccaa..9e04ab5c3946 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2338,14 +2338,10 @@ __latent_entropy struct task_struct *copy_process(
p->flags |= PF_KTHREAD;
if (args->user_worker)
p->flags |= PF_USER_WORKER;
-   if (args->io_thread) {
-   /*
-* Mark us an IO worker, and block any signal that isn't
-* fatal or STOP
-*/
+   if (args->io_thread)
p->flags |= PF_IO_WORKER;
+   if (args->block_signals)
siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
-   }
 
if (args->name)
strscpy_pad(p->comm, args->name, sizeof(p->comm));
@@ -2517,9 +2513,6 @@ __latent_entropy struct task_struct *copy_process(
if (retval)
goto bad_fork_cleanup_io;
 
-   if (args->ignore_signals)
-   ignore_signals(p);
-
stackleak_task_init(p);
 
if (pid != &init_struct_pid) {
@@ -2861,6 +2854,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), 
void *arg, int node)
.fn_arg = arg,
.io_thread  = 1,
.user_worker= 1,
+   .block_signals  = 1,
};
 
return copy_process(NULL, 0, node, &args);
diff --git a/kernel/signal.c b/kernel/signal.c
index 8050fe23c732..a0f00a078cbb 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2891,6 +2891,7 @@ bool get_signal(struct ksignal *ksig)

[PATCH 2/3] signal: Don't exit for PF_USER_WORKER tasks

2023-05-21 Thread Mike Christie

vhost_tasks also need to perform cleanup before exiting so this changes
the check in get_signal to be more generic so both io thread and vhost
tasks can do their cleanup.

Signed-off-by: Mike Christie 
---
 kernel/signal.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 3dc99b9aec7f..8050fe23c732 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2869,11 +2869,11 @@ bool get_signal(struct ksignal *ksig)
}
 
/*
-* PF_IO_WORKER threads will catch and exit on fatal signals
+* PF_USER_WORKER threads will catch and exit on fatal signals
 * themselves. They have cleanup that must be performed, so
 * we cannot call do_exit() on their behalf.
 */
-   if (current->flags & PF_IO_WORKER)
+   if (current->flags & PF_USER_WORKER)
goto out;
 
/*
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH 1/3] signal: Don't always put SIGKILL in shared_pending

2023-05-21 Thread Mike Christie

When get_pending detects the task has been marked to be killed we try to
clean up the SIGKLL by doing a sigdelset and recalc_sigpending, but we
still leave it in shared_pending. If the signal is being short circuit
delivered there is no need to put in shared_pending so this adds a check
in complete_signal.

This patch was modified from Eric Biederman 
original patch.

Signed-off-by: Mike Christie 
---
 kernel/signal.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/kernel/signal.c b/kernel/signal.c
index 8f6330f0e9ca..3dc99b9aec7f 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1052,6 +1052,14 @@ static void complete_signal(int sig, struct task_struct 
*p, enum pid_type type)
signal->flags = SIGNAL_GROUP_EXIT;
signal->group_exit_code = sig;
signal->group_stop_count = 0;
+
+   /*
+* The signal is being short circuit delivered so
+* don't set pending.
+*/
+   if (type != PIDTYPE_PID)
+   sigdelset(&signal->shared_pending.signal, sig);
+
t = p;
do {
task_clear_jobctl_pending(t, 
JOBCTL_PENDING_MASK);
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set

2023-05-19 Thread Mike Christie

On 5/18/23 11:16 PM, Eric W. Biederman wrote:
> Mike Christie  writes:
> 
>> On 5/18/23 1:28 PM, Eric W. Biederman wrote:
>>> Still the big issue seems to be the way get_signal is connected into
>>> these threads so that it keeps getting called.  Calling get_signal after
>>> a fatal signal has been returned happens nowhere else and even if we fix
>>> it today it is likely to lead to bugs in the future because whoever is
>>> testing and updating the code is unlikely they have a vhost test case
>>> the care about.
>>>
>>> diff --git a/kernel/signal.c b/kernel/signal.c
>>> index 8f6330f0e9ca..4d54718cad36 100644
>>> --- a/kernel/signal.c
>>> +++ b/kernel/signal.c
>>> @@ -181,7 +181,9 @@ void recalc_sigpending_and_wake(struct task_struct *t)
>>>  
>>>  void recalc_sigpending(void)
>>>  {
>>> -   if (!recalc_sigpending_tsk(current) && !freezing(current))
>>> +   if ((!recalc_sigpending_tsk(current) && !freezing(current)) ||
>>> +   ((current->signal->flags & SIGNAL_GROUP_EXIT) &&
>>> +   !__fatal_signal_pending(current)))
>>> clear_thread_flag(TIF_SIGPENDING);
>>>  
>>>  }
>>> @@ -1043,6 +1045,13 @@ static void complete_signal(int sig, struct 
>>> task_struct *p, enum pid_type type)
>>>  * This signal will be fatal to the whole group.
>>>  */
>>> if (!sig_kernel_coredump(sig)) {
>>> +   /*
>>> +* The signal is being short circuit delivered
>>> +* don't it pending.
>>> +*/
>>> +   if (type != PIDTYPE_PID) {
>>> +   sigdelset(&t->signal->shared_pending,  sig);
>>> +
>>> /*
>>>  * Start a group exit and wake everybody up.
>>>  * This way we don't have other threads
>>>
>>
>> If I change up your patch so the last part is moved down a bit to when we 
>> set t
>> like this:
>>
>> diff --git a/kernel/signal.c b/kernel/signal.c
>> index 0ac48c96ab04..c976a80650db 100644
>> --- a/kernel/signal.c
>> +++ b/kernel/signal.c
>> @@ -181,9 +181,10 @@ void recalc_sigpending_and_wake(struct task_struct *t)
>>  
>>  void recalc_sigpending(void)
>>  {
>> -if (!recalc_sigpending_tsk(current) && !freezing(current))
>> +if ((!recalc_sigpending_tsk(current) && !freezing(current)) ||
>> +((current->signal->flags & SIGNAL_GROUP_EXIT) &&
>> + !__fatal_signal_pending(current)))
>>  clear_thread_flag(TIF_SIGPENDING);
>> -
> Can we get rid of this suggestion to recalc_sigpending.  The more I look
> at it the more I am convinced it is not safe.  In particular I believe
> it is incompatible with dump_interrupted() in fs/coredump.c


With your clear_thread_flag call in vhost_worker suggestion I don't need
the above chunk.


> 
> The code in fs/coredump.c is the closest code we have to what you are
> trying to do with vhost_worker after the session is killed.  It also
> struggles with TIF_SIGPENDING getting set. 
>>  }
>>  EXPORT_SYMBOL(recalc_sigpending);
>>  
>> @@ -1053,6 +1054,17 @@ static void complete_signal(int sig, struct 
>> task_struct *p, enum pid_type type)
>>  signal->group_exit_code = sig;
>>  signal->group_stop_count = 0;
>>  t = p;
>> +/*
>> + * The signal is being short circuit delivered
>> + * don't it pending.
>> + */
>> +if (type != PIDTYPE_PID) {
>> +struct sigpending *pending;
>> +
>> +pending = &t->signal->shared_pending;
>> +sigdelset(&pending->signal, sig);
>> +}
>> +
>>  do {
>>  task_clear_jobctl_pending(t, 
>> JOBCTL_PENDING_MASK);
>>  sigaddset(&t->pending.signal, SIGKILL);
>>
>>
>> Then get_signal() works like how Oleg mentioned it should earlier.
> 
> I am puzzled it makes a difference as t->signal and p->signal should
> point to the same thing, and in fact the code

Re: [PATCH 13/14] vhost: allow userspace to create workers

2023-05-18 Thread Mike Christie

On 5/16/23 10:10 PM, Jason Wang wrote:
> On Sat, Apr 29, 2023 at 12:32 AM Mike Christie
>  wrote:
>>
>> For vhost-scsi with 3 vqs or more and a workload that tries to use
>> them in parallel like:
>>
>> fio --filename=/dev/sdb  --direct=1 --rw=randrw --bs=4k \
>> --ioengine=libaio --iodepth=128  --numjobs=3
>>
>> the single vhost worker thread will become a bottlneck and we are stuck
>> at around 500K IOPs no matter how many jobs, virtqueues, and CPUs are
>> used.
>>
>> To better utilize virtqueues and available CPUs, this patch allows
>> userspace to create workers and bind them to vqs. You can have N workers
>> per dev and also share N workers with M vqs on that dev.
>>
>> This patch adds the interface related code and the next patch will hook
>> vhost-scsi into it. The patches do not try to hook net and vsock into
>> the interface because:
>>
>> 1. multiple workers don't seem to help vsock. The problem is that with
>> only 2 virtqueues we never fully use the existing worker when doing
>> bidirectional tests. This seems to match vhost-scsi where we don't see
>> the worker as a bottleneck until 3 virtqueues are used.
>>
>> 2. net already has a way to use multiple workers.
>>
>> Signed-off-by: Mike Christie 
>> ---
>>  drivers/vhost/vhost.c| 145 ++-
>>  drivers/vhost/vhost.h|   3 +
>>  include/uapi/linux/vhost.h   |  33 +++
>>  include/uapi/linux/vhost_types.h |  16 
>>  4 files changed, 196 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index 4b0b82292379..e8f829f35814 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -630,6 +630,80 @@ static struct vhost_worker *vhost_worker_create(struct 
>> vhost_dev *dev)
>> return NULL;
>>  }
>>
>> +/* Caller must have device mutex */
>> +static void __vhost_vq_attach_worker(struct vhost_virtqueue *vq,
>> +struct vhost_worker *worker)
>> +{
>> +   if (vq->worker)
>> +   vq->worker->attachment_cnt--;
>> +   worker->attachment_cnt++;
>> +   vq->worker = worker;
>> +}
>> +
>> +/**
>> + * vhost_vq_attach_worker - set a virtqueue's worker from an ioctl command
>> + * @vq: the virtqueue we will set the worker for
>> + * @info: the worker userspace has requested us to use
>> + *
>> + * We only allow userspace to set a virtqueue's worker if it's not active 
>> and
>> + * polling is not enabled.
> 
> I wonder if we can mandate this in the code like check the vq backend
> in vhost_vq_work_queue().

I'll look into it. However, for the vsock case we actually do want to
queue the work even though there is no backend set yet. It sort of just
keeps queueing works until VHOST_VSOCK_SET_RUNNING is executed. When that's
done it will run the works that have been queueing up.


> 
>  We also assume drivers supporting this will not be
>> + * internally queueing works directly or via calls like vhost_dev_flush at
>> + * this time.
>> + *
>> + * Caller must have device and virtqueue mutex.
>> + */
>> +static int vhost_vq_attach_worker(struct vhost_virtqueue *vq,
>> + struct vhost_vring_worker *info)
>> +{
>> +   unsigned long index = info->worker_id;
>> +   struct vhost_dev *dev = vq->dev;
>> +   struct vhost_worker *worker;
>> +
>> +   if (!dev->use_worker)
>> +   return -EINVAL;
>> +
>> +   if (vhost_vq_get_backend(vq) || vq->kick)
> 
> It might be worthwhile to have a comment to explain why we need to
> check vq->kick here.
> 
> This also means the device should not queue work when the backend is NULL.
> 
> But I found it is probably not the case for vsock, it calls
> vhost_poll_queue() in vhost_transport_cancel_pkt() but
> vhost_vsock_stop() doesn't wait before doing vhost_vq_set_backend(vq,
> NULL);
Yeah, there was another case where vhost_transport_send_pkt can call
vhost_work_queue before the backend is set.

I ended up doing the opt in method though. I did a test to convert vsock
and a worker for the recv queue and one for the xmit queue doesn't help.
It was like with vhost-scsi where with just 2 virtqueues, one worker could
handle the load. So I thought there is no point in adding extra code for
vsock if it wasn't going to be used.


> 
> Net seems to be fine since it waits for ubufs to be completed in
> vhost_net_se

Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set

2023-05-18 Thread Mike Christie

On 5/18/23 1:28 PM, Eric W. Biederman wrote:
> Still the big issue seems to be the way get_signal is connected into
> these threads so that it keeps getting called.  Calling get_signal after
> a fatal signal has been returned happens nowhere else and even if we fix
> it today it is likely to lead to bugs in the future because whoever is
> testing and updating the code is unlikely they have a vhost test case
> the care about.
> 
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 8f6330f0e9ca..4d54718cad36 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -181,7 +181,9 @@ void recalc_sigpending_and_wake(struct task_struct *t)
>  
>  void recalc_sigpending(void)
>  {
> -   if (!recalc_sigpending_tsk(current) && !freezing(current))
> +   if ((!recalc_sigpending_tsk(current) && !freezing(current)) ||
> +   ((current->signal->flags & SIGNAL_GROUP_EXIT) &&
> +   !__fatal_signal_pending(current)))
> clear_thread_flag(TIF_SIGPENDING);
>  
>  }
> @@ -1043,6 +1045,13 @@ static void complete_signal(int sig, struct 
> task_struct *p, enum pid_type type)
>  * This signal will be fatal to the whole group.
>  */
> if (!sig_kernel_coredump(sig)) {
> +   /*
> +* The signal is being short circuit delivered
> +* don't it pending.
> +*/
> +   if (type != PIDTYPE_PID) {
> +   sigdelset(&t->signal->shared_pending,  sig);
> +
> /*
>  * Start a group exit and wake everybody up.
>  * This way we don't have other threads
> 

If I change up your patch so the last part is moved down a bit to when we set t
like this:

diff --git a/kernel/signal.c b/kernel/signal.c
index 0ac48c96ab04..c976a80650db 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -181,9 +181,10 @@ void recalc_sigpending_and_wake(struct task_struct *t)
 
 void recalc_sigpending(void)
 {
-   if (!recalc_sigpending_tsk(current) && !freezing(current))
+   if ((!recalc_sigpending_tsk(current) && !freezing(current)) ||
+   ((current->signal->flags & SIGNAL_GROUP_EXIT) &&
+!__fatal_signal_pending(current)))
clear_thread_flag(TIF_SIGPENDING);
-
 }
 EXPORT_SYMBOL(recalc_sigpending);
 
@@ -1053,6 +1054,17 @@ static void complete_signal(int sig, struct task_struct 
*p, enum pid_type type)
signal->group_exit_code = sig;
signal->group_stop_count = 0;
t = p;
+   /*
+* The signal is being short circuit delivered
+* don't it pending.
+*/
+   if (type != PIDTYPE_PID) {
+   struct sigpending *pending;
+
+   pending = &t->signal->shared_pending;
+   sigdelset(&pending->signal, sig);
+   }
+
do {
task_clear_jobctl_pending(t, 
JOBCTL_PENDING_MASK);
sigaddset(&t->pending.signal, SIGKILL);


Then get_signal() works like how Oleg mentioned it should earlier.

For vhost I just need the code below which is just Linus's patch plus a call
to get_signal() in vhost_worker() and the PF_IO_WORKER->PF_USER_WORKER change.

Note that when we get SIGKILL, the vhost file_operations->release function is 
called via

do_exit -> exit_files -> put_files_struct -> close_files

and so the vhost release function starts to flush IO and stop the worker/vhost
task. In vhost_worker() then we just handle those last completions for already
running IO. When  the vhost release function detects they are done it does
vhost_task_stop() and vhost_worker() returns and then vhost_task_fn() does 
do_exit().
So we don't return immediately when get_signal() returns non-zero.

So it works, but it sounds like you don't like vhost relying on the behavior,
and it's non standard to use get_signal() like we are. So I'm not sure how we
want to proceed.

Maybe the safest is to revert:

commit 6e890c5d5021ca7e69bbe203fde42447874d9a82
Author: Mike Christie 
Date:   Fri Mar 10 16:03:32 2023 -0600

vhost: use vhost_tasks for worker threads

and retry this for the next kernel when we can do proper testing and more
code review?


diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a92af08e7864..1ba9e068b2ab 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -349,8 +349,16 @@ static int vhost

Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set

2023-05-18 Thread Mike Christie

On 5/18/23 11:25 AM, Oleg Nesterov wrote:
> I too do not understand the 1st change in this patch ...
> 
> On 05/18, Mike Christie wrote:
>>
>> In the other patches we do:
>>
>> if (get_signal(ksig))
>>  start_exit_cleanup_by_stopping_newIO()
>>  flush running IO()
>>  exit()
>>
>> But to do the flush running IO() part of this I need to wait for it so
>> that's why I wanted to be able to dequeue the SIGKILL and clear the
>> TIF_SIGPENDING bit.
> 
> But get_signal() will do what you need, dequeue SIGKILL and clear SIGPENDING ?
> 
>   if ((signal->flags & SIGNAL_GROUP_EXIT) ||
>signal->group_exec_task) {
>   clear_siginfo(&ksig->info);
>   ksig->info.si_signo = signr = SIGKILL;
>   sigdelset(¤t->pending.signal, SIGKILL);
> 
> this "dequeues" SIGKILL,
> 
>   trace_signal_deliver(SIGKILL, SEND_SIG_NOINFO,
>   &sighand->action[SIGKILL - 1]);
>   recalc_sigpending();
> 
> this clears TIF_SIGPENDING.
> 

I see what you guys meant. TIF_SIGPENDING isn't getting cleared.
I'll dig into why.

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set

2023-05-18 Thread Mike Christie

On 5/18/23 3:08 AM, Christian Brauner wrote:
> On Wed, May 17, 2023 at 07:09:13PM -0500, Mike Christie wrote:
>> This has us deqeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is
>> set when we are dealing with PF_USER_WORKER tasks.
>>
>> When a vhost_task gets a SIGKILL, we could have outstanding IO in flight.
>> We can easily stop new work/IO from being queued to the vhost_task, but
>> for IO that's already been sent to something like the block layer we
>> need to wait for the response then process it. These type of IO
>> completions use the vhost_task to process the completion so we can't
>> exit immediately.
>>
>> We need to handle wait for then handle those completions from the
>> vhost_task, but when we have a SIGKLL pending, functions like
>> schedule() return immediately so we can't wait like normal. Functions
>> like vhost_worker() degrade to just a while(1); loop.
>>
>> This patch has get_signal drop down to the normal code path when
>> SIGNAL_GROUP_EXIT/group_exec_task is set so the caller can still detect
>> there is a SIGKILL but still perform some blocking cleanup.
>>
>> Note that in that chunk I'm now bypassing that does:
>>
>> sigdelset(¤t->pending.signal, SIGKILL);
>>
>> we look to be ok, because in the places we set SIGNAL_GROUP_EXIT/
>> group_exec_task we are already doing that on the threads in the
>> group.
>>
>> Signed-off-by: Mike Christie 
>> ---
> 
> I think you just got confused by the original discussion that was split
> into two separate threads:
> 
> (1) The discussion based on your original proposal to adjust the signal
> handling logic to accommodate vhost workers as they are right now.
> That's where Oleg jumped in.
> (2) My request - which you did in this series - of rewriting vhost
> workers to behave more like io_uring workers.
> 
> Both problems are orthogonal. The gist of my proposal is to avoid (1) by
> doing (2). So the only change that's needed is
> s/PF_IO_WORKER/PF_USER_WORKER/ which is pretty obvious as io_uring
> workers and vhost workers no almost fully collapse into the same
> concept.
> 
> So forget (1). If additional signal patches are needed as discussed in
> (1) then it must be because of a bug that would affect io_uring workers
> today.

I maybe didn't exactly misunderstand you. I did patch 1/8 to show issues I
hit when I'm doing 2-8. See my reply to Eric's question about what I'm
hitting and why the last part of the patch only did not work for me:

https://lore.kernel.org/lkml/20230518000920.191583-2-michael.chris...@oracle.com/T/#mc6286d1a42c79761248ba55f1dd7a433379be6d1
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set

2023-05-18 Thread Mike Christie

On 5/17/23 10:49 PM, Eric W. Biederman wrote:
> 
> Long story short.
> 
> In the patch below the first hunk is a noop.
> 
> The code you are bypassing was added to ensure that process termination
> (aka SIGKILL) is processed before any other signals.  Other than signal
> processing order there are not any substantive differences in the two
> code paths.  With all signals except SIGSTOP == 19 and SIGKILL == 9
> blocked SIGKILL should always be processed before SIGSTOP.
> 
> Can you try patch with just the last hunk that does
> s/PF_IO_WORKER/PF_USER_WORKER/ and see if that is enough?
> 

If I just have the last hunk and then we get SIGKILL what happens is
in code like:

vhost_worker()

schedule()
if (has IO)
handle_IO()

The schedule() calls will hit the signal_pending_state check for
signal_pending or __fatal_signal_pending and so instead of waiting
for whatever wake_up call we normally waited for we tend to just
return immediately. If you just run Qemu (the parent of the vhost_task)
and send SIGKILL then sometimes the vhost_task just spins and it
would look like the task has taken over the CPU (this is what I hit
when I tested Linus's patch).

With the first hunk of the patch, we will end up dequeuing the SIGKILL
and clearing TIF_SIGPENDING, so the vhost_task can still do some work
before it exits.

In the other patches we do:

if (get_signal(ksig))
start_exit_cleanup_by_stopping_newIO()
flush running IO()
exit()

But to do the flush running IO() part of this I need to wait for it so
that's why I wanted to be able to dequeue the SIGKILL and clear the
TIF_SIGPENDING bit.

Or I don't need this specifically. In patch 0/8 I said I knew you guys
would not like it :) If I just have a:

if (fatal_signal())
clear_fatal_signal()

then it would work for me.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [RFC PATCH 5/8] vhost: Add callback that stops new work and waits on running ones

2023-05-18 Thread Mike Christie

On 5/18/23 9:18 AM, Christian Brauner wrote:
>> @@ -352,12 +353,13 @@ static int vhost_worker(void *data)
>>  if (!node) {
>>  schedule();
>>  /*
>> - * When we get a SIGKILL our release function will
>> - * be called. That will stop new IOs from being queued
>> - * and check for outstanding cmd responses. It will then
>> - * call vhost_task_stop to exit us.
>> + * When we get a SIGKILL we kick off a work to
>> + * run the driver's helper to stop new work and
>> + * handle completions. When they are done they will
>> + * call vhost_task_stop to tell us to exit.
>>   */
>> -vhost_task_get_signal();
>> +if (vhost_task_get_signal())
>> +schedule_work(&dev->destroy_worker);
>>  }
> 
> I'm pretty sure you still need to actually call exit here. Basically
> mirror what's done in io_worker_exit() minus the io specific bits.

We do call do_exit(). Once destory_worker has flushed the device and
all outstanding IO has completed it call vhost_task_stop(). vhost_worker()
above then breaks out of the loop and returns and vhost_task_fn() does
do_exit().
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [RFC PATCH 8/8] fork/vhost_task: remove no_files

2023-05-17 Thread Mike Christie

On 5/17/23 7:09 PM, Mike Christie wrote:
> +   CLONE_THREAD | CLONE_FILES, CLONE_SIGHAND,

Sorry. I tried to throw this one in last second so we could see that
we can also see that we can now use CLONE_FILES like io_uring.
It will of course not compile.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [RFC PATCH 2/8] vhost/vhost_task: Hook vhost layer into signal handler

2023-05-17 Thread Mike Christie

On 5/17/23 7:16 PM, Linus Torvalds wrote:
> On Wed, May 17, 2023 at 5:09 PM Mike Christie
>  wrote:
>>
>> +   __set_current_state(TASK_RUNNING);
>> +   rc = get_signal(&ksig);
>> +   set_current_state(TASK_INTERRUPTIBLE);
>> +   return rc;
> 
> The games with current_state seem nonsensical.
> 
> What are they all about? get_signal() shouldn't care, and no other
> caller does this thing. This just seems completely random.

Sorry. It's a leftover.

I was originally calling this from vhost_task_should_stop where before
calling that function we do a:

set_current_state(TASK_INTERRUPTIBLE);

So, I was hitting get_signal->try_to_freeze->might_sleep->__might_sleep
and was getting the "do not call blocking ops when !TASK_RUNNING"
warnings.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[RFC PATCH 8/8] fork/vhost_task: remove no_files

2023-05-17 Thread Mike Christie

The vhost_task can now support the worker being freed from under the
device when we get a SIGKILL or the process exits without closing
devices. We no longer need no_files so this removes it.

Signed-off-by: Mike Christie 
---
 include/linux/sched/task.h |  1 -
 kernel/fork.c  | 10 ++
 kernel/vhost_task.c|  3 +--
 3 files changed, 3 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 249a5ece9def..342fe297ffd4 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -28,7 +28,6 @@ struct kernel_clone_args {
u32 kthread:1;
u32 io_thread:1;
u32 user_worker:1;
-   u32 no_files:1;
u32 block_signals:1;
unsigned long stack;
unsigned long stack_size;
diff --git a/kernel/fork.c b/kernel/fork.c
index 9e04ab5c3946..f2c081c15efb 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1769,8 +1769,7 @@ static int copy_fs(unsigned long clone_flags, struct 
task_struct *tsk)
return 0;
 }
 
-static int copy_files(unsigned long clone_flags, struct task_struct *tsk,
- int no_files)
+static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
 {
struct files_struct *oldf, *newf;
int error = 0;
@@ -1782,11 +1781,6 @@ static int copy_files(unsigned long clone_flags, struct 
task_struct *tsk,
if (!oldf)
goto out;
 
-   if (no_files) {
-   tsk->files = NULL;
-   goto out;
-   }
-
if (clone_flags & CLONE_FILES) {
atomic_inc(&oldf->count);
goto out;
@@ -2488,7 +2482,7 @@ __latent_entropy struct task_struct *copy_process(
retval = copy_semundo(clone_flags, p);
if (retval)
goto bad_fork_cleanup_security;
-   retval = copy_files(clone_flags, p, args->no_files);
+   retval = copy_files(clone_flags, p);
if (retval)
goto bad_fork_cleanup_semundo;
retval = copy_fs(clone_flags, p);
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index a11f036290cc..642047765190 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -96,12 +96,11 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), 
void *arg,
 {
struct kernel_clone_args args = {
.flags  = CLONE_FS | CLONE_UNTRACED | CLONE_VM |
- CLONE_THREAD | CLONE_SIGHAND,
+ CLONE_THREAD | CLONE_FILES, CLONE_SIGHAND,
.exit_signal= 0,
.fn = vhost_task_fn,
.name   = name,
.user_worker= 1,
-   .no_files   = 1,
.block_signals  = 1,
};
struct vhost_task *vtsk;
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[RFC PATCH 6/8] vhost-scsi: Add callback to stop and wait on works

2023-05-17 Thread Mike Christie

This moves the scsi code we use to stop new works from being queued
and wait on running works to a helper which is used by the vhost layer
when the vhost_task is being killed by a SIGKILL.

Signed-off-by: Mike Christie 
---
 drivers/vhost/scsi.c | 23 +++
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 40f9135e1a62..a0f2588270f2 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1768,6 +1768,19 @@ static int vhost_scsi_set_features(struct vhost_scsi 
*vs, u64 features)
return 0;
 }
 
+static void vhost_scsi_stop_dev_work(struct vhost_dev *dev)
+{
+   struct vhost_scsi *vs = container_of(dev, struct vhost_scsi, dev);
+   struct vhost_scsi_target t;
+
+   mutex_lock(&vs->dev.mutex);
+   memcpy(t.vhost_wwpn, vs->vs_vhost_wwpn, sizeof(t.vhost_wwpn));
+   mutex_unlock(&vs->dev.mutex);
+   vhost_scsi_clear_endpoint(vs, &t);
+   vhost_dev_stop(&vs->dev);
+   vhost_dev_cleanup(&vs->dev);
+}
+
 static int vhost_scsi_open(struct inode *inode, struct file *f)
 {
struct vhost_scsi *vs;
@@ -1821,7 +1834,7 @@ static int vhost_scsi_open(struct inode *inode, struct 
file *f)
vs->vqs[i].vq.handle_kick = vhost_scsi_handle_kick;
}
vhost_dev_init(&vs->dev, vqs, nvqs, UIO_MAXIOV, VHOST_SCSI_WEIGHT, 0,
-  true, NULL, NULL);
+  true, NULL, vhost_scsi_stop_dev_work);
 
vhost_scsi_init_inflight(vs, NULL);
 
@@ -1843,14 +1856,8 @@ static int vhost_scsi_open(struct inode *inode, struct 
file *f)
 static int vhost_scsi_release(struct inode *inode, struct file *f)
 {
struct vhost_scsi *vs = f->private_data;
-   struct vhost_scsi_target t;
 
-   mutex_lock(&vs->dev.mutex);
-   memcpy(t.vhost_wwpn, vs->vs_vhost_wwpn, sizeof(t.vhost_wwpn));
-   mutex_unlock(&vs->dev.mutex);
-   vhost_scsi_clear_endpoint(vs, &t);
-   vhost_dev_stop(&vs->dev);
-   vhost_dev_cleanup(&vs->dev);
+   vhost_dev_stop_work(&vs->dev);
kfree(vs->dev.vqs);
kfree(vs->vqs);
kfree(vs->old_inflight);
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[RFC PATCH 7/8] vhost-net: Add callback to stop and wait on works

2023-05-17 Thread Mike Christie

This moves the net code we use to stop new works from being queued
and wait on running works to a helper which is used by the vhost layer
when the vhost_task is being killed by a SIGKILL.

Signed-off-by: Mike Christie 
---
 drivers/vhost/net.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 90c25127b3f8..f8a5527b15ba 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1325,9 +1325,9 @@ static void vhost_net_flush(struct vhost_net *n)
}
 }
 
-static int vhost_net_release(struct inode *inode, struct file *f)
+static void vhost_net_stop_dev_work(struct vhost_dev *dev)
 {
-   struct vhost_net *n = f->private_data;
+   struct vhost_net *n = container_of(dev, struct vhost_net, dev);
struct socket *tx_sock;
struct socket *rx_sock;
 
@@ -1345,6 +1345,13 @@ static int vhost_net_release(struct inode *inode, struct 
file *f)
/* We do an extra flush before freeing memory,
 * since jobs can re-queue themselves. */
vhost_net_flush(n);
+}
+
+static int vhost_net_release(struct inode *inode, struct file *f)
+{
+   struct vhost_net *n = f->private_data;
+
+   vhost_dev_stop_work(&n->dev);
kfree(n->vqs[VHOST_NET_VQ_RX].rxq.queue);
kfree(n->vqs[VHOST_NET_VQ_TX].xdp);
kfree(n->dev.vqs);
@@ -1409,7 +1416,7 @@ static int vhost_net_open(struct inode *inode, struct 
file *f)
vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,
   UIO_MAXIOV + VHOST_NET_BATCH,
   VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,
-  NULL, NULL);
+  NULL, vhost_net_stop_dev_work);
 
vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, 
dev);
vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev);
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[RFC PATCH 5/8] vhost: Add callback that stops new work and waits on running ones

2023-05-17 Thread Mike Christie

When the vhost_task gets a SIGKILL we want to stop new work from being
queued and also wait for and handle completions for running work. For the
latter, we still need to use the vhost_task to handle the completing work
so we can't just exit right away. But, this has us kick off the stopping
and flushing/stopping of the device/vhost_task/worker to the system work
queue while the vhost_task handles completions. When all completions are
done we will then do vhost_task_stop and we will exit.

Signed-off-by: Mike Christie 
---
 drivers/vhost/net.c   |  2 +-
 drivers/vhost/scsi.c  |  4 ++--
 drivers/vhost/test.c  |  3 ++-
 drivers/vhost/vdpa.c  |  2 +-
 drivers/vhost/vhost.c | 48 ---
 drivers/vhost/vhost.h | 10 -
 drivers/vhost/vsock.c |  4 ++--
 7 files changed, 58 insertions(+), 15 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 8557072ff05e..90c25127b3f8 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1409,7 +1409,7 @@ static int vhost_net_open(struct inode *inode, struct 
file *f)
vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,
   UIO_MAXIOV + VHOST_NET_BATCH,
   VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,
-  NULL);
+  NULL, NULL);
 
vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, 
dev);
vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev);
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index bb10fa4bb4f6..40f9135e1a62 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -1820,8 +1820,8 @@ static int vhost_scsi_open(struct inode *inode, struct 
file *f)
vqs[i] = &vs->vqs[i].vq;
vs->vqs[i].vq.handle_kick = vhost_scsi_handle_kick;
}
-   vhost_dev_init(&vs->dev, vqs, nvqs, UIO_MAXIOV,
-  VHOST_SCSI_WEIGHT, 0, true, NULL);
+   vhost_dev_init(&vs->dev, vqs, nvqs, UIO_MAXIOV, VHOST_SCSI_WEIGHT, 0,
+  true, NULL, NULL);
 
vhost_scsi_init_inflight(vs, NULL);
 
diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index 42c955a5b211..11a2823d7532 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -120,7 +120,8 @@ static int vhost_test_open(struct inode *inode, struct file 
*f)
vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ];
n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV,
-  VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT, true, NULL);
+  VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT, true, NULL,
+  NULL);
 
f->private_data = n;
 
diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 8c1aefc865f0..de9a83ecb70d 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -1279,7 +1279,7 @@ static int vhost_vdpa_open(struct inode *inode, struct 
file *filep)
vqs[i]->handle_kick = handle_vq_kick;
}
vhost_dev_init(dev, vqs, nvqs, 0, 0, 0, false,
-  vhost_vdpa_process_iotlb_msg);
+  vhost_vdpa_process_iotlb_msg, NULL);
 
r = vhost_vdpa_alloc_domain(v);
if (r)
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 1ba9e068b2ab..4163c86db50c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -336,6 +336,7 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 static int vhost_worker(void *data)
 {
struct vhost_worker *worker = data;
+   struct vhost_dev *dev = worker->dev;
struct vhost_work *work, *work_next;
struct llist_node *node;
 
@@ -352,12 +353,13 @@ static int vhost_worker(void *data)
if (!node) {
schedule();
/*
-* When we get a SIGKILL our release function will
-* be called. That will stop new IOs from being queued
-* and check for outstanding cmd responses. It will then
-* call vhost_task_stop to exit us.
+* When we get a SIGKILL we kick off a work to
+* run the driver's helper to stop new work and
+* handle completions. When they are done they will
+* call vhost_task_stop to tell us to exit.
 */
-   vhost_task_get_signal();
+   if (vhost_task_get_signal())
+   schedule_work(&dev->destroy_worker);
}
 
node = llist_reverse_order(node);
@@ -376,6 +378,33 @@ static int vhost_worker(void *data)
return 0;
 }
 
+static void __vhost_dev_stop_work(struct vhost_dev *dev)
+{
+   mutex_lock(&dev->stop_work_mutex);
+

[RFC PATCH 4/8] vhost-net: Move vhost_net_open

2023-05-17 Thread Mike Christie

This moves vhost_net_open so in the next patches we can pass
vhost_dev_init a new helper which will use the stop/flush functions.
There is no functionality changes in this patch.

Signed-off-by: Mike Christie 
---
 drivers/vhost/net.c | 134 ++--
 1 file changed, 67 insertions(+), 67 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 07181cd8d52e..8557072ff05e 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1285,73 +1285,6 @@ static void handle_rx_net(struct vhost_work *work)
handle_rx(net);
 }
 
-static int vhost_net_open(struct inode *inode, struct file *f)
-{
-   struct vhost_net *n;
-   struct vhost_dev *dev;
-   struct vhost_virtqueue **vqs;
-   void **queue;
-   struct xdp_buff *xdp;
-   int i;
-
-   n = kvmalloc(sizeof *n, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
-   if (!n)
-   return -ENOMEM;
-   vqs = kmalloc_array(VHOST_NET_VQ_MAX, sizeof(*vqs), GFP_KERNEL);
-   if (!vqs) {
-   kvfree(n);
-   return -ENOMEM;
-   }
-
-   queue = kmalloc_array(VHOST_NET_BATCH, sizeof(void *),
- GFP_KERNEL);
-   if (!queue) {
-   kfree(vqs);
-   kvfree(n);
-   return -ENOMEM;
-   }
-   n->vqs[VHOST_NET_VQ_RX].rxq.queue = queue;
-
-   xdp = kmalloc_array(VHOST_NET_BATCH, sizeof(*xdp), GFP_KERNEL);
-   if (!xdp) {
-   kfree(vqs);
-   kvfree(n);
-   kfree(queue);
-   return -ENOMEM;
-   }
-   n->vqs[VHOST_NET_VQ_TX].xdp = xdp;
-
-   dev = &n->dev;
-   vqs[VHOST_NET_VQ_TX] = &n->vqs[VHOST_NET_VQ_TX].vq;
-   vqs[VHOST_NET_VQ_RX] = &n->vqs[VHOST_NET_VQ_RX].vq;
-   n->vqs[VHOST_NET_VQ_TX].vq.handle_kick = handle_tx_kick;
-   n->vqs[VHOST_NET_VQ_RX].vq.handle_kick = handle_rx_kick;
-   for (i = 0; i < VHOST_NET_VQ_MAX; i++) {
-   n->vqs[i].ubufs = NULL;
-   n->vqs[i].ubuf_info = NULL;
-   n->vqs[i].upend_idx = 0;
-   n->vqs[i].done_idx = 0;
-   n->vqs[i].batched_xdp = 0;
-   n->vqs[i].vhost_hlen = 0;
-   n->vqs[i].sock_hlen = 0;
-   n->vqs[i].rx_ring = NULL;
-   vhost_net_buf_init(&n->vqs[i].rxq);
-   }
-   vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,
-  UIO_MAXIOV + VHOST_NET_BATCH,
-  VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,
-  NULL);
-
-   vhost_poll_init(n->poll + VHOST_NET_VQ_TX, handle_tx_net, EPOLLOUT, 
dev);
-   vhost_poll_init(n->poll + VHOST_NET_VQ_RX, handle_rx_net, EPOLLIN, dev);
-
-   f->private_data = n;
-   n->page_frag.page = NULL;
-   n->refcnt_bias = 0;
-
-   return 0;
-}
-
 static struct socket *vhost_net_stop_vq(struct vhost_net *n,
struct vhost_virtqueue *vq)
 {
@@ -1421,6 +1354,73 @@ static int vhost_net_release(struct inode *inode, struct 
file *f)
return 0;
 }
 
+static int vhost_net_open(struct inode *inode, struct file *f)
+{
+   struct vhost_net *n;
+   struct vhost_dev *dev;
+   struct vhost_virtqueue **vqs;
+   void **queue;
+   struct xdp_buff *xdp;
+   int i;
+
+   n = kvmalloc(sizeof *n, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
+   if (!n)
+   return -ENOMEM;
+   vqs = kmalloc_array(VHOST_NET_VQ_MAX, sizeof(*vqs), GFP_KERNEL);
+   if (!vqs) {
+   kvfree(n);
+   return -ENOMEM;
+   }
+
+   queue = kmalloc_array(VHOST_NET_BATCH, sizeof(void *),
+ GFP_KERNEL);
+   if (!queue) {
+   kfree(vqs);
+   kvfree(n);
+   return -ENOMEM;
+   }
+   n->vqs[VHOST_NET_VQ_RX].rxq.queue = queue;
+
+   xdp = kmalloc_array(VHOST_NET_BATCH, sizeof(*xdp), GFP_KERNEL);
+   if (!xdp) {
+   kfree(vqs);
+   kvfree(n);
+   kfree(queue);
+   return -ENOMEM;
+   }
+   n->vqs[VHOST_NET_VQ_TX].xdp = xdp;
+
+   dev = &n->dev;
+   vqs[VHOST_NET_VQ_TX] = &n->vqs[VHOST_NET_VQ_TX].vq;
+   vqs[VHOST_NET_VQ_RX] = &n->vqs[VHOST_NET_VQ_RX].vq;
+   n->vqs[VHOST_NET_VQ_TX].vq.handle_kick = handle_tx_kick;
+   n->vqs[VHOST_NET_VQ_RX].vq.handle_kick = handle_rx_kick;
+   for (i = 0; i < VHOST_NET_VQ_MAX; i++) {
+   n->vqs[i].ubufs = NULL;
+   n->vqs[i].ubuf_info = NULL;
+   n->vqs[i].upend_idx = 0;
+   n->vqs[i].done_idx = 0;
+   n->vqs[i].batched_xdp = 0;
+   n->vqs[i].vhost_hlen = 0;
+   n->vqs[i].sock_hlen = 0;
+   n->vqs[i].rx_ring = NULL;
+   vhost_

[RFC PATCH 1/8] signal: Dequeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is set

2023-05-17 Thread Mike Christie

This has us deqeue SIGKILL even if SIGNAL_GROUP_EXIT/group_exec_task is
set when we are dealing with PF_USER_WORKER tasks.

When a vhost_task gets a SIGKILL, we could have outstanding IO in flight.
We can easily stop new work/IO from being queued to the vhost_task, but
for IO that's already been sent to something like the block layer we
need to wait for the response then process it. These type of IO
completions use the vhost_task to process the completion so we can't
exit immediately.

We need to handle wait for then handle those completions from the
vhost_task, but when we have a SIGKLL pending, functions like
schedule() return immediately so we can't wait like normal. Functions
like vhost_worker() degrade to just a while(1); loop.

This patch has get_signal drop down to the normal code path when
SIGNAL_GROUP_EXIT/group_exec_task is set so the caller can still detect
there is a SIGKILL but still perform some blocking cleanup.

Note that in that chunk I'm now bypassing that does:

sigdelset(¤t->pending.signal, SIGKILL);

we look to be ok, because in the places we set SIGNAL_GROUP_EXIT/
group_exec_task we are already doing that on the threads in the
group.

Signed-off-by: Mike Christie 
---
 kernel/signal.c | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/kernel/signal.c b/kernel/signal.c
index 8f6330f0e9ca..ae4972eea5db 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2705,9 +2705,18 @@ bool get_signal(struct ksignal *ksig)
struct k_sigaction *ka;
enum pid_type type;
 
-   /* Has this task already been marked for death? */
-   if ((signal->flags & SIGNAL_GROUP_EXIT) ||
-signal->group_exec_task) {
+   /*
+* Has this task already been marked for death?
+*
+* If this is a PF_USER_WORKER then the task may need to do
+* extra work that requires waiting on running work, so we want
+* to dequeue the signal below and tell the caller its time to
+* start its exit procedure. When the work has completed then
+* the task will exit.
+*/
+   if (!(current->flags & PF_USER_WORKER) &&
+   ((signal->flags & SIGNAL_GROUP_EXIT) ||
+signal->group_exec_task)) {
clear_siginfo(&ksig->info);
ksig->info.si_signo = signr = SIGKILL;
sigdelset(¤t->pending.signal, SIGKILL);
@@ -2861,11 +2870,11 @@ bool get_signal(struct ksignal *ksig)
}
 
/*
-* PF_IO_WORKER threads will catch and exit on fatal signals
+* PF_USER_WORKER threads will catch and exit on fatal signals
 * themselves. They have cleanup that must be performed, so
 * we cannot call do_exit() on their behalf.
 */
-   if (current->flags & PF_IO_WORKER)
+   if (current->flags & PF_USER_WORKER)
goto out;
 
/*
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[RFC PATCH 2/8] vhost/vhost_task: Hook vhost layer into signal handler

2023-05-17 Thread Mike Christie

This patch has vhost use get_signal to handle freezing and sort of
handle signals. By the latter I mean that when we get SIGKILL, our
parent will exit and call our file_operatons release function. That will
then stop new work from breing queued and wait for the vhost_task to
handle completions for running IO. We then exit when those are done.

The next patches will then have us work more like io_uring where
we handle the get_signal return value and key off that to cleanup.

Signed-off-by: Mike Christie 
---
 drivers/vhost/vhost.c| 10 +-
 include/linux/sched/vhost_task.h |  1 +
 kernel/vhost_task.c  | 20 
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a92af08e7864..1ba9e068b2ab 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -349,8 +349,16 @@ static int vhost_worker(void *data)
}
 
node = llist_del_all(&worker->work_list);
-   if (!node)
+   if (!node) {
schedule();
+   /*
+* When we get a SIGKILL our release function will
+* be called. That will stop new IOs from being queued
+* and check for outstanding cmd responses. It will then
+* call vhost_task_stop to exit us.
+*/
+   vhost_task_get_signal();
+   }
 
node = llist_reverse_order(node);
/* make sure flag is seen after deletion */
diff --git a/include/linux/sched/vhost_task.h b/include/linux/sched/vhost_task.h
index 6123c10b99cf..54b68115eb3b 100644
--- a/include/linux/sched/vhost_task.h
+++ b/include/linux/sched/vhost_task.h
@@ -19,5 +19,6 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), void 
*arg,
 void vhost_task_start(struct vhost_task *vtsk);
 void vhost_task_stop(struct vhost_task *vtsk);
 bool vhost_task_should_stop(struct vhost_task *vtsk);
+bool vhost_task_get_signal(void);
 
 #endif
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index b7cbd66f889e..a661cfa32ba3 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -61,6 +61,26 @@ bool vhost_task_should_stop(struct vhost_task *vtsk)
 }
 EXPORT_SYMBOL_GPL(vhost_task_should_stop);
 
+/**
+ * vhost_task_get_signal - Check if there are pending signals
+ *
+ * Return true if we got SIGKILL.
+ */
+bool vhost_task_get_signal(void)
+{
+   struct ksignal ksig;
+   bool rc;
+
+   if (!signal_pending(current))
+   return false;
+
+   __set_current_state(TASK_RUNNING);
+   rc = get_signal(&ksig);
+   set_current_state(TASK_INTERRUPTIBLE);
+   return rc;
+}
+EXPORT_SYMBOL_GPL(vhost_task_get_signal);
+
 /**
  * vhost_task_create - create a copy of a process to be used by the kernel
  * @fn: thread stack
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[RFC PATCH 3/8] fork/vhost_task: Switch to CLONE_THREAD and CLONE_SIGHAND

2023-05-17 Thread Mike Christie

This is a modified version of Linus's patch which has vhost_task
use CLONE_THREAD and CLONE_SIGHAND and allow SIGKILL and SIGSTOP.

I renamed the ignore_signals to block_signals based on Linus's comment
where it aligns with what we are doing with the siginitsetinv
p->blocked use and no longer calling ignore_signals.

Signed-off-by: Mike Christie 
---
 include/linux/sched/task.h |  2 +-
 kernel/fork.c  | 12 +++-
 kernel/vhost_task.c|  5 +++--
 3 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 537cbf9a2ade..249a5ece9def 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -29,7 +29,7 @@ struct kernel_clone_args {
u32 io_thread:1;
u32 user_worker:1;
u32 no_files:1;
-   u32 ignore_signals:1;
+   u32 block_signals:1;
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
diff --git a/kernel/fork.c b/kernel/fork.c
index ed4e01daccaa..9e04ab5c3946 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2338,14 +2338,10 @@ __latent_entropy struct task_struct *copy_process(
p->flags |= PF_KTHREAD;
if (args->user_worker)
p->flags |= PF_USER_WORKER;
-   if (args->io_thread) {
-   /*
-* Mark us an IO worker, and block any signal that isn't
-* fatal or STOP
-*/
+   if (args->io_thread)
p->flags |= PF_IO_WORKER;
+   if (args->block_signals)
siginitsetinv(&p->blocked, sigmask(SIGKILL)|sigmask(SIGSTOP));
-   }
 
if (args->name)
strscpy_pad(p->comm, args->name, sizeof(p->comm));
@@ -2517,9 +2513,6 @@ __latent_entropy struct task_struct *copy_process(
if (retval)
goto bad_fork_cleanup_io;
 
-   if (args->ignore_signals)
-   ignore_signals(p);
-
stackleak_task_init(p);
 
if (pid != &init_struct_pid) {
@@ -2861,6 +2854,7 @@ struct task_struct *create_io_thread(int (*fn)(void *), 
void *arg, int node)
.fn_arg = arg,
.io_thread  = 1,
.user_worker= 1,
+   .block_signals  = 1,
};
 
return copy_process(NULL, 0, node, &args);
diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index a661cfa32ba3..a11f036290cc 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -95,13 +95,14 @@ struct vhost_task *vhost_task_create(int (*fn)(void *), 
void *arg,
 const char *name)
 {
struct kernel_clone_args args = {
-   .flags  = CLONE_FS | CLONE_UNTRACED | CLONE_VM,
+   .flags  = CLONE_FS | CLONE_UNTRACED | CLONE_VM |
+ CLONE_THREAD | CLONE_SIGHAND,
.exit_signal= 0,
.fn = vhost_task_fn,
.name   = name,
.user_worker= 1,
.no_files   = 1,
-   .ignore_signals = 1,
+   .block_signals  = 1,
};
struct vhost_task *vtsk;
struct task_struct *tsk;
-- 
2.25.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

1 2 3 4 5 >

1 - 100 of 409 matches

Mail list logo