Re: [PATCH v6 11/11] vhost: allow userspace to create workers

2023-04-14 Thread Mike Christie
On 4/13/23 9:26 PM, Jason Wang wrote:
> On Fri, Apr 14, 2023 at 6:36 AM Mike Christie wrote:
>>
>> On 4/12/23 2:56 AM, Jason Wang wrote:
 I can spin another patchset with the single ioctl design so we can compare.
>>> So I'm fine with this approach. One last question, I see this:
>>>
>>> /* By default, a device gets one vhost_worker that its virtqueues share. This */
>>>
>>> I'm wondering if it is better to have a vhost_get_worker() to return
>>> the worker_id of a specific queue. In the future, this might be useful
>>> for devices that have multiple workers by default?
>>
>> Yeah, it would be useful. Just 2 questions. You mean an ioctl command to get
>> this info right? I had added one in one version a while back, but dropped it
>> because for some reason I thought we were going to do a generic vhost sysfs
>> type of interface. We are not doing sysfs right?
> 
> Probably, but we need to make sure that emulatorpin can still work for libvirt.
> 

Yeah, it works. You had mentioned that to me a couple of years ago, and I use commands like

virsh emulatorpin

in testing.
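For example, something along these lines (the domain name is just a placeholder):

virsh emulatorpin vm1              # show the current emulator-thread pinning
virsh emulatorpin vm1 2-3 --live   # pin the running guest's emulator threads to CPUs 2-3

and then I check that the vhost worker threads end up restricted to the expected CPUs.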


Re: [PATCH v6 11/11] vhost: allow userspace to create workers

2023-04-13 Thread Jason Wang
On Fri, Apr 14, 2023 at 6:36 AM Mike Christie wrote:
>
> On 4/12/23 2:56 AM, Jason Wang wrote:
> >> I can spin another patchset with the single ioctl design so we can compare.
> > So I'm fine with this approach. One last question, I see this:
> >
> > /* By default, a device gets one vhost_worker that its virtqueues share. This */
> >
> > I'm wondering if it is better to have a vhost_get_worker() to return
> > the worker_id of a specific queue. In the future, this might be useful
> > for devices that have multiple workers by default?
>
> Yeah, it would be useful. Just 2 questions. You mean an ioctl command to get
> this info right? I had added one in one version a while back, but dropped it
> because for some reason I thought we were going to do a generic vhost sysfs
> type of interface. We are not doing sysfs right?

Probably, but we need to make sure that emulatorpin can still work for libvirt.

Thanks

>


Re: [PATCH v6 11/11] vhost: allow userspace to create workers

2023-04-13 Thread Mike Christie
On 4/12/23 2:56 AM, Jason Wang wrote:
>> I can spin another patchset with the single ioctl design so we can compare.
> So I'm fine with this approach. One last question, I see this:
> 
> /* By default, a device gets one vhost_worker that its virtqueues share. This */
> 
> I'm wondering if it is better to have a vhost_get_worker() to return
> the worker_id of a specific queue. In the future, this might be useful
> for devices that have multiple workers by default?

Yeah, it would be useful. Just 2 questions. You mean an ioctl command to get
this info right? I had added one in one version a while back, but dropped it
because for some reason I thought we were going to do a generic vhost sysfs
type of interface. We are not doing sysfs right?
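To make sure we mean the same thing, something roughly like the below is what I would add. This is only illustrative; the ioctl number and the field names are made up here and are not part of this patchset:

/* Hypothetical read-side companion to VHOST_ATTACH_VRING_WORKER: userspace
 * issues it against a vring and the kernel fills in the id of the worker
 * currently attached to that vring.
 */
#define VHOST_GET_VRING_WORKER _IOWR(VHOST_VIRTIO, 0x16, struct vhost_vring_worker)

struct vhost_vring_worker w = { .index = 0 };   /* field names assumed for the sketch */

if (ioctl(vhost_fd, VHOST_GET_VRING_WORKER, &w) == 0)
	printf("vq %u -> worker %d\n", w.index, w.worker_id);

Userspace could then pass the returned worker_id to VHOST_ATTACH_VRING_WORKER or VHOST_FREE_WORKER later.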


Re: [PATCH v6 11/11] vhost: allow userspace to create workers

2023-04-12 Thread Jason Wang
On Wed, Apr 12, 2023 at 6:15 AM Mike Christie wrote:
>
> On 4/10/23 10:00 PM, Jason Wang wrote:
> >>> vhost_zerocopy_callback(). But since you want to limit the call before
> >>> set_backend, another question comes, do we really need the dynamic
> >>> attaching/creating in this case? How about a simple one ioctl that is
> >>> used to build the queue to workers mapping instead?
> >>
> >>
> >> I didn't think we need the dynamic case. It was from a review comment.
> >
> > Right, so we actually don't need three new ioctls but only a single is
> > sufficient?
> >
> > struct vhost_worker_state {
> >   __u16 workers;
> >   __u16 queue_to_work_map[];
> > };
> >
> > And limiting this to be called before datapath can run is sufficient?
> > (sorry I missed some of the previous comments).
>
> It's been like 3 years since this was last discussed so no problem :)
>

I'm really sorry for that, -ENOMEM :(

> Yeah, what you describe is all I need. Originally I just had the one
> ioctl:
>
> +#define VHOST_SET_VRING_WORKER _IOWR(VHOST_VIRTIO, 0x15, struct vhost_vring_worker)
>
> The VHOST_SET_VRING_WORKER created a worker on the virtqueue in the
> vhost_vring_worker.
>
>
> 
> >> -   default:
> >> +   case VHOST_ATTACH_VRING_WORKER:
> >> +   case VHOST_ATTACH_VRING_WORKER:
> >> +   if (copy_from_user(&w, argp, sizeof(w))) {
> >> +   r = -EFAULT;
> >> +   break;
> >> +   }
> >> +   r = vhost_vq_attach_worker(vq, &w);
> >> +   if (!r && copy_to_user(argp, &w, sizeof(w)))
> >> +   r = -EFAULT;
> >> +   break;
> >
> > It's a little odd that we have new and attach but only a free.
> 
>  Do you mean we would also want a detach? I've been debating that.
>  I'm not sure what the user wanted though.
> 
>  Would it leave the virtqueue with no worker? Or, does it pick the first
>  worker it finds? If there is no worker because we did the former or
>  because userspace detached all of them, then do we just drop work that
>  gets queued?
> 
>  It seemed like a user would never want to drop work, so I made it so
>  you can only tell it to use new workers (attach which I guess is more
>  like a swap worker)
> >>>
> >>> swap and free old worker indeed?
> >>
> >> I didn't understand the question.
> >
> > I mean if I understand the code correctly, the code tries to drop
> > refcnt and free the old worker during attach.
>
> I see. We actually don't free until VHOST_FREE_WORKER.
>
> When we create the worker from VHOST_NEW_WORKER we set the refcount
> to 1. Then each time a virtqueue and worker are attached to each other
> we increase the refcount.
>
> When you do vhost_vq_detach_worker then it drops the refcount from the
> attach. Then if you detached the worker from all the virtqueues you
> still have the refcount=1 from the VHOST_NEW_WORKER call.
>
> The refcount decrement in VHOST_FREE_WORKER via vhost_worker_put would
> be the final put. Note that if userspace just closes the dev without
> doing a put, then we do the final puts for it.

Right, I mis-read the code, you did

/*
 * We can free the worker if it's not attached to any virtqueues.
 */
if (refcount_read(&worker->refcount) != 1)
return -EBUSY;

And I thought it was a dec and test.

>
> Note that I originally didn't have the free. I don't need it for any
> specific reason. It was just from a review comment. I originally just had
> the one create worker type of ioctl. Then it was suggested to do the split
> attach/new/free design.

I see.

>
> I can spin another patchset with the single ioctl design so we can compare.

So I'm fine with this approach. One last question, I see this:

/* By default, a device gets one vhost_worker that its virtqueues share. This */

I'm wondering if it is better to have a vhost_get_worker() to return
the worker_id of a specific queue. In the future, this might be useful
for devices that have multiple workers by default?

Thanks

>


Re: [PATCH v6 11/11] vhost: allow userspace to create workers

2023-04-11 Thread Mike Christie
On 4/10/23 10:00 PM, Jason Wang wrote:
>>> vhost_zerocopy_callback(). But since you want to limit the call before
>>> set_backend, another question comes, do we really need the dynamic
>>> attaching/creating in this case? How about a simple one ioctl that is
>>> used to build the queue to workers mapping instead?
>>
>>
>> I didn't think we need the dynamic case. It was from a review comment.
> 
> Right, so we actually don't need three new ioctls but only a single is
> sufficient?
> 
> struct vhost_worker_state {
>   __u16 workers;
>   __u16 queue_to_work_map[];
> };
> 
> And limiting this to be called before datapath can run is sufficient?
> (sorry I missed some of the previous comments).

It's been like 3 years since this was last discussed so no problem :)

Yeah, what you describe is all I need. Originally I just had the one
ioctl:

+#define VHOST_SET_VRING_WORKER _IOWR(VHOST_VIRTIO, 0x15, struct vhost_vring_worker)

The VHOST_SET_VRING_WORKER created a worker on the virtqueue in the
vhost_vring_worker.



>> -   default:
>> +   case VHOST_ATTACH_VRING_WORKER:
>> +   if (copy_from_user(&w, argp, sizeof(w))) {
>> +   r = -EFAULT;
>> +   break;
>> +   }
>> +   r = vhost_vq_attach_worker(vq, &w);
>> +   if (!r && copy_to_user(argp, &w, sizeof(w)))
>> +   r = -EFAULT;
>> +   break;
>
> It's a little odd that we have new and attach but only a free.

 Do you mean we would also want a detach? I've been debating that.
 I'm not sure what the user wanted though.

 Would it leave the virtqueue with no worker? Or, does it pick the first
 worker it finds? If there is no worker because we did the former or
 because userspace detached all of them, then do we just drop work that
 gets queued?

 It seemed like a user would never want to drop work, so I made it so
 you can only tell it to use new workers (attach which I guess is more
 like a swap worker)
>>>
>>> swap and free old worker indeed?
>>
>> I didn't understand the question.
> 
> I mean if I understand the code correctly, the code tries to drop
> refcnt and free the old worker during attach.

I see. We actually don't free until VHOST_FREE_WORKER.

When we create the worker from VHOST_NEW_WORKER we set the refcount
to 1. Then each time a virtqueue and worker are attached to each other
we increase the refcount.

When you do vhost_vq_detach_worker then it drops the refcount from the
attach. Then if you detached the worker from all the virtqueues you
still have the refcount=1 from the VHOST_NEW_WORKER call.

The refcount decrement in VHOST_FREE_WORKER via vhost_worker_put would
be the final put. Note that if userspace just closes the dev without
doing a put, then we do the final puts for it.
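To put that lifecycle into a concrete example, the userspace flow looks roughly like this (a sketch only; the struct layouts and field names are assumptions, not necessarily the final UAPI):

/* vhost_fd: already-open vhost device fd; error handling omitted */
struct vhost_worker_state w = {};
struct vhost_vring_worker vw = {};

/* VHOST_NEW_WORKER creates the worker thread and takes the initial refcount of 1. */
ioctl(vhost_fd, VHOST_NEW_WORKER, &w);

/* Each attach takes another reference on the worker. */
vw.worker_id = w.worker_id;                        /* id returned by the kernel above */
ioctl(vhost_fd, VHOST_ATTACH_VRING_WORKER, &vw);   /* issued against a specific vring */

/* ... attach other vrings to the same worker, set the backend, run IO ... */

/* Drops the creation reference. If any vring still has the worker attached
 * (refcount != 1), the kernel refuses with -EBUSY instead of freeing it. */
ioctl(vhost_fd, VHOST_FREE_WORKER, &w);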

Note that I originally didn't have the free. I don't need it for any
specific reason. It was just from a review comment. I originally just had
the one create worker type of ioctl. Then it was suggested to do the split
attach/new/free design.

I can spin another patchset with the single ioctl design so we can compare.


Re: [PATCH v6 11/11] vhost: allow userspace to create workers

2023-04-10 Thread Jason Wang
On Tue, Apr 11, 2023 at 1:16 AM Mike Christie wrote:
>
> On 4/10/23 2:04 AM, Jason Wang wrote:
> > On Wed, Apr 5, 2023 at 7:08 AM Mike Christie wrote:
> >>
> >> On 4/4/23 3:00 AM, Jason Wang wrote:
> 
>  -static void vhost_worker_free(struct vhost_dev *dev)
>  +static void vhost_worker_put(struct vhost_dev *dev, struct vhost_worker *worker)
>   {
>  -   struct vhost_worker *worker = dev->worker;
>  -
>  if (!worker)
>  return;
> 
>  -   dev->worker = NULL;
>  +   if (!refcount_dec_and_test(&worker->refcount))
>  +   return;
>  +
>  WARN_ON(!llist_empty(&worker->work_list));
>  vhost_task_stop(worker->vtsk);
>  kfree(worker);
>   }
> 
>  +static void vhost_vq_detach_worker(struct vhost_virtqueue *vq)
>  +{
>  +   if (vq->worker)
> >>>
> >>> What happens to the pending work that queues for the old worker?
> >>
> >> I didn't think there would be works queued at this time. I see your comment
> >> below about my assumption about the backend being set being wrong. Will
> >> discuss down there.
> >>
> >>
> 
>  +/* Caller must have device and virtqueue mutex */
>  +static void __vhost_vq_attach_worker(struct vhost_virtqueue *vq,
>  +struct vhost_worker *worker)
>  +{
>  +   refcount_inc(&worker->refcount);
>  +   vhost_vq_detach_worker(vq);
>  +   vq->worker = worker;
> >>>
> >>> What happens if there's a pending flush in a specific device (e.g in
> >>> vhost_scsi_tmf_resp_work())?
> >>
> >> We wouldn't hit that specific case where we are running the above code and
> >> vhost_scsi_tmf_resp_work.
> >>
> >> Either:
> >>
> >> 1. The backend is NULL and has never been set, and so we would never have
> >> called vhost_scsi_tmf_resp_work, because it can only be called after the
> >> backend is set.
> >>
> >> 2. If the backend has been set and vhost_scsi_tmf_resp_work has
> >> run or is running, then we would not have called __vhost_vq_attach_worker
> >> from vhost_vq_attach_worker because it would see vhost_vq_get_backend
> >> returning a non-NULL value.
> >>
> >> If vhost_scsi later sets the backend to NULL, then 
> >> vhost_scsi_clear_endpoint
> >> will have made sure the flush has completed when the clear function 
> >> returns.
> >> It does that with the device mutex, so when we run __vhost_vq_attach_worker
> >> it will only see a vq/worker with no flushes in progress.
> >
> > Ok.
> >
> >>
> >> For the general case of whether we could be doing a vhost_dev_flush/vhost_work_flush_on
> >> and __vhost_vq_attach_worker at the same time, I thought we are ok as well because we
> >> currently have to hold the device mutex when we flush, so the flush can't race with
> >> ioctl calls to vhost_vq_attach_worker since we hold the dev mutex during those ioctls.
> >
> > I'm not sure I understand here, but we can't assume the user of
> > vhost_work_queue() is called in the ioctl context. See
>
> The comment above was only to answer the question about 
> __vhost_vq_attach_worker
> racing with works queued from a flush.
>
> In general, I thought we did something to stop IO with the vq mutex
> (clear the backend, stop polling, etc.) then do vhost_dev_flush under the
> dev mutex. So once the flush is done there would not be any new vhost_work_queue
> calls for the virtqueue.
>
> For the vhost_zerocopy_callback case, when can handle_tx_zerocopy start to be called?
> It looked like it's only called after the backend is set because handle_tx
> checks for a non-null backend.

You are right, again, what I want to say is, if the patch adds new
assumptions, we should document it.

>
>
>
>
> > vhost_zerocopy_callback(). But since you want to limit the call before
> > set_backend, another question comes, do we really need the dynamic
> > attaching/creating in this case? How about a simple one ioctl that is
> > used to build the queue to workers mapping instead?
>
>
> I didn't think we need the dynamic case. It was from a review comment.

Right, so we actually don't need three new ioctls; a single one is sufficient?

struct vhost_worker_state {
  __u16 workers;
  __u16 queue_to_work_map[];
};

And limiting this to be called before datapath can run is sufficient?
(sorry I missed some of the previous comments).
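In case it helps, this is roughly what I imagine on the userspace side (the ioctl name below is made up just to illustrate the idea):

/* vhost_fd: open vhost device fd; error handling omitted.
 * 3 vqs spread over 2 workers: vq0 -> worker 0, vq1/vq2 -> worker 1. */
size_t len = sizeof(struct vhost_worker_state) + 3 * sizeof(__u16);
struct vhost_worker_state *s = calloc(1, len);

s->workers = 2;
s->queue_to_work_map[0] = 0;
s->queue_to_work_map[1] = 1;
s->queue_to_work_map[2] = 1;

/* Must be called before the backend is set, i.e. before the datapath can run. */
ioctl(vhost_fd, VHOST_SET_WORKER_MAP, s);   /* hypothetical ioctl name */
free(s);

The kernel would then create the requested number of workers and bind each vq to the worker index given in the map.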

>
> See at the very bottom for more info, because it's related to your
> free worker question.
>
>
> >
> >> Or we call flush from the file_operations release function
> >> so the device is closed and can't race with ioctls.
> >>
> >>>
> >>> Does this mean we need to hold vq mutex when doing the flush? (which
> >>> seems not the case of vhost_scsi_tmf_resp_work()).
> >>>
>  +}
>  +
>  +/* Caller must have device and virtqueue mutex */
>  +static int vhost_vq_attach_worker(struct vhost_virtqueue *vq,
>  + struct vhost_vring_worker *info)

Re: [PATCH v6 11/11] vhost: allow userspace to create workers

2023-04-10 Thread Mike Christie
On 4/10/23 2:04 AM, Jason Wang wrote:
> On Wed, Apr 5, 2023 at 7:08 AM Mike Christie wrote:
>>
>> On 4/4/23 3:00 AM, Jason Wang wrote:

 -static void vhost_worker_free(struct vhost_dev *dev)
 +static void vhost_worker_put(struct vhost_dev *dev, struct vhost_worker *worker)
  {
 -   struct vhost_worker *worker = dev->worker;
 -
 if (!worker)
 return;

 -   dev->worker = NULL;
 +   if (!refcount_dec_and_test(&worker->refcount))
 +   return;
 +
 WARN_ON(!llist_empty(&worker->work_list));
 vhost_task_stop(worker->vtsk);
 kfree(worker);
  }

 +static void vhost_vq_detach_worker(struct vhost_virtqueue *vq)
 +{
 +   if (vq->worker)
>>>
>>> What happens to the pending work that queues for the old worker?
>>
>> I didn't think there would be works queued at this time. I see your comment
>> below about my assumption about the backend being set being wrong. Will
>> discuss down there.
>>
>>

 +/* Caller must have device and virtqueue mutex */
 +static void __vhost_vq_attach_worker(struct vhost_virtqueue *vq,
 +struct vhost_worker *worker)
 +{
 +   refcount_inc(&worker->refcount);
 +   vhost_vq_detach_worker(vq);
 +   vq->worker = worker;
>>>
>>> What happens if there's a pending flush in a specific device (e.g in
>>> vhost_scsi_tmf_resp_work())?
>>
>> We wouldn't hit that specific case where we are running the above code and
>> vhost_scsi_tmf_resp_work.
>>
>> Either:
>>
>> 1. The backend is NULL and has never been set, and so we would never have
>> called vhost_scsi_tmf_resp_work, because it can only be called after the
>> backend is set.
>>
>> 2. If the backend has been set and vhost_scsi_tmf_resp_work has
>> run or is running, then we would not have called __vhost_vq_attach_worker
>> from vhost_vq_attach_worker because it would see vhost_vq_get_backend
>> returning a non-NULL value.
>>
>> If vhost_scsi later sets the backend to NULL, then vhost_scsi_clear_endpoint
>> will have made sure the flush has completed when the clear function returns.
>> It does that with the device mutex, so when we run __vhost_vq_attach_worker
>> it will only see a vq/worker with no flushes in progress.
> 
> Ok.
> 
>>
>> For the general case of whether we could be doing a vhost_dev_flush/vhost_work_flush_on
>> and __vhost_vq_attach_worker at the same time, I thought we are ok as well because we
>> currently have to hold the device mutex when we flush, so the flush can't race with
>> ioctl calls to vhost_vq_attach_worker since we hold the dev mutex during those ioctls.
> 
> I'm not sure I understand here, but we can't assume the user of
> vhost_work_queue() is called in the ioctl context. See

The comment above was only to answer the question about __vhost_vq_attach_worker
racing with works queued from a flush.

In general, I thought we did something to stop IO with the vq mutex
(clear the backend, stop polling, etc.) then do vhost_dev_flush under the
dev mutex. So once the flush is done there would not be any new vhost_work_queue
calls for the virtqueue.
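In other words, the ordering I'm assuming is roughly like this (a simplified sketch, not the exact vhost_scsi code):

mutex_lock(&dev->mutex);

mutex_lock(&vq->mutex);
vhost_vq_set_backend(vq, NULL);   /* after this no new work gets queued for the vq */
mutex_unlock(&vq->mutex);

vhost_dev_flush(dev);             /* waits for the work that was already queued */

mutex_unlock(&dev->mutex);

Since VHOST_ATTACH_VRING_WORKER also runs under dev->mutex, by the time it can swap vq->worker the flush above has already completed.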

For the vhost_zerocopy_callback case, when can handle_tx_zerocopy start to be called?
It looked like it's only called after the backend is set because handle_tx
checks for a non-null backend.




> vhost_zerocopy_callback(). But since you want to limit the call before
> set_backend, another question comes, do we really need the dynamic
> attaching/creating in this case? How about a simple one ioctl that is
> used to build the queue to workers mapping instead?


I didn't think we need the dynamic case. It was from a review comment.

See at the very bottom for more info, because it's related to your
free worker question.


> 
>> Or we call flush from the file_operations release function
>> so the device is closed and can't race with ioctls.
>>
>>>
>>> Does this mean we need to hold vq mutex when doing the flush? (which
>>> seems not the case of vhost_scsi_tmf_resp_work()).
>>>
 +}
 +
 +/* Caller must have device and virtqueue mutex */
 +static int vhost_vq_attach_worker(struct vhost_virtqueue *vq,
 + struct vhost_vring_worker *info)
 +{
 +   unsigned long index = info->worker_id;
 +   struct vhost_dev *dev = vq->dev;
 +   struct vhost_worker *worker;
 +
 +   if (!dev->use_worker)
 +   return -EINVAL;
 +
 +   /*
 +* We don't support setting a worker on an active vq to make flushing
 +* and removal simple.
 +*/
 +   if (vhost_vq_get_backend(vq))
 +   return -EBUSY;
>>>
>>> This assumes the worker won't be started until the backend is set
>>> which is not true.
>>
>> I can see how we get flushes before setting the backend, but I thought we are
>> ok because we have the device mutex held.
>>

Re: [PATCH v6 11/11] vhost: allow userspace to create workers

2023-04-10 Thread Jason Wang
On Wed, Apr 5, 2023 at 7:08 AM Mike Christie wrote:
>
> On 4/4/23 3:00 AM, Jason Wang wrote:
> >>
> >> -static void vhost_worker_free(struct vhost_dev *dev)
> >> +static void vhost_worker_put(struct vhost_dev *dev, struct vhost_worker *worker)
> >>  {
> >> -   struct vhost_worker *worker = dev->worker;
> >> -
> >> if (!worker)
> >> return;
> >>
> >> -   dev->worker = NULL;
> >> +   if (!refcount_dec_and_test(&worker->refcount))
> >> +   return;
> >> +
> >> WARN_ON(!llist_empty(&worker->work_list));
> >> vhost_task_stop(worker->vtsk);
> >> kfree(worker);
> >>  }
> >>
> >> +static void vhost_vq_detach_worker(struct vhost_virtqueue *vq)
> >> +{
> >> +   if (vq->worker)
> >
> > What happens to the pending work that queues for the old worker?
>
> I didn't think there would be works queued at this time. I see your comment
> below about my assumption about the backend being set being wrong. Will
> discuss down there.
>
>
> >>
> >> +/* Caller must have device and virtqueue mutex */
> >> +static void __vhost_vq_attach_worker(struct vhost_virtqueue *vq,
> >> +struct vhost_worker *worker)
> >> +{
> >> +   refcount_inc(&worker->refcount);
> >> +   vhost_vq_detach_worker(vq);
> >> +   vq->worker = worker;
> >
> > What happens if there's a pending flush in a specific device (e.g in
> > vhost_scsi_tmf_resp_work())?
>
> We wouldn't hit that specific case where we are running the above code and
> vhost_scsi_tmf_resp_work.
>
> Either:
>
> 1. The backend is NULL and has never been set, and so we would never have
> called vhost_scsi_tmf_resp_work, because it can only be called after the
> backend is set.
>
> 2. If the backend has been set and vhost_scsi_tmf_resp_work has
> run or is running, then we would not have called __vhost_vq_attach_worker
> from vhost_vq_attach_worker because it would see vhost_vq_get_backend
> returning a non-NULL value.
>
> If vhost_scsi later sets the backend to NULL, then vhost_scsi_clear_endpoint
> will have made sure the flush has completed when the clear function returns.
> It does that with the device mutex, so when we run __vhost_vq_attach_worker
> it will only see a vq/worker with no flushes in progress.

Ok.

>
> For the general case of whether we could be doing a vhost_dev_flush/vhost_work_flush_on
> and __vhost_vq_attach_worker at the same time, I thought we are ok as well because we
> currently have to hold the device mutex when we flush, so the flush can't race with
> ioctl calls to vhost_vq_attach_worker since we hold the dev mutex during those ioctls.

I'm not sure I understand here, but we can't assume that the user of
vhost_work_queue() calls it in the ioctl context. See
vhost_zerocopy_callback(). But since you want to limit the call to before
set_backend, another question comes up: do we really need the dynamic
attaching/creating in this case? How about a single ioctl that is
used to build the queue-to-workers mapping instead?

> Or we call flush from the file_operations release function
> so the device is closed and can't race with ioctls.
>
> >
> > Does this mean we need to hold vq mutex when doing the flush? (which
> > seems not the case of vhost_scsi_tmf_resp_work()).
> >
> >> +}
> >> +
> >> +/* Caller must have device and virtqueue mutex */
> >> +static int vhost_vq_attach_worker(struct vhost_virtqueue *vq,
> >> + struct vhost_vring_worker *info)
> >> +{
> >> +   unsigned long index = info->worker_id;
> >> +   struct vhost_dev *dev = vq->dev;
> >> +   struct vhost_worker *worker;
> >> +
> >> +   if (!dev->use_worker)
> >> +   return -EINVAL;
> >> +
> >> +   /*
> >> +* We don't support setting a worker on an active vq to make flushing
> >> +* and removal simple.
> >> +*/
> >> +   if (vhost_vq_get_backend(vq))
> >> +   return -EBUSY;
> >
> > This assumes the worker won't be started until the backend is set
> > which is not true.
>
> I can see how we get flushes before setting the backend, but I thought we are
> ok because we have the device mutex held.
>
> What are the other possible cases? Is one case we could get a
> VHOST_SET_VRING_KICK before setting the backend and so vhost_poll_start starts
> to allow vhost_poll_queue before the backend is set?

Yes, and it's not necessarily the kick; the networking core could wake
up workers before set_backend.

> Is there any thing else?

Haven't found time to do this, but what I want to say is, if we want
this assumption, we need to document it and change the devices that
are affected by this change.

>
> I'll fix this issue.
>
>
> >> +
> >> +/* Caller must have device mutex */
> >> +static int vhost_free_worker(struct vhost_dev *dev,
> >> +struct vhost_worker_state *info)
> >> +{
> >> +   unsigned long index = info->worker_id;
> >> +   struct vhost_worker *worker;
> >> +
> >> +   if 

Re: [PATCH v6 11/11] vhost: allow userspace to create workers

2023-04-04 Thread Mike Christie
On 4/4/23 3:00 AM, Jason Wang wrote:
>>
>> -static void vhost_worker_free(struct vhost_dev *dev)
>> +static void vhost_worker_put(struct vhost_dev *dev, struct vhost_worker *worker)
>>  {
>> -   struct vhost_worker *worker = dev->worker;
>> -
>> if (!worker)
>> return;
>>
>> -   dev->worker = NULL;
>> +   if (!refcount_dec_and_test(&worker->refcount))
>> +   return;
>> +
>> WARN_ON(!llist_empty(&worker->work_list));
>> vhost_task_stop(worker->vtsk);
>> kfree(worker);
>>  }
>>
>> +static void vhost_vq_detach_worker(struct vhost_virtqueue *vq)
>> +{
>> +   if (vq->worker)
> 
> What happens to the pending work that queues for the old worker?

I didn't think there would be works queued at this time. I see your comment
below about my assumption about the backend being set being wrong. Will
discuss down there.


>>
>> +/* Caller must have device and virtqueue mutex */
>> +static void __vhost_vq_attach_worker(struct vhost_virtqueue *vq,
>> +struct vhost_worker *worker)
>> +{
>> +   refcount_inc(&worker->refcount);
>> +   vhost_vq_detach_worker(vq);
>> +   vq->worker = worker;
> 
> What happens if there's a pending flush in a specific device (e.g in
> vhost_scsi_tmf_resp_work())?

We wouldn't hit that specific case where we are running the above code and
vhost_scsi_tmf_resp_work.

Either:

1. The backend is NULL and has never been set, and so we would never have
called vhost_scsi_tmf_resp_work, because it can only be called after the
backend is set.

2. If the backend has been set and vhost_scsi_tmf_resp_work has
run or is running, then we would not have called __vhost_vq_attach_worker
from vhost_vq_attach_worker because it would see vhost_vq_get_backend
returning a non-NULL value.

If vhost_scsi later sets the backend to NULL, then vhost_scsi_clear_endpoint
will have made sure the flush has completed when the clear function returns.
It does that with the device mutex, so when we run __vhost_vq_attach_worker
it will only see a vq/worker with no flushes in progress.

For the general case of whether we could be doing a vhost_dev_flush/vhost_work_flush_on
and __vhost_vq_attach_worker at the same time, I thought we are ok as well because we
currently have to hold the device mutex when we flush, so the flush can't race with
ioctl calls to vhost_vq_attach_worker since we hold the dev mutex during those ioctls.
Or we call flush from the file_operations release function, so the device is closed
and can't race with ioctls.

> 
> Does this mean we need to hold vq mutex when doing the flush? (which
> seems not the case of vhost_scsi_tmf_resp_work()).
> 
>> +}
>> +
>> +/* Caller must have device and virtqueue mutex */
>> +static int vhost_vq_attach_worker(struct vhost_virtqueue *vq,
>> + struct vhost_vring_worker *info)
>> +{
>> +   unsigned long index = info->worker_id;
>> +   struct vhost_dev *dev = vq->dev;
>> +   struct vhost_worker *worker;
>> +
>> +   if (!dev->use_worker)
>> +   return -EINVAL;
>> +
>> +   /*
>> +* We don't support setting a worker on an active vq to make flushing
>> +* and removal simple.
>> +*/
>> +   if (vhost_vq_get_backend(vq))
>> +   return -EBUSY;
> 
> This assumes the worker won't be started until the backend is set
> which is not true.

I can see how we get flushes before setting the backend, but I thought we are
ok because we have the device mutex held.

What are the other possible cases? Is one case that we could get a
VHOST_SET_VRING_KICK before setting the backend, so vhost_poll_start starts
to allow vhost_poll_queue before the backend is set? Is there anything else?

I'll fix this issue.


>> +
>> +/* Caller must have device mutex */
>> +static int vhost_free_worker(struct vhost_dev *dev,
>> +struct vhost_worker_state *info)
>> +{
>> +   unsigned long index = info->worker_id;
>> +   struct vhost_worker *worker;
>> +
>> +   if (!dev->use_worker)
>> +   return -EINVAL;
>> +
>> +   worker = xa_find(&dev->worker_xa, &index, UINT_MAX, XA_PRESENT);
> 
> So we use int for worker_id which conflicts with UINT_MAX here?

I switched from idr in the previous versions to xa at the last second and added
this mistake. Will fix.


> 
> struct vhost_worker_state {
> /*
>  * For VHOST_NEW_WORKER the kernel will return the new vhost_worker 
> id.
>  * For VHOST_FREE_WORKER this must be set to the id of the 
> vhost_worker
>  * to free.
>  */
> int worker_id;
> };
> 
> A side effect of using xa_index directly is that userspace can guess
> the xa_index of the default worker and free it here, is this intended?

I don't need the functionality. It was only something that I didn't
think mattered much, because you can only free the worker if it's not in
use by any virtqueues, so I didn't add any special code to handle it.
The user would 

Re: [PATCH v6 11/11] vhost: allow userspace to create workers

2023-04-04 Thread Jason Wang
On Tue, Mar 28, 2023 at 10:17 AM Mike Christie wrote:
>
> For vhost-scsi with 3 vqs and a workload that tries to use those vqs
> like:
>
> fio --filename=/dev/sdb  --direct=1 --rw=randrw --bs=4k \
> --ioengine=libaio --iodepth=128  --numjobs=3
>
> the single vhost worker thread will become a bottleneck and we are stuck
> at around 500K IOPs no matter how many jobs, virtqueues, and CPUs are
> used.
>
> To better utilize virtqueues and available CPUs, this patch allows
> userspace to create workers and bind them to vqs. You can have N workers
> per dev and also share N workers with M vqs.
>
> With the patches and doing a worker per vq, we can scale to at least
> 16 vCPUs/vqs (that's my system limit) with the same fio command
> above with numjobs=16:
>
> fio --filename=/dev/sdb  --direct=1 --rw=randrw --bs=4k \
> --ioengine=libaio --iodepth=64  --numjobs=16
>
> which gives around 2326K IOPs.
>
> Note that for testing I dropped depth to 64 above because the vhost/virt
> layer supports only 1024 total commands per device. And the only tuning I
> did was set LIO's emulate_pr to 0 to avoid LIO's PR lock in the main IO
> path which becomes an issue at around 12 jobs/virtqueues.
>
> Signed-off-by: Mike Christie 
> ---
>  drivers/vhost/vhost.c| 177 ---
>  drivers/vhost/vhost.h|   4 +-
>  include/uapi/linux/vhost.h   |  22 
>  include/uapi/linux/vhost_types.h |  15 +++
>  4 files changed, 204 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 1fa5e9a49092..e40699e83c6d 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -271,7 +271,11 @@ EXPORT_SYMBOL_GPL(vhost_vq_work_queue);
>
>  void vhost_dev_flush(struct vhost_dev *dev)
>  {
> -   vhost_work_flush_on(dev->worker);
> +   struct vhost_worker *worker;
> +   unsigned long i;
> +
> +   xa_for_each(&dev->worker_xa, i, worker)
> +   vhost_work_flush_on(worker);
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_flush);
>
> @@ -489,7 +493,6 @@ void vhost_dev_init(struct vhost_dev *dev,
> dev->umem = NULL;
> dev->iotlb = NULL;
> dev->mm = NULL;
> -   dev->worker = NULL;
> dev->iov_limit = iov_limit;
> dev->weight = weight;
> dev->byte_weight = byte_weight;
> @@ -499,7 +502,7 @@ void vhost_dev_init(struct vhost_dev *dev,
> INIT_LIST_HEAD(&dev->read_list);
> INIT_LIST_HEAD(&dev->pending_list);
> spin_lock_init(&dev->iotlb_lock);
> -
> +   xa_init_flags(&dev->worker_xa, XA_FLAGS_ALLOC);
>
> for (i = 0; i < dev->nvqs; ++i) {
> vq = dev->vqs[i];
> @@ -562,32 +565,67 @@ static void vhost_detach_mm(struct vhost_dev *dev)
> dev->mm = NULL;
>  }
>
> -static void vhost_worker_free(struct vhost_dev *dev)
> +static void vhost_worker_put(struct vhost_dev *dev, struct vhost_worker *worker)
>  {
> -   struct vhost_worker *worker = dev->worker;
> -
> if (!worker)
> return;
>
> -   dev->worker = NULL;
> +   if (!refcount_dec_and_test(&worker->refcount))
> +   return;
> +
> WARN_ON(!llist_empty(&worker->work_list));
> vhost_task_stop(worker->vtsk);
> kfree(worker);
>  }
>
> +static void vhost_vq_detach_worker(struct vhost_virtqueue *vq)
> +{
> +   if (vq->worker)

What happens to the pending work that queues for the old worker?

> +   vhost_worker_put(vq->dev, vq->worker);
> +   vq->worker = NULL;
> +}
> +
> +static void vhost_workers_free(struct vhost_dev *dev)
> +{
> +   struct vhost_worker *worker;
> +   unsigned long i;
> +
> +   if (!dev->use_worker)
> +   return;
> +
> +   for (i = 0; i < dev->nvqs; i++)
> +   vhost_vq_detach_worker(dev->vqs[i]);
> +   /*
> +* Drop the refcount taken during allocation, and handle the default
> +* worker and the cases where userspace might have crashed or was lazy
> +* and did a VHOST_NEW_WORKER but not a VHOST_FREE_WORKER.
> +*/
> +   xa_for_each(&dev->worker_xa, i, worker) {
> +   xa_erase(&dev->worker_xa, worker->id);
> +   vhost_worker_put(dev, worker);
> +   }
> +   xa_destroy(&dev->worker_xa);
> +}
> +
>  static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
>  {
> struct vhost_worker *worker;
> struct vhost_task *vtsk;
> char name[TASK_COMM_LEN];
> +   int ret;
> +   u32 id;
>
> worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
> if (!worker)
> return NULL;
>
> -   dev->worker = worker;
> worker->kcov_handle = kcov_common_handle();
> init_llist_head(&worker->work_list);
> +   /*
> +* We increase the refcount for the initial creation and then
> +* later each time it's attached to a virtqueue.
> +*/
> +   refcount_set(&worker->refcount, 1);
> snprintf(name, sizeof(name), "vhost-%d", current->pid);
>
> 

[PATCH v6 11/11] vhost: allow userspace to create workers

2023-03-27 Thread Mike Christie
For vhost-scsi with 3 vqs and a workload that tries to use those vqs
like:

fio --filename=/dev/sdb  --direct=1 --rw=randrw --bs=4k \
--ioengine=libaio --iodepth=128  --numjobs=3

the single vhost worker thread will become a bottleneck and we are stuck
at around 500K IOPs no matter how many jobs, virtqueues, and CPUs are
used.

To better utilize virtqueues and available CPUs, this patch allows
userspace to create workers and bind them to vqs. You can have N workers
per dev and also share N workers with M vqs.

With the patches and doing a worker per vq, we can scale to at least
16 vCPUs/vqs (that's my system limit) with the same fio command
above with numjobs=16:

fio --filename=/dev/sdb  --direct=1 --rw=randrw --bs=4k \
--ioengine=libaio --iodepth=64  --numjobs=16

which gives around 2326K IOPs.

Note that for testing I dropped depth to 64 above because the vhost/virt
layer supports only 1024 total commands per device. And the only tuning I
did was set LIO's emulate_pr to 0 to avoid LIO's PR lock in the main IO
path which becomes an issue at around 12 jobs/virtqueues.

Signed-off-by: Mike Christie 
---
 drivers/vhost/vhost.c| 177 ---
 drivers/vhost/vhost.h|   4 +-
 include/uapi/linux/vhost.h   |  22 
 include/uapi/linux/vhost_types.h |  15 +++
 4 files changed, 204 insertions(+), 14 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 1fa5e9a49092..e40699e83c6d 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -271,7 +271,11 @@ EXPORT_SYMBOL_GPL(vhost_vq_work_queue);
 
 void vhost_dev_flush(struct vhost_dev *dev)
 {
-   vhost_work_flush_on(dev->worker);
+   struct vhost_worker *worker;
+   unsigned long i;
+
+   xa_for_each(&dev->worker_xa, i, worker)
+   vhost_work_flush_on(worker);
 }
 EXPORT_SYMBOL_GPL(vhost_dev_flush);
 
@@ -489,7 +493,6 @@ void vhost_dev_init(struct vhost_dev *dev,
dev->umem = NULL;
dev->iotlb = NULL;
dev->mm = NULL;
-   dev->worker = NULL;
dev->iov_limit = iov_limit;
dev->weight = weight;
dev->byte_weight = byte_weight;
@@ -499,7 +502,7 @@ void vhost_dev_init(struct vhost_dev *dev,
INIT_LIST_HEAD(&dev->read_list);
INIT_LIST_HEAD(&dev->pending_list);
spin_lock_init(&dev->iotlb_lock);
-
+   xa_init_flags(&dev->worker_xa, XA_FLAGS_ALLOC);
 
for (i = 0; i < dev->nvqs; ++i) {
vq = dev->vqs[i];
@@ -562,32 +565,67 @@ static void vhost_detach_mm(struct vhost_dev *dev)
dev->mm = NULL;
 }
 
-static void vhost_worker_free(struct vhost_dev *dev)
+static void vhost_worker_put(struct vhost_dev *dev, struct vhost_worker *worker)
 {
-   struct vhost_worker *worker = dev->worker;
-
if (!worker)
return;
 
-   dev->worker = NULL;
+   if (!refcount_dec_and_test(&worker->refcount))
+   return;
+
WARN_ON(!llist_empty(&worker->work_list));
vhost_task_stop(worker->vtsk);
kfree(worker);
 }
 
+static void vhost_vq_detach_worker(struct vhost_virtqueue *vq)
+{
+   if (vq->worker)
+   vhost_worker_put(vq->dev, vq->worker);
+   vq->worker = NULL;
+}
+
+static void vhost_workers_free(struct vhost_dev *dev)
+{
+   struct vhost_worker *worker;
+   unsigned long i;
+
+   if (!dev->use_worker)
+   return;
+
+   for (i = 0; i < dev->nvqs; i++)
+   vhost_vq_detach_worker(dev->vqs[i]);
+   /*
+* Drop the refcount taken during allocation, and handle the default
+* worker and the cases where userspace might have crashed or was lazy
+* and did a VHOST_NEW_WORKER but not a VHOST_FREE_WORKER.
+*/
+   xa_for_each(&dev->worker_xa, i, worker) {
+   xa_erase(&dev->worker_xa, worker->id);
+   vhost_worker_put(dev, worker);
+   }
+   xa_destroy(&dev->worker_xa);
+}
+
 static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
 {
struct vhost_worker *worker;
struct vhost_task *vtsk;
char name[TASK_COMM_LEN];
+   int ret;
+   u32 id;
 
worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
if (!worker)
return NULL;
 
-   dev->worker = worker;
worker->kcov_handle = kcov_common_handle();
init_llist_head(&worker->work_list);
+   /*
+* We increase the refcount for the initial creation and then
+* later each time it's attached to a virtqueue.
+*/
+   refcount_set(&worker->refcount, 1);
snprintf(name, sizeof(name), "vhost-%d", current->pid);
 
vtsk = vhost_task_create(vhost_worker, worker, name);
@@ -596,14 +634,104 @@ static struct vhost_worker *vhost_worker_create(struct vhost_dev *dev)
 
worker->vtsk = vtsk;
vhost_task_start(vtsk);
+
+   ret = xa_alloc(&dev->worker_xa, &id, worker, xa_limit_32b, GFP_KERNEL);
+   if (ret < 0)
+   goto stop_worker;
+   worker->id = id;
+