Re: [PATCH] virtio_ring: validate used buffer length

2023-05-31 Thread Michael S. Tsirkin
On Thu, Jun 01, 2023 at 09:27:08AM +0800, Jason Wang wrote:
> On Wed, May 31, 2023 at 6:25 PM Michael S. Tsirkin  wrote:
> >
> > On Wed, May 31, 2023 at 04:26:38PM +0800, Jason Wang wrote:
> > > On Wed, May 31, 2023 at 3:36 PM Jason Wang  wrote:
> > > >
> > > > On Wed, May 31, 2023 at 1:50 PM Michael S. Tsirkin  
> > > > wrote:
> > > > >
> > > > > On Wed, May 31, 2023 at 09:05:00AM +0800, Jason Wang wrote:
> > > > > > On Mon, May 29, 2023 at 6:03 PM Michael S. Tsirkin 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Mon, May 29, 2023 at 09:18:10AM +0800, Jason Wang wrote:
> > > > > > > > On Sun, May 28, 2023 at 3:57 PM Michael S. Tsirkin 
> > > > > > > >  wrote:
> > > > > > > > >
> > > > > > > > > On Fri, May 26, 2023 at 02:30:41PM +0800, Jason Wang wrote:
> > > > > > > > > > This patch validate
> > > > > > > > >
> > > > > > > > > validates
> > > > > > > > >
> > > > > > > > > > the used buffer length provided by the device
> > > > > > > > > > before trying to use it.
> > > > > > > > >
> > > > > > > > > before returning it to caller
> > > > > > > > >
> > > > > > > > > > This is done by remembering the in buffer
> > > > > > > > > > length in a dedicated array during virtqueue_add(), then we 
> > > > > > > > > > can fail
> > > > > > > > > > the virtqueue_get_buf() when we find the device is trying 
> > > > > > > > > > to give us a
> > > > > > > > > > used buffer length which is greater than we stored before.
> > > > > > > > >
> > > > > > > > > than what we stored
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > This validation is disable
> > > > > > > > >
> > > > > > > > > disabled
> > > > > > > > >
> > > > > > > > > > by default via module parameter to unbreak
> > > > > > > > > > some existing devices since some legacy devices are known 
> > > > > > > > > > to report
> > > > > > > > > > buggy used length.
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Jason Wang 
> > > > > > > > >
> > > > > > > > > First I'm not merging this without more data about
> > > > > > > > > what is known to be broken and what is known to work well
> > > > > > > > > in the commit log. And how exactly do things work if used 
> > > > > > > > > length
> > > > > > > > > is wrong?
> > > > > > > >
> > > > > > > > Assuming the device is malicious, it would be very hard to 
> > > > > > > > answer.
> > > > > > > > Auditing and fuzzing won't cover every case. Instead of trying 
> > > > > > > > to seek
> > > > > > > > the answer, we can simply make sure the used in buffer length is
> > > > > > > > validated then we know we're fine or not.
> > > > > > >
> > > > > > > To restate the question, you said above "some legacy devices are 
> > > > > > > known
> > > > > > > to report buggy used length". If they report buggy length then how
> > > > > > > can things work?
> > > > > >
> > > > > > The validation is disabled for legacy device (as stated in the 
> > > > > > changelog):
> > > > > >
> > > > > > static bool vring_needs_used_validation(const struct virtio_device 
> > > > > > *vdev)
> > > > > > {
> > > > > > /*
> > > > > >  * Several legacy devices are known to produce buggy used
> > > > > >  * length. In order to let driver work, we won't validate 
> > > > > > used
> > > > > >  * buffer length in this case.
> > > > > >  */
> > > > > > if (!virtio_has_feature(vdev, VIRTIO_F_VERSION_1))
> > > > > > return false;
> > > > > > if (force_used_validation)
> > > > > > return true;
> > > > > > return false;
> > > > > > }
> > > > > >
> > > > > > This seems to be what we've agreed in last version:
> > > > > >
> > > > > > https://lore.kernel.org/all/canlsykxfhamuu0bb4j7y6n4_g9odkxlcjxxgxex4sj6_kf+...@mail.gmail.com/T/#m31f3b06f9032beec175c312dfa2532cb08b15c56
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > >
> > > > > I don't get it. You wrote:
> > > > >
> > > > > This validation is disable
> > > > > by default via module parameter to unbreak
> > > > > some existing devices since some legacy devices are known to 
> > > > > report
> > > > > buggy used length.
> > > > >
> > > > > which devices?
> > > >
> > > > legacy rpmsg and vsock device (before 49d8c5ffad07) at least.
> > > >
> > > > > why do you need a module parameter?
> > > >
> > > > If we enable it unconditionally for modern devices, it may break some
> > > > > buggy modern device (vsock without a fix as an example).
> > > >
> > > > >
> > > > >
> > > > > > >
> > > > > > > > > Second what's wrong with dma_desc_extra that we already 
> > > > > > > > > maintain?
> > > > > > > > > Third motivation - it's part and parcel of the hardening 
> > > > > > > > > effort yes?
> > > > > > > >
> > > > > > > > They are different. dma_desc_extra is for a descriptor ring, 
> > > > > > > > but this
> > > > > > > > is for a used ring. Technically we can go back to iterate on the
> > > > > > > > descriptor ring for a legal used in buffer length. But it will 
> > > > > > > > have

Re: [PATCH] virtio_ring: validate used buffer length

2023-05-31 Thread Michael S. Tsirkin
On Thu, Jun 01, 2023 at 09:12:53AM +0800, Jason Wang wrote:
> On Wed, May 31, 2023 at 5:55 PM Michael S. Tsirkin  wrote:
> >
> > On Wed, May 31, 2023 at 03:36:51PM +0800, Jason Wang wrote:
> > > On Wed, May 31, 2023 at 1:50 PM Michael S. Tsirkin  
> > > wrote:
> > > >
> > > > On Wed, May 31, 2023 at 09:05:00AM +0800, Jason Wang wrote:
> > > > > On Mon, May 29, 2023 at 6:03 PM Michael S. Tsirkin  
> > > > > wrote:
> > > > > >
> > > > > > On Mon, May 29, 2023 at 09:18:10AM +0800, Jason Wang wrote:
> > > > > > > On Sun, May 28, 2023 at 3:57 PM Michael S. Tsirkin 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Fri, May 26, 2023 at 02:30:41PM +0800, Jason Wang wrote:
> > > > > > > > > This patch validate
> > > > > > > >
> > > > > > > > validates
> > > > > > > >
> > > > > > > > > the used buffer length provided by the device
> > > > > > > > > before trying to use it.
> > > > > > > >
> > > > > > > > before returning it to caller
> > > > > > > >
> > > > > > > > > This is done by remembering the in buffer
> > > > > > > > > length in a dedicated array during virtqueue_add(), then we 
> > > > > > > > > can fail
> > > > > > > > > the virtqueue_get_buf() when we find the device is trying to 
> > > > > > > > > give us a
> > > > > > > > > used buffer length which is greater than we stored before.
> > > > > > > >
> > > > > > > > than what we stored
> > > > > > > >
> > > > > > > > >
> > > > > > > > > This validation is disable
> > > > > > > >
> > > > > > > > disabled
> > > > > > > >
> > > > > > > > > by default via module parameter to unbreak
> > > > > > > > > some existing devices since some legacy devices are known to 
> > > > > > > > > report
> > > > > > > > > buggy used length.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Jason Wang 
> > > > > > > >
> > > > > > > > First I'm not merging this without more data about
> > > > > > > > what is known to be broken and what is known to work well
> > > > > > > > in the commit log. And how exactly do things work if used length
> > > > > > > > is wrong?
> > > > > > >
> > > > > > > Assuming the device is malicious, it would be very hard to answer.
> > > > > > > Auditing and fuzzing won't cover every case. Instead of trying to 
> > > > > > > seek
> > > > > > > the answer, we can simply make sure the used in buffer length is
> > > > > > > validated then we know we're fine or not.
> > > > > >
> > > > > > To restate the question, you said above "some legacy devices are 
> > > > > > known
> > > > > > to report buggy used length". If they report buggy length then how
> > > > > > can things work?
> > > > >
> > > > > The validation is disabled for legacy device (as stated in the 
> > > > > changelog):
> > > > >
> > > > > static bool vring_needs_used_validation(const struct virtio_device 
> > > > > *vdev)
> > > > > {
> > > > > /*
> > > > >  * Several legacy devices are known to produce buggy used
> > > > >  * length. In order to let driver work, we won't validate used
> > > > >  * buffer length in this case.
> > > > >  */
> > > > > if (!virtio_has_feature(vdev, VIRTIO_F_VERSION_1))
> > > > > return false;
> > > > > if (force_used_validation)
> > > > > return true;
> > > > > return false;
> > > > > }
> > > > >
> > > > > This seems to be what we've agreed in last version:
> > > > >
> > > > > https://lore.kernel.org/all/canlsykxfhamuu0bb4j7y6n4_g9odkxlcjxxgxex4sj6_kf+...@mail.gmail.com/T/#m31f3b06f9032beec175c312dfa2532cb08b15c56
> > > > >
> > > > > Thanks
> > > > >
> > > >
> > > > I don't get it. You wrote:
> > > >
> > > > This validation is disable
> > > > by default via module parameter to unbreak
> > > > some existing devices since some legacy devices are known to 
> > > > report
> > > > buggy used length.
> > > >
> > > > which devices?
> > >
> > > legacy rpmsg and vsock device (before 49d8c5ffad07) at least.
> > >
> > > > why do you need a module parameter?
> > >
> > > If we enable it unconditionally for modern devices, it may break some
> > > buggy modern device (vsock without a fix as an example).
> >
> > Presumably this happens because vsock does not have any inbuf at all
> > so it ignores the length.
> > We had the same with virtio net tx a long time ago.
> >
> > My suggestion is then not to fail - just cap length at the dma length
> > set by driver. Another idea is that if dma len is 0 then don't validate
> > at all - driver that did not add any inbufs is not going to look
> > at length.
> >
> > Allowing passing NULL as length and skipping validation
> > if len = 0 might also be a good idea.
> 
> The above should work for sure, but a question is, in the case of
> confidential computing, if a driver detects a malicious device, is it
> really good to try to keep working or warn and disable the device? Or
> maybe we can have an option for the user to decide?

A minor bug in an unused field is not worth panicking about.
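
A minimal sketch of the two policies weighed above, assuming (as in the
posted patch) that the driver-provided in-buffer length was stashed per
descriptor at virtqueue_add() time; the helper name and the "strict" switch
are hypothetical, not taken from the patch:

/*
 * Hypothetical helper: "stored_len" stands in for the per-descriptor
 * length array the patch records in virtqueue_add().
 */
static unsigned int vring_check_used_len(unsigned int stored_len,
                                         unsigned int used_len,
                                         bool strict)
{
        /* The driver added no in-buffers, so it will not look at the length. */
        if (stored_len == 0)
                return used_len;

        if (likely(used_len <= stored_len))
                return used_len;

        if (strict) {
                /* Hardened mode: warn and report nothing consumed. */
                WARN_ONCE(1, "used len %u exceeds in-buffer len %u\n",
                          used_len, stored_len);
                return 0;
        }

        /* Lenient mode: cap at what the driver actually provided. */
        return stored_len;
}

Whether the strict branch should only warn, fail the buffer, or disable the
device entirely is exactly the open question in the exchange above.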

> >
> >
> > > >
> > > >
> 

Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-31 Thread Jason Wang
On Wed, May 31, 2023 at 5:14 PM Oleg Nesterov  wrote:
>
> On 05/31, Jason Wang wrote:
> >
> > On Wed, May 31, 2023 at 3:25 PM Oleg Nesterov  wrote:
> > >
> > > On 05/31, Jason Wang wrote:
> > > >
> > > > On 2023/5/23 20:15, Oleg Nesterov wrote:
> > > > >
> > > > > /* make sure flag is seen after deletion */
> > > > > smp_wmb();
> > > > > llist_for_each_entry_safe(work, work_next, node, node) {
> > > > > clear_bit(VHOST_WORK_QUEUED, &work->flags);
> > > > >
> > > > >I am not sure about smp_wmb + clear_bit. Once we clear 
> > > > >VHOST_WORK_QUEUED,
> > > > >vhost_work_queue() can add this work again and change work->node->next.
> > > > >
> > > > >That is why we use _safe, but we need to ensure that 
> > > > >llist_for_each_safe()
> > > > >completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared.
> > > >
> > > > This should be fine since store is not speculated, so work->node->next 
> > > > needs
> > > > to be loaded before VHOST_WORK_QUEUED is cleared to meet the loop 
> > > > condition.
> > >
> > > I don't understand you. OK, to simplify, suppose we have 2 global vars
> > >
> > > void *PTR = something_non_null;
> > > unsigned long FLAGS = -1ul;
> > >
> > > Now I think this code
> > >
> > >         CPU_0                           CPU_1
> > >
> > >         void *ptr = PTR;                if (!test_and_set_bit(0, FLAGS))
> > >         clear_bit(0, FLAGS);                    PTR = NULL;
> > >         BUG_ON(!ptr);
> > >
> > > is racy and can hit the BUG_ON(!ptr).
> >
> > This seems different to the above case?
>
> not sure,
>
> > And you can hit BUG_ON with
> > the following execution sequence:
> >
> > [cpu 0] clear_bit(0, FLAGS);
> > [cpu 1] if (!test_and_set_bit(0, FLAGS))
> > [cpu 1] PTR = NULL;
> > [cpu 0] BUG_ON(!ptr)
>
> I don't understand this part... yes, we can hit this BUG_ON() without mb in
> between, this is what I tried to say.

I may miss something, but the above is the sequence that is executed
by the processor (for each CPU, it's just the program order). So where
do you expect placing an mb to help?

>
> > In vhost code, there's a condition before the clear_bit() which sits
> > inside llist_for_each_entry_safe():
> >
> > #define llist_for_each_entry_safe(pos, n, node, member)                       \
> >         for (pos = llist_entry((node), typeof(*pos), member);                 \
> >              member_address_is_nonnull(pos, member) &&                        \
> >              (n = llist_entry(pos->member.next, typeof(*n), member), true);   \
> >              pos = n)
> >
> > The clear_bit() is a store which is not speculated, so there's a
> > control dependency, the store can't be executed until the condition
> > expression is evaluated which requires pos->member.next
> > (work->node.next) to be loaded.
>
> But llist_for_each_entry_safe() doesn't check "n", I mean, it is not that we 
> have
> something like
>
> n = llist_entry(...);
> if (n)
>         clear_bit(...);
>
> so I do not see how can we rely on the load-store control dependency.

Just to make sure we are on the same page, the condition expression is

member_address_is_nonnull(pos, member) && (n =
llist_entry(pos->member.next, typeof(*n), member), true)

So it's something like:

if (work->node && (work_next = work->node->next, true))
        clear_bit(&work->flags);

So two loads from both work->node and work->node->next, and there's a
store which is clear_bit, then it's a load-store control dependency?

Thanks

>
> Oleg.
>


Re: [PATCH] virtio_ring: validate used buffer length

2023-05-31 Thread Jason Wang
On Wed, May 31, 2023 at 6:25 PM Michael S. Tsirkin  wrote:
>
> On Wed, May 31, 2023 at 04:26:38PM +0800, Jason Wang wrote:
> > On Wed, May 31, 2023 at 3:36 PM Jason Wang  wrote:
> > >
> > > On Wed, May 31, 2023 at 1:50 PM Michael S. Tsirkin  
> > > wrote:
> > > >
> > > > On Wed, May 31, 2023 at 09:05:00AM +0800, Jason Wang wrote:
> > > > > On Mon, May 29, 2023 at 6:03 PM Michael S. Tsirkin  
> > > > > wrote:
> > > > > >
> > > > > > On Mon, May 29, 2023 at 09:18:10AM +0800, Jason Wang wrote:
> > > > > > > On Sun, May 28, 2023 at 3:57 PM Michael S. Tsirkin 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Fri, May 26, 2023 at 02:30:41PM +0800, Jason Wang wrote:
> > > > > > > > > This patch validate
> > > > > > > >
> > > > > > > > validates
> > > > > > > >
> > > > > > > > > the used buffer length provided by the device
> > > > > > > > > before trying to use it.
> > > > > > > >
> > > > > > > > before returning it to caller
> > > > > > > >
> > > > > > > > > This is done by remembering the in buffer
> > > > > > > > > length in a dedicated array during virtqueue_add(), then we 
> > > > > > > > > can fail
> > > > > > > > > the virtqueue_get_buf() when we find the device is trying to 
> > > > > > > > > give us a
> > > > > > > > > used buffer length which is greater than we stored before.
> > > > > > > >
> > > > > > > > than what we stored
> > > > > > > >
> > > > > > > > >
> > > > > > > > > This validation is disable
> > > > > > > >
> > > > > > > > disabled
> > > > > > > >
> > > > > > > > > by default via module parameter to unbreak
> > > > > > > > > some existing devices since some legacy devices are known to 
> > > > > > > > > report
> > > > > > > > > buggy used length.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Jason Wang 
> > > > > > > >
> > > > > > > > First I'm not merging this without more data about
> > > > > > > > what is known to be broken and what is known to work well
> > > > > > > > in the commit log. And how exactly do things work if used length
> > > > > > > > is wrong?
> > > > > > >
> > > > > > > Assuming the device is malicious, it would be very hard to answer.
> > > > > > > Auditing and fuzzing won't cover every case. Instead of trying to 
> > > > > > > seek
> > > > > > > the answer, we can simply make sure the used in buffer length is
> > > > > > > validated then we know we're fine or not.
> > > > > >
> > > > > > To restate the question, you said above "some legacy devices are 
> > > > > > known
> > > > > > to report buggy used length". If they report buggy length then how
> > > > > > can things work?
> > > > >
> > > > > The validation is disabled for legacy device (as stated in the 
> > > > > changelog):
> > > > >
> > > > > static bool vring_needs_used_validation(const struct virtio_device 
> > > > > *vdev)
> > > > > {
> > > > > /*
> > > > >  * Several legacy devices are known to produce buggy used
> > > > >  * length. In order to let driver work, we won't validate used
> > > > >  * buffer length in this case.
> > > > >  */
> > > > > if (!virtio_has_feature(vdev, VIRTIO_F_VERSION_1))
> > > > > return false;
> > > > > if (force_used_validation)
> > > > > return true;
> > > > > return false;
> > > > > }
> > > > >
> > > > > This seems to be what we've agreed in last version:
> > > > >
> > > > > https://lore.kernel.org/all/canlsykxfhamuu0bb4j7y6n4_g9odkxlcjxxgxex4sj6_kf+...@mail.gmail.com/T/#m31f3b06f9032beec175c312dfa2532cb08b15c56
> > > > >
> > > > > Thanks
> > > > >
> > > >
> > > > I don't get it. You wrote:
> > > >
> > > > This validation is disable
> > > > by default via module parameter to unbreak
> > > > some existing devices since some legacy devices are known to 
> > > > report
> > > > buggy used length.
> > > >
> > > > which devices?
> > >
> > > legacy rpmsg and vsock device (before 49d8c5ffad07) at least.
> > >
> > > > why do you need a module parameter?
> > >
> > > If we enable it unconditionally for modern devices, it may break some
> > > buggy modern device (vsock without a fix as an example).
> > >
> > > >
> > > >
> > > > > >
> > > > > > > > Second what's wrong with dma_desc_extra that we already 
> > > > > > > > maintain?
> > > > > > > > Third motivation - it's part and parcel of the hardening effort 
> > > > > > > > yes?
> > > > > > >
> > > > > > > They are different. dma_desc_extra is for a descriptor ring, but 
> > > > > > > this
> > > > > > > is for a used ring. Technically we can go back to iterate on the
> > > > > > > descriptor ring for a legal used in buffer length. But it will 
> > > > > > > have
> > > > > > > worse performance.
> > > > > >
> > > > > > I don't really understand. We already iterate when we unmap -
> > > > > > all that is necessary is to subtract it from used length, if at
> > > > > > the end of the process it is >0 then we know used length is too
> > > > > > large.
> > > > >
> > > > > Yes, but it is the job that is done

Re: [PATCH] virtio_ring: validate used buffer length

2023-05-31 Thread Jason Wang
On Wed, May 31, 2023 at 5:55 PM Michael S. Tsirkin  wrote:
>
> On Wed, May 31, 2023 at 03:36:51PM +0800, Jason Wang wrote:
> > On Wed, May 31, 2023 at 1:50 PM Michael S. Tsirkin  wrote:
> > >
> > > On Wed, May 31, 2023 at 09:05:00AM +0800, Jason Wang wrote:
> > > > On Mon, May 29, 2023 at 6:03 PM Michael S. Tsirkin  
> > > > wrote:
> > > > >
> > > > > On Mon, May 29, 2023 at 09:18:10AM +0800, Jason Wang wrote:
> > > > > > On Sun, May 28, 2023 at 3:57 PM Michael S. Tsirkin 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Fri, May 26, 2023 at 02:30:41PM +0800, Jason Wang wrote:
> > > > > > > > This patch validate
> > > > > > >
> > > > > > > validates
> > > > > > >
> > > > > > > > the used buffer length provided by the device
> > > > > > > > before trying to use it.
> > > > > > >
> > > > > > > before returning it to caller
> > > > > > >
> > > > > > > > This is done by remembering the in buffer
> > > > > > > > length in a dedicated array during virtqueue_add(), then we can 
> > > > > > > > fail
> > > > > > > > the virtqueue_get_buf() when we find the device is trying to 
> > > > > > > > give us a
> > > > > > > > used buffer length which is greater than we stored before.
> > > > > > >
> > > > > > > than what we stored
> > > > > > >
> > > > > > > >
> > > > > > > > This validation is disable
> > > > > > >
> > > > > > > disabled
> > > > > > >
> > > > > > > > by default via module parameter to unbreak
> > > > > > > > some existing devices since some legacy devices are known to 
> > > > > > > > report
> > > > > > > > buggy used length.
> > > > > > > >
> > > > > > > > Signed-off-by: Jason Wang 
> > > > > > >
> > > > > > > First I'm not merging this without more data about
> > > > > > > what is known to be broken and what is known to work well
> > > > > > > in the commit log. And how exactly do things work if used length
> > > > > > > is wrong?
> > > > > >
> > > > > > Assuming the device is malicious, it would be very hard to answer.
> > > > > > Auditing and fuzzing won't cover every case. Instead of trying to 
> > > > > > seek
> > > > > > the answer, we can simply make sure the used in buffer length is
> > > > > > validated then we know we're fine or not.
> > > > >
> > > > > To restate the question, you said above "some legacy devices are known
> > > > > to report buggy used length". If they report buggy length then how
> > > > > can things work?
> > > >
> > > > The validation is disabled for legacy device (as stated in the 
> > > > changelog):
> > > >
> > > > static bool vring_needs_used_validation(const struct virtio_device 
> > > > *vdev)
> > > > {
> > > > /*
> > > >  * Several legacy devices are known to produce buggy used
> > > >  * length. In order to let driver work, we won't validate used
> > > >  * buffer length in this case.
> > > >  */
> > > > if (!virtio_has_feature(vdev, VIRTIO_F_VERSION_1))
> > > > return false;
> > > > if (force_used_validation)
> > > > return true;
> > > > return false;
> > > > }
> > > >
> > > > This seems to be what we've agreed in last version:
> > > >
> > > > https://lore.kernel.org/all/canlsykxfhamuu0bb4j7y6n4_g9odkxlcjxxgxex4sj6_kf+...@mail.gmail.com/T/#m31f3b06f9032beec175c312dfa2532cb08b15c56
> > > >
> > > > Thanks
> > > >
> > >
> > > I don't get it. You wrote:
> > >
> > > This validation is disable
> > > by default via module parameter to unbreak
> > > some existing devices since some legacy devices are known to 
> > > report
> > > buggy used length.
> > >
> > > which devices?
> >
> > legacy rpmsg and vsock device (before 49d8c5ffad07) at least.
> >
> > > why do you need a module parameter?
> >
> > If we enable it unconditionally for modern devices, it may break some
> > buggy modern device (vsock without a fix as an example).
>
> Presumably this happens because vsock does not have any inbuf at all
> so it ignores the length.
> We had the same with virtio net tx a long time ago.
>
> My suggestion is then not to fail - just cap length at the dma length
> set by driver. Another idea is that if dma len is 0 then don't validate
> at all - driver that did not add any inbufs is not going to look
> at length.
>
> Allowing passing NULL as length and skipping validation
> if len = 0 might also be a good idea.

The above should work for sure, but a question is, in the case of
confidential computing, if a driver detects a malicious device, is it
really good to try to keep working or warn and disable the device? Or
maybe we can have an option for the user to decide?

>
>
> > >
> > >
> > > > >
> > > > > > > Second what's wrong with dma_desc_extra that we already maintain?
> > > > > > > Third motivation - it's part and parcel of the hardening effort 
> > > > > > > yes?
> > > > > >
> > > > > > They are different. dma_desc_extra is for a descriptor ring, but 
> > > > > > this
> > > > > > is for a used ring. Technically we can go back to iterate on the
> > > > > > d

Re: [PATCH V2] virtio-fs: Improved request latencies when Virtio queue is full

2023-05-31 Thread Vivek Goyal
On Wed, May 31, 2023 at 10:34:15PM +0200, Peter-Jan Gootzen wrote:
> On 31/05/2023 21:18, Vivek Goyal wrote:
> > On Wed, May 31, 2023 at 07:10:32PM +0200, Peter-Jan Gootzen wrote:
> >> When the Virtio queue is full, a work item is scheduled
> >> to execute in 1ms that retries adding the request to the queue.
> >> This is a large amount of time on the scale on which a
> >> virtio-fs device can operate. When using a DPU this is around
> >> 40us baseline without going to a remote server (4k, QD=1).
> >> This patch queues requests when the Virtio queue is full,
> >> and when a completed request is taken off, immediately fills
> >> it back up with queued requests.
> >>
> >> This reduces the 99.9th percentile latencies in our tests by
> >> 60x and slightly increases the overall throughput, when using a
> >> queue depth 2x the size of the Virtio queue size, with a
> >> DPU-powered virtio-fs device.
> >>
> >> Signed-off-by: Peter-Jan Gootzen 
> >> ---
> >> V1 -> V2: Not scheduling dispatch work anymore when not needed
> >> and changed delayed_work structs to work_struct structs
> >>
> >>  fs/fuse/virtio_fs.c | 32 +---
> >>  1 file changed, 17 insertions(+), 15 deletions(-)
> >>
> >> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> >> index 4d8d4f16c727..a676297db09b 100644
> >> --- a/fs/fuse/virtio_fs.c
> >> +++ b/fs/fuse/virtio_fs.c
> >> @@ -45,7 +45,7 @@ struct virtio_fs_vq {
> >>struct work_struct done_work;
> >>struct list_head queued_reqs;
> >>struct list_head end_reqs;  /* End these requests */
> >> -  struct delayed_work dispatch_work;
> >> +  struct work_struct dispatch_work;
> >>struct fuse_dev *fud;
> >>bool connected;
> >>long in_flight;
> >> @@ -202,7 +202,7 @@ static void virtio_fs_drain_queue(struct virtio_fs_vq 
> >> *fsvq)
> >>}
> >>  
> >>flush_work(&fsvq->done_work);
> >> -  flush_delayed_work(&fsvq->dispatch_work);
> >> +  flush_work(&fsvq->dispatch_work);
> >>  }
> >>  
> >>  static void virtio_fs_drain_all_queues_locked(struct virtio_fs *fs)
> >> @@ -346,6 +346,9 @@ static void virtio_fs_hiprio_done_work(struct 
> >> work_struct *work)
> >>dec_in_flight_req(fsvq);
> >>}
> >>} while (!virtqueue_enable_cb(vq) && likely(!virtqueue_is_broken(vq)));
> >> +
> >> +  if (!list_empty(&fsvq->queued_reqs))
> >> +  schedule_work(&fsvq->dispatch_work);
> >>spin_unlock(&fsvq->lock);
> >>  }
> >>  
> >> @@ -353,7 +356,7 @@ static void virtio_fs_request_dispatch_work(struct 
> >> work_struct *work)
> >>  {
> >>struct fuse_req *req;
> >>struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq,
> >> -   dispatch_work.work);
> >> +   dispatch_work);
> >>int ret;
> >>  
> >>pr_debug("virtio-fs: worker %s called.\n", __func__);
> >> @@ -388,8 +391,6 @@ static void virtio_fs_request_dispatch_work(struct 
> >> work_struct *work)
> >>if (ret == -ENOMEM || ret == -ENOSPC) {
> >>spin_lock(&fsvq->lock);
> >>list_add_tail(&req->list, &fsvq->queued_reqs);
> >> -  schedule_delayed_work(&fsvq->dispatch_work,
> >> -msecs_to_jiffies(1));
> > 
> > Virtqueue being full is only one of the reasons for failure to queue
> > the request. What if virtqueue is empty but we could not queue the
> > request because lack of memory (-ENOMEM). In that case we will queue
> > the request and it might not be dispatched because there is no completion.
> > (Assume there is no further new request coming). That means deadlock?
> > 
> > Thanks
> > Vivek
> > 
> 
> Good catch; that will deadlock.
> 
> Is the default kernel behavior to indefinitely retry a file system
> request until memory is available?

As of now that seems to be the behavior. I think I had copied this
code from another driver. 

But I guess one can argue that if memory is not available, then
return -ENOMEM to user space instead of retrying in kernel.

Stefan, Miklos, WDYT?

Thanks
Vivek
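
One way to avoid the deadlock pointed out above, sketched only to illustrate
the discussion: keep the completion-driven kick for -ENOSPC, where a later
completion is expected to re-arm dispatch_work, and fail the request back to
the submitter for -ENOMEM, where no completion is guaranteed. The names follow
the quoted patch, but the -ENOMEM error plumbing below mirrors how the driver
already handles other enqueue errors and is an assumption, not part of the
patch:

        if (ret == -ENOSPC) {
                /* Virtqueue full: virtio_fs_requests_done_work() will kick
                 * dispatch_work again once a request completes. */
                spin_lock(&fsvq->lock);
                list_add_tail(&req->list, &fsvq->queued_reqs);
                spin_unlock(&fsvq->lock);
                return;
        }
        if (ret == -ENOMEM) {
                /* No completion may ever arrive, so do not park the request
                 * on queued_reqs; return the error to the submitter instead. */
                req->out.h.error = ret;
                spin_lock(&fsvq->lock);
                dec_in_flight_req(fsvq);
                spin_unlock(&fsvq->lock);
                fuse_request_end(req);
                return;
        }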



Re: [PATCH V2] virtio-fs: Improved request latencies when Virtio queue is full

2023-05-31 Thread Peter-Jan Gootzen via Virtualization
On 31/05/2023 21:18, Vivek Goyal wrote:
> On Wed, May 31, 2023 at 07:10:32PM +0200, Peter-Jan Gootzen wrote:
>> When the Virtio queue is full, a work item is scheduled
>> to execute in 1ms that retries adding the request to the queue.
>> This is a large amount of time on the scale on which a
>> virtio-fs device can operate. When using a DPU this is around
>> 40us baseline without going to a remote server (4k, QD=1).
>> This patch queues requests when the Virtio queue is full,
>> and when a completed request is taken off, immediately fills
>> it back up with queued requests.
>>
>> This reduces the 99.9th percentile latencies in our tests by
>> 60x and slightly increases the overall throughput, when using a
>> queue depth 2x the size of the Virtio queue size, with a
>> DPU-powered virtio-fs device.
>>
>> Signed-off-by: Peter-Jan Gootzen 
>> ---
>> V1 -> V2: Not scheduling dispatch work anymore when not needed
>> and changed delayed_work structs to work_struct structs
>>
>>  fs/fuse/virtio_fs.c | 32 +---
>>  1 file changed, 17 insertions(+), 15 deletions(-)
>>
>> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
>> index 4d8d4f16c727..a676297db09b 100644
>> --- a/fs/fuse/virtio_fs.c
>> +++ b/fs/fuse/virtio_fs.c
>> @@ -45,7 +45,7 @@ struct virtio_fs_vq {
>>  struct work_struct done_work;
>>  struct list_head queued_reqs;
>>  struct list_head end_reqs;  /* End these requests */
>> -struct delayed_work dispatch_work;
>> +struct work_struct dispatch_work;
>>  struct fuse_dev *fud;
>>  bool connected;
>>  long in_flight;
>> @@ -202,7 +202,7 @@ static void virtio_fs_drain_queue(struct virtio_fs_vq 
>> *fsvq)
>>  }
>>  
>>  flush_work(&fsvq->done_work);
>> -flush_delayed_work(&fsvq->dispatch_work);
>> +flush_work(&fsvq->dispatch_work);
>>  }
>>  
>>  static void virtio_fs_drain_all_queues_locked(struct virtio_fs *fs)
>> @@ -346,6 +346,9 @@ static void virtio_fs_hiprio_done_work(struct 
>> work_struct *work)
>>  dec_in_flight_req(fsvq);
>>  }
>>  } while (!virtqueue_enable_cb(vq) && likely(!virtqueue_is_broken(vq)));
>> +
>> +if (!list_empty(&fsvq->queued_reqs))
>> +schedule_work(&fsvq->dispatch_work);
>>  spin_unlock(&fsvq->lock);
>>  }
>>  
>> @@ -353,7 +356,7 @@ static void virtio_fs_request_dispatch_work(struct 
>> work_struct *work)
>>  {
>>  struct fuse_req *req;
>>  struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq,
>> - dispatch_work.work);
>> + dispatch_work);
>>  int ret;
>>  
>>  pr_debug("virtio-fs: worker %s called.\n", __func__);
>> @@ -388,8 +391,6 @@ static void virtio_fs_request_dispatch_work(struct 
>> work_struct *work)
>>  if (ret == -ENOMEM || ret == -ENOSPC) {
>>  spin_lock(&fsvq->lock);
>>  list_add_tail(&req->list, &fsvq->queued_reqs);
>> -schedule_delayed_work(&fsvq->dispatch_work,
>> -  msecs_to_jiffies(1));
> 
> Virtqueue being full is only one of the reasons for failure to queue
> the request. What if virtqueue is empty but we could not queue the
> request because lack of memory (-ENOMEM). In that case we will queue
> the request and it might not be dispatched because there is no completion.
> (Assume there is no further new request coming). That means deadlock?
> 
> Thanks
> Vivek
> 

Good catch; that will deadlock.

Is the default kernel behavior to indefinitely retry a file system
request until memory is available?



Re: [PATCH V2] virtio-fs: Improved request latencies when Virtio queue is full

2023-05-31 Thread Vivek Goyal
On Wed, May 31, 2023 at 07:10:32PM +0200, Peter-Jan Gootzen wrote:
> When the Virtio queue is full, a work item is scheduled
> to execute in 1ms that retries adding the request to the queue.
> This is a large amount of time on the scale on which a
> virtio-fs device can operate. When using a DPU this is around
> 40us baseline without going to a remote server (4k, QD=1).
> This patch queues requests when the Virtio queue is full,
> and when a completed request is taken off, immediately fills
> it back up with queued requests.
> 
> This reduces the 99.9th percentile latencies in our tests by
> 60x and slightly increases the overall throughput, when using a
> queue depth 2x the size of the Virtio queue size, with a
> DPU-powered virtio-fs device.
> 
> Signed-off-by: Peter-Jan Gootzen 
> ---
> V1 -> V2: Not scheduling dispatch work anymore when not needed
> and changed delayed_work structs to work_struct structs
> 
>  fs/fuse/virtio_fs.c | 32 +---
>  1 file changed, 17 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index 4d8d4f16c727..a676297db09b 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -45,7 +45,7 @@ struct virtio_fs_vq {
>   struct work_struct done_work;
>   struct list_head queued_reqs;
>   struct list_head end_reqs;  /* End these requests */
> - struct delayed_work dispatch_work;
> + struct work_struct dispatch_work;
>   struct fuse_dev *fud;
>   bool connected;
>   long in_flight;
> @@ -202,7 +202,7 @@ static void virtio_fs_drain_queue(struct virtio_fs_vq 
> *fsvq)
>   }
>  
>   flush_work(&fsvq->done_work);
> - flush_delayed_work(&fsvq->dispatch_work);
> + flush_work(&fsvq->dispatch_work);
>  }
>  
>  static void virtio_fs_drain_all_queues_locked(struct virtio_fs *fs)
> @@ -346,6 +346,9 @@ static void virtio_fs_hiprio_done_work(struct work_struct 
> *work)
>   dec_in_flight_req(fsvq);
>   }
>   } while (!virtqueue_enable_cb(vq) && likely(!virtqueue_is_broken(vq)));
> +
> + if (!list_empty(&fsvq->queued_reqs))
> + schedule_work(&fsvq->dispatch_work);
>   spin_unlock(&fsvq->lock);
>  }
>  
> @@ -353,7 +356,7 @@ static void virtio_fs_request_dispatch_work(struct 
> work_struct *work)
>  {
>   struct fuse_req *req;
>   struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq,
> -  dispatch_work.work);
> +  dispatch_work);
>   int ret;
>  
>   pr_debug("virtio-fs: worker %s called.\n", __func__);
> @@ -388,8 +391,6 @@ static void virtio_fs_request_dispatch_work(struct 
> work_struct *work)
>   if (ret == -ENOMEM || ret == -ENOSPC) {
>   spin_lock(&fsvq->lock);
>   list_add_tail(&req->list, &fsvq->queued_reqs);
> - schedule_delayed_work(&fsvq->dispatch_work,
> -   msecs_to_jiffies(1));

Virtqueue being full is only one of the reasons for failure to queue
the request. What if virtqueue is empty but we could not queue the
request because lack of memory (-ENOMEM). In that case we will queue
the request and it might not be dispatched because there is no completion.
(Assume there is no further new request coming). That means deadlock?

Thanks
Vivek

>   spin_unlock(&fsvq->lock);
>   return;
>   }
> @@ -436,8 +437,6 @@ static int send_forget_request(struct virtio_fs_vq *fsvq,
>   pr_debug("virtio-fs: Could not queue FORGET: err=%d. 
> Will try later\n",
>ret);
>   list_add_tail(&forget->list, &fsvq->queued_reqs);
> - schedule_delayed_work(&fsvq->dispatch_work,
> -   msecs_to_jiffies(1));
>   if (!in_flight)
>   inc_in_flight_req(fsvq);
>   /* Queue is full */
> @@ -469,7 +468,7 @@ static void virtio_fs_hiprio_dispatch_work(struct 
> work_struct *work)
>  {
>   struct virtio_fs_forget *forget;
>   struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq,
> -  dispatch_work.work);
> +  dispatch_work);
>   pr_debug("virtio-fs: worker %s called.\n", __func__);
>   while (1) {
>   spin_lock(&fsvq->lock);
> @@ -647,6 +646,11 @@ static void virtio_fs_requests_done_work(struct 
> work_struct *work)
>   virtio_fs_request_complete(req, fsvq);
>   }
>   }
> +
> + spin_lock(&fsvq->lock);
> + if (!list_empty(&fsvq->queued_reqs))
> + schedule_work(&fsvq->dispatch_work);
> + spin_unlock(&fsvq->lock);
>  }
>  
>

[PATCH V2] virtio-fs: Improved request latencies when Virtio queue is full

2023-05-31 Thread Peter-Jan Gootzen via Virtualization
When the Virtio queue is full, a work item is scheduled
to execute in 1ms that retries adding the request to the queue.
This is a large amount of time on the scale on which a
virtio-fs device can operate. When using a DPU this is around
40us baseline without going to a remote server (4k, QD=1).
This patch queues requests when the Virtio queue is full,
and when a completed request is taken off, immediately fills
it back up with queued requests.

This reduces the 99.9th percentile latencies in our tests by
60x and slightly increases the overall throughput, when using a
queue depth 2x the size of the Virtio queue size, with a
DPU-powered virtio-fs device.

Signed-off-by: Peter-Jan Gootzen 
---
V1 -> V2: Not scheduling dispatch work anymore when not needed
and changed delayed_work structs to work_struct structs

 fs/fuse/virtio_fs.c | 32 +---
 1 file changed, 17 insertions(+), 15 deletions(-)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 4d8d4f16c727..a676297db09b 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -45,7 +45,7 @@ struct virtio_fs_vq {
struct work_struct done_work;
struct list_head queued_reqs;
struct list_head end_reqs;  /* End these requests */
-   struct delayed_work dispatch_work;
+   struct work_struct dispatch_work;
struct fuse_dev *fud;
bool connected;
long in_flight;
@@ -202,7 +202,7 @@ static void virtio_fs_drain_queue(struct virtio_fs_vq *fsvq)
}
 
flush_work(&fsvq->done_work);
-   flush_delayed_work(&fsvq->dispatch_work);
+   flush_work(&fsvq->dispatch_work);
 }
 
 static void virtio_fs_drain_all_queues_locked(struct virtio_fs *fs)
@@ -346,6 +346,9 @@ static void virtio_fs_hiprio_done_work(struct work_struct 
*work)
dec_in_flight_req(fsvq);
}
} while (!virtqueue_enable_cb(vq) && likely(!virtqueue_is_broken(vq)));
+
+   if (!list_empty(&fsvq->queued_reqs))
+   schedule_work(&fsvq->dispatch_work);
spin_unlock(&fsvq->lock);
 }
 
@@ -353,7 +356,7 @@ static void virtio_fs_request_dispatch_work(struct 
work_struct *work)
 {
struct fuse_req *req;
struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq,
-dispatch_work.work);
+dispatch_work);
int ret;
 
pr_debug("virtio-fs: worker %s called.\n", __func__);
@@ -388,8 +391,6 @@ static void virtio_fs_request_dispatch_work(struct 
work_struct *work)
if (ret == -ENOMEM || ret == -ENOSPC) {
spin_lock(&fsvq->lock);
list_add_tail(&req->list, &fsvq->queued_reqs);
-   schedule_delayed_work(&fsvq->dispatch_work,
- msecs_to_jiffies(1));
spin_unlock(&fsvq->lock);
return;
}
@@ -436,8 +437,6 @@ static int send_forget_request(struct virtio_fs_vq *fsvq,
pr_debug("virtio-fs: Could not queue FORGET: err=%d. 
Will try later\n",
 ret);
list_add_tail(&forget->list, &fsvq->queued_reqs);
-   schedule_delayed_work(&fsvq->dispatch_work,
- msecs_to_jiffies(1));
if (!in_flight)
inc_in_flight_req(fsvq);
/* Queue is full */
@@ -469,7 +468,7 @@ static void virtio_fs_hiprio_dispatch_work(struct 
work_struct *work)
 {
struct virtio_fs_forget *forget;
struct virtio_fs_vq *fsvq = container_of(work, struct virtio_fs_vq,
-dispatch_work.work);
+dispatch_work);
pr_debug("virtio-fs: worker %s called.\n", __func__);
while (1) {
spin_lock(&fsvq->lock);
@@ -647,6 +646,11 @@ static void virtio_fs_requests_done_work(struct 
work_struct *work)
virtio_fs_request_complete(req, fsvq);
}
}
+
+   spin_lock(&fsvq->lock);
+   if (!list_empty(&fsvq->queued_reqs))
+   schedule_work(&fsvq->dispatch_work);
+   spin_unlock(&fsvq->lock);
 }
 
 /* Virtqueue interrupt handler */
@@ -670,12 +674,12 @@ static void virtio_fs_init_vq(struct virtio_fs_vq *fsvq, 
char *name,
 
if (vq_type == VQ_REQUEST) {
INIT_WORK(&fsvq->done_work, virtio_fs_requests_done_work);
-   INIT_DELAYED_WORK(&fsvq->dispatch_work,
- virtio_fs_request_dispatch_work);
+   INIT_WORK(&fsvq->dispatch_work,
+ virtio_fs_request_dispatch_work);
} else {
INIT_WORK(&fsvq->done_work, virtio_fs

Re: [syzbot] [kvm?] [net?] [virt?] general protection fault in vhost_work_queue

2023-05-31 Thread Mike Christie
On 5/31/23 10:15 AM, Mike Christie wrote:
>>> rcu would work for your case and for what Jason had requested.
>> Yeah, so you already have some patches?
>>
>> Do you want to send it to solve this problem?
>>
> Yeah, I'll break them out and send them later today when I can retest
> rebased patches.
> 

Just one question. Do you core vhost developers consider RCU more complex
or switching to READ_ONCE/WRITE_ONCE? I am asking because for this immediate
regression fix we could just switch to the latter like below to just fix
the crash if we think that is more simple.

I think RCU is just a little more complex/invasive because it will have the
extra synchronize_rcu calls.


diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index a92af08e7864..03fd47a22a73 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -235,7 +235,7 @@ void vhost_dev_flush(struct vhost_dev *dev)
 {
struct vhost_flush_struct flush;
 
-   if (dev->worker) {
+   if (READ_ONCE(dev->worker.vtsk)) {
init_completion(&flush.wait_event);
vhost_work_init(&flush.work, vhost_flush_work);
 
@@ -247,7 +247,9 @@ EXPORT_SYMBOL_GPL(vhost_dev_flush);
 
 void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
 {
-   if (!dev->worker)
+   struct vhost_task *vtsk = READ_ONCE(dev->worker.vtsk);
+
+   if (!vtsk)
return;
 
if (!test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) {
@@ -255,8 +257,8 @@ void vhost_work_queue(struct vhost_dev *dev, struct 
vhost_work *work)
 * sure it was not in the list.
 * test_and_set_bit() implies a memory barrier.
 */
-   llist_add(&work->node, &dev->worker->work_list);
-   wake_up_process(dev->worker->vtsk->task);
+   llist_add(&work->node, &dev->worker.work_list);
+   wake_up_process(vtsk->task);
}
 }
 EXPORT_SYMBOL_GPL(vhost_work_queue);
@@ -264,7 +266,7 @@ EXPORT_SYMBOL_GPL(vhost_work_queue);
 /* A lockless hint for busy polling code to exit the loop */
 bool vhost_has_work(struct vhost_dev *dev)
 {
-   return dev->worker && !llist_empty(&dev->worker->work_list);
+   return !llist_empty(&dev->worker.work_list);
 }
 EXPORT_SYMBOL_GPL(vhost_has_work);
 
@@ -468,7 +470,7 @@ void vhost_dev_init(struct vhost_dev *dev,
dev->umem = NULL;
dev->iotlb = NULL;
dev->mm = NULL;
-   dev->worker = NULL;
+   memset(&dev->worker, 0, sizeof(dev->worker));
dev->iov_limit = iov_limit;
dev->weight = weight;
dev->byte_weight = byte_weight;
@@ -542,46 +544,38 @@ static void vhost_detach_mm(struct vhost_dev *dev)
 
 static void vhost_worker_free(struct vhost_dev *dev)
 {
-   struct vhost_worker *worker = dev->worker;
+   struct vhost_task *vtsk = READ_ONCE(dev->worker.vtsk);
 
-   if (!worker)
+   if (!vtsk)
return;
 
-   dev->worker = NULL;
-   WARN_ON(!llist_empty(&worker->work_list));
-   vhost_task_stop(worker->vtsk);
-   kfree(worker);
+   vhost_task_stop(vtsk);
+   WARN_ON(!llist_empty(&dev->worker.work_list));
+   WRITE_ONCE(dev->worker.vtsk, NULL);
 }
 
 static int vhost_worker_create(struct vhost_dev *dev)
 {
-   struct vhost_worker *worker;
struct vhost_task *vtsk;
char name[TASK_COMM_LEN];
int ret;
 
-   worker = kzalloc(sizeof(*worker), GFP_KERNEL_ACCOUNT);
-   if (!worker)
-   return -ENOMEM;
-
-   dev->worker = worker;
-   worker->kcov_handle = kcov_common_handle();
-   init_llist_head(&worker->work_list);
+   dev->worker.kcov_handle = kcov_common_handle();
+   init_llist_head(&dev->worker.work_list);
snprintf(name, sizeof(name), "vhost-%d", current->pid);
 
-   vtsk = vhost_task_create(vhost_worker, worker, name);
+   vtsk = vhost_task_create(vhost_worker, &dev->worker, name);
if (!vtsk) {
ret = -ENOMEM;
goto free_worker;
}
 
-   worker->vtsk = vtsk;
+   WRITE_ONCE(dev->worker.vtsk, vtsk);
vhost_task_start(vtsk);
return 0;
 
 free_worker:
-   kfree(worker);
-   dev->worker = NULL;
+   WRITE_ONCE(dev->worker.vtsk, NULL);
return ret;
 }
 
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 0308638cdeee..305ec8593d46 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -154,7 +154,7 @@ struct vhost_dev {
struct vhost_virtqueue **vqs;
int nvqs;
struct eventfd_ctx *log_ctx;
-   struct vhost_worker *worker;
+   struct vhost_worker worker;
struct vhost_iotlb *umem;
struct vhost_iotlb *iotlb;
spinlock_t iotlb_lock;
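
For comparison, a rough sketch of the RCU variant referred to above; this is
an illustration of the trade-off, not a tested patch, and it assumes
dev->worker stays a pointer (annotated __rcu) instead of being embedded as in
the diff above:

void vhost_work_queue(struct vhost_dev *dev, struct vhost_work *work)
{
        struct vhost_worker *worker;

        rcu_read_lock();
        worker = rcu_dereference(dev->worker);
        if (worker && !test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) {
                /* test_and_set_bit() implies a memory barrier, as before. */
                llist_add(&work->node, &worker->work_list);
                wake_up_process(worker->vtsk->task);
        }
        rcu_read_unlock();
}

static void vhost_worker_free(struct vhost_dev *dev)
{
        struct vhost_worker *worker =
                rcu_dereference_protected(dev->worker, true);

        if (!worker)
                return;

        rcu_assign_pointer(dev->worker, NULL);
        /* The extra cost mentioned above: wait out concurrent queuers. */
        synchronize_rcu();
        vhost_task_stop(worker->vtsk);
        WARN_ON(!llist_empty(&worker->work_list));
        kfree(worker);
}

The READ_ONCE/WRITE_ONCE version above avoids the synchronize_rcu() by keeping
the worker embedded in struct vhost_dev, so nothing is freed while a concurrent
vhost_work_queue() may still be dereferencing it.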



Re: [PATCH] scsi: virtio_scsi: Remove a useless function call

2023-05-31 Thread Martin K. Petersen


Christophe,

> 'inq_result' is known to be NULL. There is no point calling kfree().

Applied to 6.5/scsi-staging, thanks!

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [syzbot] [kvm?] [net?] [virt?] general protection fault in vhost_work_queue

2023-05-31 Thread Mike Christie
On 5/31/23 2:27 AM, Stefano Garzarella wrote:
> On Tue, May 30, 2023 at 6:30 PM  wrote:
>>
>> On 5/30/23 11:17 AM, Stefano Garzarella wrote:
>>> On Tue, May 30, 2023 at 11:09:09AM -0500, Mike Christie wrote:
 On 5/30/23 11:00 AM, Stefano Garzarella wrote:
> I think it is partially related to commit 6e890c5d5021 ("vhost: use
> vhost_tasks for worker threads") and commit 1a5f8090c6de ("vhost: move
> worker thread fields to new struct"). Maybe that commits just
> highlighted the issue and it was already existing.

 See my mail about the crash. Agree with your analysis about worker->vtsk
 not being set yet. It's a bug from my commit where I should have not set
 it so early or I should be checking for

 if (dev->worker && worker->vtsk)

 instead of

 if (dev->worker)
>>>
>>> Yes, though, in my opinion the problem may persist depending on how the
>>> instructions are reordered.
>>
>> Ah ok.
>>
>>>
>>> Should we protect dev->worker with RCU to be safe?
>>
>> For those multiple worker patchsets Jason had asked me about supporting
>> where we don't have a worker while we are swapping workers around. To do
>> that I had added rcu around the dev->worker. I removed it in later patchsets
>> because I didn't think anyone would use it.
>>
>> rcu would work for your case and for what Jason had requested.
> 
> Yeah, so you already have some patches?
> 
> Do you want to send it to solve this problem?
> 

Yeah, I'll break them out and send them later today when I can retest
rebased patches.


Re: [RFC PATCH v2 3/3] PCI: endpoint: Add EP function driver to provide virtio-console functionality

2023-05-31 Thread Shunsuke Mie

I'm sorry for the late response.

On 2023/05/19 11:01, Jason Wang wrote:

On Thu, May 18, 2023 at 5:54 PM Shunsuke Mie  wrote:

Gentle ping ...


Thanks,

Shunsuke.

On 2023/05/10 12:17, Shunsuke Mie wrote:

Hi Jason,
On Mon, May 8, 2023 at 13:03, Jason Wang wrote:

On Thu, Apr 27, 2023 at 6:44 PM Shunsuke Mie  wrote:

Add a new PCIe endpoint function driver that works as a pci virtio-console
device. The console connects to the endpoint-side console. It enables
communication between the PCIe host and the endpoint.

The architecture is as follows:

   ┌─────────────┐        ┌─────────────────┐   ┌─────────────┐
   │ virtio      │        │ (virtio console │   │ virtio      │
   │ console drv │        │  device)        │   │ console drv │
   ├─────────────┤        ├─────────────────┤   ├─────────────┤
   │ virtio bus  │        │ pci ep virtio   │◄─►│ virtio bus  │
   ├─────────────┤        │ console drv     │   └─────────────┘
   │ pci bus     │  pcie  ├─────────────────┤
   │             │ ◄────► │ pci ep bus      │
   └─────────────┘        └─────────────────┘
     PCIe Root                PCIe Endpoint


I think it might only work for peer devices like:

net, console or vsock.

Could you tell me what "peer devices" means?

I meant, for example we know in the case of virtio-net, TX can talk
with RX belonging to another device directly.

But this is not the case for other devices like virito-blk.

Thank you. I comprehended it.

So there're many choices here, I'd like to know what's the reason for
you to implement a mediation.

An alternative is to implement a dedicated net, console and vsock
driver for vringh (CAIF somehow works like this). This would have
better performance.

Does it mean that the driver also functions as a network driver directly?

I meant, e.g in the case of networking, you can have a dedicated
driver with two vringh in the endpoint side.

The benefit is the performance, no need for the (datapath) mediation.

But if we don't care about the performance, this proposal seems to be fine.

Thanks
I agree with you.  The design you suggested is better in terms of 
performance.
However, the proposed design is not bad, for the following reasons, I
think.


The proposed design has one more operation in the control plane because the
data steps over the virtio-net driver, but the number of copies in the data
plane remains the same. I think the operation added in the control plane has
only a small effect on performance.

Moreover, there are some advantages when the data steps over the virtio-net
driver. We can make use of its optimizations and some functions without
modification or reimplementation, e.g. ethtool and XDP (BPF) support.

Any comments would be appreciated.


This driver has two roles. The first is as a PCIe endpoint virtio console
function, which is implemented using the PCIe endpoint framework and PCIe
EP virtio helpers. The second is as a virtual virtio console device
connected to the virtio bus on PCIe endpoint Linux.

Communication between the two is achieved by copying the virtqueue data
between PCIe root and endpoint, respectively.

This is a simple implementation and does not include features of
virtio-console such as MULTIPORT, EMERG_WRITE, etc. As a result, each
virtio console driver only displays /dev/hvc0.

As an example of usage, by setting getty to /dev/hvc0, it is possible to
login to another host.

Signed-off-by: Shunsuke Mie 
---
Changes from v2:
- Change to use copy functions between kiovs of pci-epf-virtio.

   drivers/pci/endpoint/functions/Kconfig|  12 +
   drivers/pci/endpoint/functions/Makefile   |   1 +
   drivers/pci/endpoint/functions/pci-epf-vcon.c | 596 ++
   3 files changed, 609 insertions(+)
   create mode 100644 drivers/pci/endpoint/functions/pci-epf-vcon.c

diff --git a/drivers/pci/endpoint/functions/Kconfig 
b/drivers/pci/endpoint/functions/Kconfig
index fa1a6a569a8f..9ce2698b67e1 100644
--- a/drivers/pci/endpoint/functions/Kconfig
+++ b/drivers/pci/endpoint/functions/Kconfig
@@ -44,3 +44,15 @@ config PCI_EPF_VIRTIO
  select VHOST_RING_IOMEM
  help
Helpers to implement PCI virtio Endpoint function
+
+config PCI_EPF_VCON
+   tristate "PCI Endpoint virtio-console driver"
+   depends on PCI_ENDPOINT
+   select VHOST_RING
+   select PCI_EPF_VIRTIO
+   help
+ PCIe Endpoint virtio-console function implementation. This module
+ enables showing the virtio-console as a pci device to the PCIe host
+ side, and another virtual virtio-console device registers to the
+ endpoint system. Those devices are connected virtually and can
+ communicate with each other.
+
diff --git a/drivers/pci/endpoint/functions/Makefile 
b/drivers/pci/endpoint/functions/Makefile
index a96f127ce900..b4056689ce33 100644
--- a/drivers/pci/endpoint/functions/Makefile
+++ b/driver

Re: [PATCH] virtio_ring: validate used buffer length

2023-05-31 Thread Michael S. Tsirkin
On Wed, May 31, 2023 at 04:26:38PM +0800, Jason Wang wrote:
> On Wed, May 31, 2023 at 3:36 PM Jason Wang  wrote:
> >
> > On Wed, May 31, 2023 at 1:50 PM Michael S. Tsirkin  wrote:
> > >
> > > On Wed, May 31, 2023 at 09:05:00AM +0800, Jason Wang wrote:
> > > > On Mon, May 29, 2023 at 6:03 PM Michael S. Tsirkin  
> > > > wrote:
> > > > >
> > > > > On Mon, May 29, 2023 at 09:18:10AM +0800, Jason Wang wrote:
> > > > > > On Sun, May 28, 2023 at 3:57 PM Michael S. Tsirkin 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Fri, May 26, 2023 at 02:30:41PM +0800, Jason Wang wrote:
> > > > > > > > This patch validate
> > > > > > >
> > > > > > > validates
> > > > > > >
> > > > > > > > the used buffer length provided by the device
> > > > > > > > before trying to use it.
> > > > > > >
> > > > > > > before returning it to caller
> > > > > > >
> > > > > > > > This is done by remembering the in buffer
> > > > > > > > length in a dedicated array during virtqueue_add(), then we can 
> > > > > > > > fail
> > > > > > > > the virtqueue_get_buf() when we find the device is trying to 
> > > > > > > > give us a
> > > > > > > > used buffer length which is greater than we stored before.
> > > > > > >
> > > > > > > than what we stored
> > > > > > >
> > > > > > > >
> > > > > > > > This validation is disable
> > > > > > >
> > > > > > > disabled
> > > > > > >
> > > > > > > > by default via module parameter to unbreak
> > > > > > > > some existing devices since some legacy devices are known to 
> > > > > > > > report
> > > > > > > > buggy used length.
> > > > > > > >
> > > > > > > > Signed-off-by: Jason Wang 
> > > > > > >
> > > > > > > First I'm not merging this without more data about
> > > > > > > what is known to be broken and what is known to work well
> > > > > > > in the commit log. And how exactly do things work if used length
> > > > > > > is wrong?
> > > > > >
> > > > > > Assuming the device is malicious, it would be very hard to answer.
> > > > > > Auditing and fuzzing won't cover every case. Instead of trying to 
> > > > > > seek
> > > > > > the answer, we can simply make sure the used in buffer length is
> > > > > > validated then we know we're fine or not.
> > > > >
> > > > > To restate the question, you said above "some legacy devices are known
> > > > > to report buggy used length". If they report buggy length then how
> > > > > can things work?
> > > >
> > > > The validation is disabled for legacy device (as stated in the 
> > > > changelog):
> > > >
> > > > static bool vring_needs_used_validation(const struct virtio_device 
> > > > *vdev)
> > > > {
> > > > /*
> > > >  * Several legacy devices are known to produce buggy used
> > > >  * length. In order to let driver work, we won't validate used
> > > >  * buffer length in this case.
> > > >  */
> > > > if (!virtio_has_feature(vdev, VIRTIO_F_VERSION_1))
> > > > return false;
> > > > if (force_used_validation)
> > > > return true;
> > > > return false;
> > > > }
> > > >
> > > > This seems to be what we've agreed in last version:
> > > >
> > > > https://lore.kernel.org/all/canlsykxfhamuu0bb4j7y6n4_g9odkxlcjxxgxex4sj6_kf+...@mail.gmail.com/T/#m31f3b06f9032beec175c312dfa2532cb08b15c56
> > > >
> > > > Thanks
> > > >
> > >
> > > I don't get it. You wrote:
> > >
> > > This validation is disable
> > > by default via module parameter to unbreak
> > > some existing devices since some legacy devices are known to 
> > > report
> > > buggy used length.
> > >
> > > which devices?
> >
> > legacy rpmsg and vsock device (before 49d8c5ffad07) at least.
> >
> > > why do you need a module parameter?
> >
> > If we enable it unconditionally for modern devices, it may break some
> > buggy modern device (vsock without a fix as an example).
> >
> > >
> > >
> > > > >
> > > > > > > Second what's wrong with dma_desc_extra that we already maintain?
> > > > > > > Third motivation - it's part and parcel of the hardening effort 
> > > > > > > yes?
> > > > > >
> > > > > > They are different. dma_desc_extra is for a descriptor ring, but 
> > > > > > this
> > > > > > is for a used ring. Technically we can go back to iterate on the
> > > > > > descriptor ring for a legal used in buffer length. But it will have
> > > > > > worse performance.
> > > > >
> > > > > I don't really understand. We already iterate when we unmap -
> > > > > all that is necessary is to subtract it from used length, if at
> > > > > the end of the process it is >0 then we know used length is too
> > > > > large.
> > > >
> > > > Yes, but it is the job that is done in the driver level not the virtio
> > > > core.
> > >
> > > What job?
> >
> > I meant the driver can do the validation since it has the knowledge of
> > the buffer length if it wants.
> >
> > > unmap is done in detach_buf_split and detach_buf_packed respectively.
> > > vring_desc_extra isn't even visible outside drivers/virtio/virtio_ring.c
> >
> > 

Re: [PATCH] virtio_ring: validate used buffer length

2023-05-31 Thread Michael S. Tsirkin
On Wed, May 31, 2023 at 03:36:51PM +0800, Jason Wang wrote:
> On Wed, May 31, 2023 at 1:50 PM Michael S. Tsirkin  wrote:
> >
> > On Wed, May 31, 2023 at 09:05:00AM +0800, Jason Wang wrote:
> > > On Mon, May 29, 2023 at 6:03 PM Michael S. Tsirkin  
> > > wrote:
> > > >
> > > > On Mon, May 29, 2023 at 09:18:10AM +0800, Jason Wang wrote:
> > > > > On Sun, May 28, 2023 at 3:57 PM Michael S. Tsirkin  
> > > > > wrote:
> > > > > >
> > > > > > On Fri, May 26, 2023 at 02:30:41PM +0800, Jason Wang wrote:
> > > > > > > This patch validate
> > > > > >
> > > > > > validates
> > > > > >
> > > > > > > the used buffer length provided by the device
> > > > > > > before trying to use it.
> > > > > >
> > > > > > before returning it to caller
> > > > > >
> > > > > > > This is done by remembering the in buffer
> > > > > > > length in a dedicated array during virtqueue_add(), then we can 
> > > > > > > fail
> > > > > > > the virtqueue_get_buf() when we find the device is trying to give 
> > > > > > > us a
> > > > > > > used buffer length which is greater than we stored before.
> > > > > >
> > > > > > than what we stored
> > > > > >
> > > > > > >
> > > > > > > This validation is disable
> > > > > >
> > > > > > disabled
> > > > > >
> > > > > > > by default via module parameter to unbreak
> > > > > > > some existing devices since some legacy devices are known to 
> > > > > > > report
> > > > > > > buggy used length.
> > > > > > >
> > > > > > > Signed-off-by: Jason Wang 
> > > > > >
> > > > > > First I'm not merging this without more data about
> > > > > > what is known to be broken and what is known to work well
> > > > > > in the commit log. And how exactly do things work if used length
> > > > > > is wrong?
> > > > >
> > > > > Assuming the device is malicious, it would be very hard to answer.
> > > > > Auditing and fuzzing won't cover every case. Instead of trying to seek
> > > > > the answer, we can simply make sure the used in buffer length is
> > > > > validated then we know we're fine or not.
> > > >
> > > > To restate the question, you said above "some legacy devices are known
> > > > to report buggy used length". If they report buggy length then how
> > > > can things work?
> > >
> > > The validation is disabled for legacy device (as stated in the changelog):
> > >
> > > static bool vring_needs_used_validation(const struct virtio_device *vdev)
> > > {
> > > /*
> > >  * Several legacy devices are known to produce buggy used
> > >  * length. In order to let driver work, we won't validate used
> > >  * buffer length in this case.
> > >  */
> > > if (!virtio_has_feature(vdev, VIRTIO_F_VERSION_1))
> > > return false;
> > > if (force_used_validation)
> > > return true;
> > > return false;
> > > }
> > >
> > > This seems to be what we agreed on in the last version:
> > >
> > > https://lore.kernel.org/all/canlsykxfhamuu0bb4j7y6n4_g9odkxlcjxxgxex4sj6_kf+...@mail.gmail.com/T/#m31f3b06f9032beec175c312dfa2532cb08b15c56
> > >
> > > Thanks
> > >
> >
> > I don't get it. You wrote:
> >
> > This validation is disable
> > by default via module parameter to unbreak
> > some existing devices since some legacy devices are known to report
> > buggy used length.
> >
> > which devices?
> 
> legacy rpmsg and vsock device (before 49d8c5ffad07) at least.
> 
> > why do you need a module parameter?
> 
> If we enable it unconditionally for modern devices, it may break some
> buggy modern devices (vsock without a fix as an example).

Presumably this happens because vsock does not have any inbuf at all
so it ignores the length.
We had the same issue with virtio net tx a long time ago.

My suggestion is then not to fail - just cap the length at the dma length
set by the driver. Another idea is that if the dma len is 0 then don't validate
at all - a driver that did not add any inbufs is not going to look
at the length.

Allowing the caller to pass NULL as the length and skipping validation
if len == 0 might also be a good idea.
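
For illustration, a minimal sketch of that capping idea (buflen[] is an assumed
per-head array recorded at virtqueue_add() time; this is not the actual patch):

#include <linux/minmax.h>
#include <linux/types.h>

/* Sketch only: cap the device-reported used length instead of failing. */
static u32 cap_used_len(const u32 *buflen, unsigned int head, u32 used_len)
{
        u32 max = buflen[head];

        /* dma len is 0: no inbufs were added, driver won't look at the length. */
        if (!max)
                return used_len;

        /* Otherwise cap at the in-buffer length the driver provided. */
        return min(used_len, max);
}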


> >
> >
> > > >
> > > > > > Second what's wrong with dma_desc_extra that we already maintain?
> > > > > > Third motivation - it's part and parcel of the hardening effort yes?
> > > > >
> > > > > They are different. dma_desc_extra is for a descriptor ring, but this
> > > > > is for a used ring. Technically we can go back to iterate on the
> > > > > descriptor ring for a legal used in buffer length. But it will have
> > > > > worse performance.
> > > >
> > > > I don't really understand. We already iterate when we unmap -
> > > > all that is necessary is to subtract it from used length, if at
> > > > the end of the process it is >0 then we know used length is too
> > > > large.
> > >
> > > Yes, but it is the job that is done in the driver level not the virtio
> > > core.
> >
> > What job?
> 
> I meant the driver can do the validation since it has the knowledge of
> the buffer length if it wants.

It does not necessarily have it - not if virtio is doing 

Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-31 Thread Oleg Nesterov
On 05/31, Jason Wang wrote:
>
> On Wed, May 31, 2023 at 3:25 PM Oleg Nesterov  wrote:
> >
> > On 05/31, Jason Wang wrote:
> > >
> > > 在 2023/5/23 20:15, Oleg Nesterov 写道:
> > > >
> > > > /* make sure flag is seen after deletion */
> > > > smp_wmb();
> > > > llist_for_each_entry_safe(work, work_next, node, node) {
> > > > clear_bit(VHOST_WORK_QUEUED, &work->flags);
> > > >
> > > >I am not sure about smp_wmb + clear_bit. Once we clear VHOST_WORK_QUEUED,
> > > >vhost_work_queue() can add this work again and change work->node->next.
> > > >
> > > >That is why we use _safe, but we need to ensure that llist_for_each_safe()
> > > >completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared.
> > >
> > > This should be fine since store is not speculated, so work->node->next needs
> > > to be loaded before VHOST_WORK_QUEUED is cleared to meet the loop condition.
> >
> > I don't understand you. OK, to simplify, suppose we have 2 global vars
> >
> > void *PTR = something_non_null;
> > unsigned long FLAGS = -1ul;
> >
> > Now I think this code
> >
> > CPU_0                                   CPU_1
> >
> > void *ptr = PTR;                        if (!test_and_set_bit(0, FLAGS))
> > clear_bit(0, FLAGS);                            PTR = NULL;
> > BUG_ON(!ptr);
> >
> > is racy and can hit the BUG_ON(!ptr).
>
> This seems different to the above case?

not sure,

> And you can hit BUG_ON with
> the following execution sequence:
>
> [cpu 0] clear_bit(0, FLAGS);
> [cpu 1] if (!test_and_set_bit(0, FLAGS))
> [cpu 1] PTR = NULL;
> [cpu 0] BUG_ON(!ptr)

I don't understand this part... yes, we can hit this BUG_ON() without mb in
between, this is what I tried to say.

> In vhost code, there's a condition before the clear_bit() which sits
> inside llist_for_each_entry_safe():
>
> #define llist_for_each_entry_safe(pos, n, node, member)                        \
> for (pos = llist_entry((node), typeof(*pos), member);                          \
>      member_address_is_nonnull(pos, member) &&                                 \
>         (n = llist_entry(pos->member.next, typeof(*n), member), true);         \
>      pos = n)
>
> The clear_bit() is a store which is not speculated, so there's a
> control dependency, the store can't be executed until the condition
> expression is evaluated which requires pos->member.next
> (work->node.next) to be loaded.

But llist_for_each_entry_safe() doesn't check "n"; I mean, it is not that we
have something like

n = llist_entry(...);
if (n)
        clear_bit(...);

so I do not see how we can rely on the load-store control dependency.

Oleg.


Re: [PATCH] virtio_ring: validate used buffer length

2023-05-31 Thread Jason Wang
On Wed, May 31, 2023 at 3:36 PM Jason Wang  wrote:
>
> On Wed, May 31, 2023 at 1:50 PM Michael S. Tsirkin  wrote:
> >
> > On Wed, May 31, 2023 at 09:05:00AM +0800, Jason Wang wrote:
> > > On Mon, May 29, 2023 at 6:03 PM Michael S. Tsirkin  
> > > wrote:
> > > >
> > > > On Mon, May 29, 2023 at 09:18:10AM +0800, Jason Wang wrote:
> > > > > On Sun, May 28, 2023 at 3:57 PM Michael S. Tsirkin  
> > > > > wrote:
> > > > > >
> > > > > > On Fri, May 26, 2023 at 02:30:41PM +0800, Jason Wang wrote:
> > > > > > > This patch validate
> > > > > >
> > > > > > validates
> > > > > >
> > > > > > > the used buffer length provided by the device
> > > > > > > before trying to use it.
> > > > > >
> > > > > > before returning it to caller
> > > > > >
> > > > > > > This is done by remembering the in buffer
> > > > > > > length in a dedicated array during virtqueue_add(), then we can 
> > > > > > > fail
> > > > > > > the virtqueue_get_buf() when we find the device is trying to give 
> > > > > > > us a
> > > > > > > used buffer length which is greater than we stored before.
> > > > > >
> > > > > > than what we stored
> > > > > >
> > > > > > >
> > > > > > > This validation is disable
> > > > > >
> > > > > > disabled
> > > > > >
> > > > > > > by default via module parameter to unbreak
> > > > > > > some existing devices since some legacy devices are known to 
> > > > > > > report
> > > > > > > buggy used length.
> > > > > > >
> > > > > > > Signed-off-by: Jason Wang 
> > > > > >
> > > > > > First I'm not merging this without more data about
> > > > > > what is known to be broken and what is known to work well
> > > > > > in the commit log. And how exactly do things work if used length
> > > > > > is wrong?
> > > > >
> > > > > Assuming the device is malicious, it would be very hard to answer.
> > > > > Auditing and fuzzing won't cover every case. Instead of trying to seek
> > > > > the answer, we can simply make sure the used in buffer length is
> > > > > validated then we know we're fine or not.
> > > >
> > > > To restate the question, you said above "some legacy devices are known
> > > > to report buggy used length". If they report buggy length then how
> > > > can things work?
> > >
> > > The validation is disabled for legacy device (as stated in the changelog):
> > >
> > > static bool vring_needs_used_validation(const struct virtio_device *vdev)
> > > {
> > > /*
> > >  * Several legacy devices are known to produce buggy used
> > >  * length. In order to let driver work, we won't validate used
> > >  * buffer length in this case.
> > >  */
> > > if (!virtio_has_feature(vdev, VIRTIO_F_VERSION_1))
> > > return false;
> > > if (force_used_validation)
> > > return true;
> > > return false;
> > > }
> > >
> > > This seems to be what we agreed on in the last version:
> > >
> > > https://lore.kernel.org/all/canlsykxfhamuu0bb4j7y6n4_g9odkxlcjxxgxex4sj6_kf+...@mail.gmail.com/T/#m31f3b06f9032beec175c312dfa2532cb08b15c56
> > >
> > > Thanks
> > >
> >
> > I don't get it. You wrote:
> >
> > This validation is disable
> > by default via module parameter to unbreak
> > some existing devices since some legacy devices are known to report
> > buggy used length.
> >
> > which devices?
>
> legacy rpmsg and vsock device (before 49d8c5ffad07) at least.
>
> > why do you need a module parameter?
>
> If we enable it unconditionally for modern devices, it may break some
> buggy modern devices (vsock without a fix as an example).
>
> >
> >
> > > >
> > > > > > Second what's wrong with dma_desc_extra that we already maintain?
> > > > > > Third motivation - it's part and parcel of the hardening effort yes?
> > > > >
> > > > > They are different. dma_desc_extra is for a descriptor ring, but this
> > > > > is for a used ring. Technically we can go back to iterate on the
> > > > > descriptor ring for a legal used in buffer length. But it will have
> > > > > worse performance.
> > > >
> > > > I don't really understand. We already iterate when we unmap -
> > > > all that is necessary is to subtract it from used length, if at
> > > > the end of the process it is >0 then we know used length is too
> > > > large.
> > >
> > > Yes, but it is the job that is done in the driver level not the virtio
> > > core.
> >
> > What job?
>
> I meant the driver can do the validation since it has the knowledge of
> the buffer length if it wants.
>
> > unmap is done in detach_buf_split and detach_buf_packed respectively.
> > vring_desc_extra isn't even visible outside drivers/virtio/virtio_ring.c
>
> desc_extra doesn't contain the buffer length in the case of indirect
> descriptors. So we would need to iterate over the descriptors, which looks
> expensive if we don't need to unmap.
>
> Thanks
>
> >
> > For drivers that do unmap at driver level - I guess they can do
> > validation there too.
> >
> > > Validation in virtio core is still necessary since they're
> > > working at diffe

Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-31 Thread Jason Wang
On Wed, May 31, 2023 at 3:25 PM Oleg Nesterov  wrote:
>
> On 05/31, Jason Wang wrote:
> >
> > 在 2023/5/23 20:15, Oleg Nesterov 写道:
> > >
> > > /* make sure flag is seen after deletion */
> > > smp_wmb();
> > > llist_for_each_entry_safe(work, work_next, node, node) {
> > > clear_bit(VHOST_WORK_QUEUED, &work->flags);
> > >
> > >I am not sure about smp_wmb + clear_bit. Once we clear VHOST_WORK_QUEUED,
> > >vhost_work_queue() can add this work again and change work->node->next.
> > >
> > >That is why we use _safe, but we need to ensure that llist_for_each_safe()
> > >completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared.
> >
> > This should be fine since store is not speculated, so work->node->next needs
> > to be loaded before VHOST_WORK_QUEUED is cleared to meet the loop condition.
>
> I don't understand you. OK, to simplify, suppose we have 2 global vars
>
> void *PTR = something_non_null;
> unsigned long FLAGS = -1ul;
>
> Now I think this code
>
> CPU_0                                   CPU_1
>
> void *ptr = PTR;                        if (!test_and_set_bit(0, FLAGS))
> clear_bit(0, FLAGS);                            PTR = NULL;
> BUG_ON(!ptr);
>
> is racy and can hit the BUG_ON(!ptr).

This seems different from the above case? And you can hit BUG_ON with
the following execution sequence:

[cpu 0] clear_bit(0, FLAGS);
[cpu 1] if (!test_and_set_bit(0, FLAGS))
[cpu 1] PTR = NULL;
[cpu 0] BUG_ON(!ptr)

In vhost code, there's a condition before the clear_bit() which sits
inside llist_for_each_entry_safe():

#define llist_for_each_entry_safe(pos, n, node, member)\
for (pos = llist_entry((node), typeof(*pos), member);  \
 member_address_is_nonnull(pos, member) && \
(n = llist_entry(pos->member.next, typeof(*n), member), true); \
 pos = n)

The clear_bit() is a store, which is not speculated, so there's a
control dependency: the store can't be executed until the condition
expression is evaluated, which requires pos->member.next
(work->node.next) to be loaded.

>
> I guess it is fine on x86, but in general you need smp_mb__before_atomic()
> before clear_bit(), or clear_bit_unlock().
>
> > > __set_current_state(TASK_RUNNING);
> > >
> > >Why do we set TASK_RUNNING inside the loop? Does this mean that work->fn()
> > >can return with current->state != RUNNING ?
> >
> > It is because the state was set to TASK_INTERRUPTIBLE at the beginning of
> > the loop; otherwise it might have side effects while executing work->fn().
>
> Again, I don't understand you. So let me repeat: can work->fn() return with
> current->_state != TASK_RUNNING ? If not (and I'd say it should not), you can
> do __set_current_state(TASK_RUNNING) once, before llist_for_each_entry_safe().
>

Ok, that should be fine.

Thanks


> > >Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right
> > >before we call work->fn(). Is it "safe" to run this callback with
> > >signal_pending() or fatal_signal_pending() ?
> >
> > It looks safe since:
> >
> > 1) vhost holds a refcnt of the mm
> > 2) release will sync with the worker
>
> Well, that's not what I asked... nevermind, please forget.
>
> Thanks.
>
> Oleg.
>


Re: [PATCH] virtio_ring: validate used buffer length

2023-05-31 Thread Jason Wang
On Wed, May 31, 2023 at 1:50 PM Michael S. Tsirkin  wrote:
>
> On Wed, May 31, 2023 at 09:05:00AM +0800, Jason Wang wrote:
> > On Mon, May 29, 2023 at 6:03 PM Michael S. Tsirkin  wrote:
> > >
> > > On Mon, May 29, 2023 at 09:18:10AM +0800, Jason Wang wrote:
> > > > On Sun, May 28, 2023 at 3:57 PM Michael S. Tsirkin  
> > > > wrote:
> > > > >
> > > > > On Fri, May 26, 2023 at 02:30:41PM +0800, Jason Wang wrote:
> > > > > > This patch validate
> > > > >
> > > > > validates
> > > > >
> > > > > > the used buffer length provided by the device
> > > > > > before trying to use it.
> > > > >
> > > > > before returning it to caller
> > > > >
> > > > > > This is done by remembering the in buffer
> > > > > > length in a dedicated array during virtqueue_add(), then we can fail
> > > > > > the virtqueue_get_buf() when we find the device is trying to give 
> > > > > > us a
> > > > > > used buffer length which is greater than we stored before.
> > > > >
> > > > > than what we stored
> > > > >
> > > > > >
> > > > > > This validation is disable
> > > > >
> > > > > disabled
> > > > >
> > > > > > by default via module parameter to unbreak
> > > > > > some existing devices since some legacy devices are known to report
> > > > > > buggy used length.
> > > > > >
> > > > > > Signed-off-by: Jason Wang 
> > > > >
> > > > > First I'm not merging this without more data about
> > > > > what is known to be broken and what is known to work well
> > > > > in the commit log. And how exactly do things work if used length
> > > > > is wrong?
> > > >
> > > > Assuming the device is malicious, it would be very hard to answer.
> > > > Auditing and fuzzing won't cover every case. Instead of trying to seek
> > > > the answer, we can simply make sure the used in buffer length is
> > > > validated then we know we're fine or not.
> > >
> > > To restate the question, you said above "some legacy devices are known
> > > to report buggy used length". If they report buggy length then how
> > > can things work?
> >
> > The validation is disabled for legacy device (as stated in the changelog):
> >
> > static bool vring_needs_used_validation(const struct virtio_device *vdev)
> > {
> > /*
> >  * Several legacy devices are known to produce buggy used
> >  * length. In order to let driver work, we won't validate used
> >  * buffer length in this case.
> >  */
> > if (!virtio_has_feature(vdev, VIRTIO_F_VERSION_1))
> > return false;
> > if (force_used_validation)
> > return true;
> > return false;
> > }
> >
> > This seems to be what we agreed on in the last version:
> >
> > https://lore.kernel.org/all/canlsykxfhamuu0bb4j7y6n4_g9odkxlcjxxgxex4sj6_kf+...@mail.gmail.com/T/#m31f3b06f9032beec175c312dfa2532cb08b15c56
> >
> > Thanks
> >
>
> I don't get it. You wrote:
>
> This validation is disable
> by default via module parameter to unbreak
> some existing devices since some legacy devices are known to report
> buggy used length.
>
> which devices?

legacy rpmsg and vsock device (before 49d8c5ffad07) at least.

> why do you need a module parameter?

If we enable it unconditionally for modern devices, it may break some
buggy modern devices (vsock without a fix as an example).
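
For reference, a minimal sketch of how such an opt-in knob can be exposed (the
module-parameter plumbing below is an assumption; only the force_used_validation
name comes from the snippet quoted earlier):

#include <linux/module.h>

/* Sketch only: off by default so known-buggy devices keep working. */
static bool force_used_validation;
module_param(force_used_validation, bool, 0444);
MODULE_PARM_DESC(force_used_validation,
                 "Fail virtqueue_get_buf() when the device reports a used length larger than what was added");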

>
>
> > >
> > > > > Second what's wrong with dma_desc_extra that we already maintain?
> > > > > Third motivation - it's part and parcel of the hardening effort yes?
> > > >
> > > > They are different. dma_desc_extra is for a descriptor ring, but this
> > > > is for a used ring. Technically we can go back to iterate on the
> > > > descriptor ring for a legal used in buffer length. But it will have
> > > > worse performance.
> > >
> > > I don't really understand. We already iterate when we unmap -
> > > all that is necessary is to subtract it from used length, if at
> > > the end of the process it is >0 then we know used length is too
> > > large.
> >
> > Yes, but it is the job that is done in the driver level not the virtio
> > core.
>
> What job?

I meant the driver can do the validation itself if it wants, since it
knows the buffer length.
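
As a rough illustration, a driver that knows how much it queued could clamp a
bogus used length itself (MY_DRIVER_BUF_SIZE is a hypothetical per-driver limit,
not anything from this patch):

        unsigned int len;
        void *buf = virtqueue_get_buf(vq, &len);

        if (buf && len > MY_DRIVER_BUF_SIZE) {
                /* The device reported more than this driver ever queued. */
                dev_warn(&vq->vdev->dev, "bogus used length %u\n", len);
                len = MY_DRIVER_BUF_SIZE;       /* or drop the buffer */
        }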

> unmap is done in detach_buf_split and detach_buf_packed respectively.
> vring_desc_extra isn't even visible outside drivers/virtio/virtio_ring.c

desc_extra doesn't contain the buffer length in the case of indirect
descriptors. So we would need to iterate over the descriptors, which looks
expensive if we don't need to unmap.
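
For context, a rough sketch of the dedicated-array bookkeeping the commit message
describes (field names such as buflen and validate_used are assumptions, not the
actual patch):

        /* In virtqueue_add(), once the in buffers for this head are known: */
        vq->buflen[head] = in_len;              /* total in-buffer length */

        /* In virtqueue_get_buf_ctx(), before handing *len back to the caller: */
        if (vq->validate_used && unlikely(*len > vq->buflen[head])) {
                dev_err(&vq->vq.vdev->dev,
                        "used len %u is larger than in buflen %u\n",
                        *len, vq->buflen[head]);
                return NULL;
        }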

Thanks

>
> For drivers that do unmap at driver level - I guess they can do
> validation there too.
>
> > Validation in the virtio core is still necessary since the core and the
> > drivers work at different levels, and it's hard to force validation in
> > all drivers through code. The last version introduced a
> > suppress_driver_validation flag to let a driver suppress the core
> > validation, which did not seem good; we need a way to force the
> > virtio_ring code to do the validation first.
>
> Why do we? If driver validates length virtio_ring 

Re: [syzbot] [kvm?] [net?] [virt?] general protection fault in vhost_work_queue

2023-05-31 Thread Stefano Garzarella
On Tue, May 30, 2023 at 6:30 PM  wrote:
>
> On 5/30/23 11:17 AM, Stefano Garzarella wrote:
> > On Tue, May 30, 2023 at 11:09:09AM -0500, Mike Christie wrote:
> >> On 5/30/23 11:00 AM, Stefano Garzarella wrote:
> >>> I think it is partially related to commit 6e890c5d5021 ("vhost: use
> >>> vhost_tasks for worker threads") and commit 1a5f8090c6de ("vhost: move
> >>> worker thread fields to new struct"). Maybe those commits just
> >>> highlighted an issue that already existed.
> >>
> >> See my mail about the crash. Agree with your analysis about worker->vtsk
> >> not being set yet. It's a bug from my commit where I should not have set
> >> it so early or I should be checking for
> >>
> >> if (dev->worker && worker->vtsk)
> >>
> >> instead of
> >>
> >> if (dev->worker)
> >
> > Yes, though, in my opinion the problem may persist depending on how the
> > instructions are reordered.
>
> Ah ok.
>
> >
> > Should we protect dev->worker() with an RCU to be safe?
>
> For the multiple-worker patchsets, Jason had asked me about supporting the
> case where we don't have a worker while we are swapping workers around. To do
> that I had added rcu around the dev->worker. I removed it in later patchsets
> because I didn't think anyone would use it.
>
> rcu would work for your case and for what Jason had requested.

Yeah, so you already have some patches?

Do you want to send them to solve this problem?
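
For reference, a minimal sketch of the RCU-protected access being discussed (the
field layout and the exact wake-up call are assumptions, not the actual patches):

static void vhost_work_queue_sketch(struct vhost_dev *dev, struct vhost_work *work)
{
        struct vhost_worker *worker;

        rcu_read_lock();
        worker = rcu_dereference(dev->worker);  /* dev->worker annotated __rcu */
        if (worker && worker->vtsk &&
            !test_and_set_bit(VHOST_WORK_QUEUED, &work->flags)) {
                llist_add(&work->node, &worker->work_list);
                /* wake the worker's vhost_task here, as vhost_work_queue()
                 * already does */
        }
        rcu_read_unlock();
}

The writer side would publish the fully initialized worker with
rcu_assign_pointer() and wait for a grace period before freeing it on teardown.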

Thanks,
Stefano


Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-31 Thread Oleg Nesterov
On 05/31, Jason Wang wrote:
>
> 在 2023/5/23 20:15, Oleg Nesterov 写道:
> >
> > /* make sure flag is seen after deletion */
> > smp_wmb();
> > llist_for_each_entry_safe(work, work_next, node, node) {
> > clear_bit(VHOST_WORK_QUEUED, &work->flags);
> >
> >I am not sure about smp_wmb + clear_bit. Once we clear VHOST_WORK_QUEUED,
> >vhost_work_queue() can add this work again and change work->node->next.
> >
> >That is why we use _safe, but we need to ensure that llist_for_each_safe()
> >completes LOAD(work->node->next) before VHOST_WORK_QUEUED is cleared.
>
> This should be fine since store is not speculated, so work->node->next needs
> to be loaded before VHOST_WORK_QUEUED is cleared to meet the loop condition.

I don't understand you. OK, to simplify, suppose we have 2 global vars

void *PTR = something_non_null;
unsigned long FLAGS = -1ul;

Now I think this code

CPU_0                                   CPU_1

void *ptr = PTR;                        if (!test_and_set_bit(0, FLAGS))
clear_bit(0, FLAGS);                            PTR = NULL;
BUG_ON(!ptr);

is racy and can hit the BUG_ON(!ptr).

I guess it is fine on x86, but in general you need smp_mb__before_atomic()
before clear_bit(), or clear_bit_unlock().
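
Concretely, a sketch of that ordering fix applied to the loop quoted above
(illustration only; clear_bit_unlock() would be the other option):

        /* make sure flag is seen after deletion */
        smp_wmb();
        llist_for_each_entry_safe(work, work_next, node, node) {
                /*
                 * Order the iterator's load of work->node.next before the
                 * bit is cleared, so a concurrent vhost_work_queue() cannot
                 * rewrite ->next while we still depend on it.
                 */
                smp_mb__before_atomic();
                clear_bit(VHOST_WORK_QUEUED, &work->flags);
                /* ... run work->fn(work) etc. as in the original loop ... */
        }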

> > __set_current_state(TASK_RUNNING);
> >
> >Why do we set TASK_RUNNING inside the loop? Does this mean that work->fn()
> >can return with current->state != RUNNING ?
>
> It is because the state was set to TASK_INTERRUPTIBLE at the beginning of
> the loop; otherwise it might have side effects while executing work->fn().

Again, I don't understand you. So let me repeat: can work->fn() return with
current->_state != TASK_RUNNING ? If not (and I'd say it should not), you can
do __set_current_state(TASK_RUNNING) once, before llist_for_each_entry_safe().
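
i.e. roughly (a sketch of the suggested restructuring, not the actual code):

        __set_current_state(TASK_RUNNING);      /* once, before the walk */
        llist_for_each_entry_safe(work, work_next, node, node) {
                clear_bit(VHOST_WORK_QUEUED, &work->flags);
                work->fn(work);
                cond_resched();
        }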

> >Now the main question. Whatever we do, SIGKILL/SIGSTOP/etc can come right
> >before we call work->fn(). Is it "safe" to run this callback with
> >signal_pending() or fatal_signal_pending() ?
>
> It looks safe since:
>
> 1) vhost holds a refcnt of the mm
> 2) release will sync with the worker

Well, that's not what I asked... nevermind, please forget.

Thanks.

Oleg.
