Re: [PATCH 1/2] vdpa: Add support for querying statistics

2021-11-22 Thread Jason Wang
On Mon, Nov 22, 2021 at 3:56 PM Eli Cohen  wrote:
>
> On Mon, Nov 22, 2021 at 10:30:12AM +0800, Jason Wang wrote:
> > On Fri, Nov 19, 2021 at 11:09 AM Parav Pandit  wrote:
> > >
> > >
> > >
> > > > From: Jason Wang 
> > > > Sent: Friday, November 19, 2021 8:12 AM
> > > >
> > > > On Thu, Nov 18, 2021 at 1:58 PM Eli Cohen  wrote:
> > > > >
> > > > > Add support for querying virtqueue statistics. Supported statistics are:
> > > > >
> > > > > received_desc - number of descriptors received for the virtqueue
> > > > > completed_desc - number of descriptors completed for the virtqueue
> > > > >
> > > > > A new callback was added to vdpa_config_ops which provides the means
> > > > > for the vdpa driver to return statistics results.
> > > > >
> > > > > The interface allows for reading all the supported virtqueues,
> > > > > including the control virtqueue if it exists, by returning the next
> > > > > queue index to query.
> > > > >
> > > > > Examples:
> > > > > 1. Read statistics for the virtqueue at index 1:
> > > > > $ vdpa dev stats show vdpa-a index 1
> > > > > vdpa-a:
> > > > > index 1 tx received_desc 21 completed_desc 21
> > > > >
> > > > > 2. Read statistics for all the virtqueues:
> > > > > $ vdpa dev stats show vdpa-a
> > > > > vdpa-a:
> > > > > index 0 rx received_desc 256 completed_desc 12
> > > > > index 1 tx received_desc 21 completed_desc 21
> > > > > index 2 ctrl received_desc 0 completed_desc 0
> > > >
> > > > Adding Adrian and Laurent.
> > > >
> > > > It's quite useful, but I think these are vendor-specific statistics.
> > > These are vdpa device statistics specific to Linux,
> > > and are generic VQ counters for all device types.
> >
> > The question is what happens if the parent doesn't implement those 
> > statistics.
> >
>
> Are you suggesting that some parents may support reporting a subset of
> the fields and we should maybe indicate what is actually reported?

It's an open question. E.g. do we want a stable API for those
statistics counters? If yes, it's better not to let the parents
report a subset, but then it forces the exact same counters to be
supported by other vendors. If not, it should be fine. In any case, I
think it's simpler to start with "vendor-stats" and we can switch it
to a standard way if it is agreed by every vendor.

>
> > >
> > > > I wonder if it's better
> > > > to use "vendor" prefix in the protocol, then we use this instead:
> > > >
> > > > vdpa dev vendor-stats show vdpa-a
> > > >
> > > Maybe. Let's evaluate whether the stats in this patch are generic enough or not.
> > >
> > > > Or, if we want to make it standard, is exposing the virtio indices better?
> > > >
> > > > qid 0 last_avail_idx X avail_idx Y last_used_idx M used_idx N
> > > >
> > > I did consider this option a while back. Showing indices is equally useful.
> > > I think we should show that as vq info, along with other VQ attributes
> > > (addr, len).
> >
> > That may work, but it looks to me that received_desc/completed_desc are
> > also per vq.
> >
> > Another question is whether it is more useful to count buffers instead
> > of descriptors. E.g. how are indirect descriptors counted?
> >
>
> I think it's descriptors, so indirect descriptors are counted once, but
> I need to verify that.
>
> > > $ vdpa dev show vq
> > >
> > > But the indices shown are not statistics; they are more the current
> > > state of the queue.
> > > For example, rollover of the indices won't convey the absolute number
> > > of descriptors processed for the queue.
> > > And even if we make them u64 (not good), a non-developer end user
> > > needs to keep using a side calculator to compute the delta.
> >
> > How about exposing those raw indices via the protocol and letting the
> > vdpa tool calculate for us?
> >
>
> Counters are 16 bit per the virtio spec, so I don't know how you could
> handle rollover without losing information.

So at most one rollover, I guess. Then it's not hard to detect by
comparing the indices. E.g. if last_avail_idx (the device's avail
idx) > avail_idx, we know there's been one?
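[Editorial note] The wraparound comparison above works because the virtio ring indices are free-running 16-bit counters: a delta taken modulo 2^16 is exact as long as at most one rollover happened between two samples. A quick userspace sketch of that arithmetic (illustrative only, not code from the patch):

```python
# Illustrative sketch: virtio ring indices are free-running 16-bit counters.
# A delta between two samples is exact as long as fewer than 2**16
# descriptors were processed between the samples (at most one wrap).

def ring_delta(new_idx: int, old_idx: int) -> int:
    """Descriptors processed between two 16-bit index samples."""
    return (new_idx - old_idx) & 0xFFFF

# No wrap: plain subtraction.
assert ring_delta(100, 40) == 60
# One wrap: 0xFFFE -> 0x0003 is 5 descriptors, not a negative number.
assert ring_delta(0x0003, 0xFFFE) == 5
```

As Parav notes later in the thread, this only yields deltas between sufficiently frequent samples; an absolute lifetime total cannot be reconstructed from the 16-bit indices alone.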

Thanks

>
> > Thanks
> >
> > >
> > > So I think useful q indices belong to q info.
> > >
> >
>

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH V5 02/10] fork/vm: Move common PF_IO_WORKER behavior to new flag

2021-11-22 Thread Geert Uytterhoeven
On Sun, Nov 21, 2021 at 6:49 PM Mike Christie
 wrote:
> This adds a new flag, PF_USER_WORKER, that's used for behavior common
> to both PF_IO_WORKER and users like vhost which will use the new
> kernel_worker helpers that will use the flag and are added later in
> this patchset.
>
> The common behavior PF_USER_WORKER covers is the initial frame and fpu
> setup and the vm reclaim handling.
>
> Signed-off-by: Mike Christie 

>  arch/m68k/kernel/process.c   | 2 +-

Acked-by: Geert Uytterhoeven 

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds


Re: [PATCH 1/2] vdpa: Add support for querying statistics

2021-11-22 Thread Jason Wang
On Mon, Nov 22, 2021 at 5:08 PM Eli Cohen  wrote:
>
> On Mon, Nov 22, 2021 at 04:07:24PM +0800, Jason Wang wrote:
> > On Mon, Nov 22, 2021 at 3:56 PM Eli Cohen  wrote:
> > >
> > > On Mon, Nov 22, 2021 at 10:30:12AM +0800, Jason Wang wrote:
> > > > On Fri, Nov 19, 2021 at 11:09 AM Parav Pandit  wrote:
> > > > >
> > > > >
> > > > >
> > > > > > From: Jason Wang 
> > > > > > Sent: Friday, November 19, 2021 8:12 AM
> > > > > >
> > > > > > On Thu, Nov 18, 2021 at 1:58 PM Eli Cohen  wrote:
> > > > > > >
> > > > > > > Add support for querying virtqueue statistics. Supported statistics are:
> > > > > > >
> > > > > > > received_desc - number of descriptors received for the virtqueue
> > > > > > > completed_desc - number of descriptors completed for the virtqueue
> > > > > > >
> > > > > > > A new callback was added to vdpa_config_ops which provides the 
> > > > > > > means
> > > > > > > for the vdpa driver to return statistics results.
> > > > > > >
> > > > > > > The interface allows for reading all the supported virtqueues,
> > > > > > > including the control virtqueue if it exists, by returning the 
> > > > > > > next
> > > > > > > queue index to query.
> > > > > > >
> > > > > > > Examples:
> > > > > > > 1. Read statistics for the virtqueue at index 1:
> > > > > > > $ vdpa dev stats show vdpa-a index 1
> > > > > > > vdpa-a:
> > > > > > > index 1 tx received_desc 21 completed_desc 21
> > > > > > >
> > > > > > > 2. Read statistics for all the virtqueues:
> > > > > > > $ vdpa dev stats show vdpa-a
> > > > > > > vdpa-a:
> > > > > > > index 0 rx received_desc 256 completed_desc 12
> > > > > > > index 1 tx received_desc 21 completed_desc 21
> > > > > > > index 2 ctrl received_desc 0 completed_desc 0
> > > > > >
> > > > > > Adding Adrian and Laurent.
> > > > > >
> > > > > > It's quite useful, but I think these are vendor-specific statistics.
> > > > > These are vdpa device statistics specific to Linux,
> > > > > and are generic VQ counters for all device types.
> > > >
> > > > The question is what happens if the parent doesn't implement those 
> > > > statistics.
> > > >
> > >
> > > Are you suggesting that some parents may support reporting a subset of
> > > the fields and we should maybe indicate what is actually reported?
> >
> > It's an open question. E.g. do we want a stable API for those
> > statistics counters? If yes, it's better not to let the parents
> > report a subset, but then it forces the exact same counters to be
> > supported by other vendors. If not, it should be fine. In any case, I
> > think it's simpler to start with "vendor-stats" and we can switch it
> > to a standard way if it is agreed by every vendor.
> >
>
> received and completed descriptors are very basic and I assume any
> vendor would support those.

At least they are not defined by virtio itself.

> If we go with vendor stats, how can we communicate the information to
> userspace? Currently we use defined netlink attributes to pass this
> information.

It can be done exactly as in this patch; we can document it as vendor
stats.

>
> > >
> > > > >
> > > > > > I wonder if it's better
> > > > > > to use "vendor" prefix in the protocol, then we use this instead:
> > > > > >
> > > > > > vdpa dev vendor-stats show vdpa-a
> > > > > >
> > > > > Maybe. Let's evaluate whether the stats in this patch are generic
> > > > > enough or not.
> > > > >
> > > > > > Or, if we want to make it standard, is exposing the virtio indices better?
> > > > > >
> > > > > > qid 0 last_avail_idx X avail_idx Y last_used_idx M used_idx N
> > > > > >
> > > > > I did consider this option a while back. Showing indices is
> > > > > equally useful.
> > > > > I think we should show that as vq info, along with other VQ
> > > > > attributes (addr, len).
> > > >
> > > > That may work, but it looks to me that received_desc/completed_desc
> > > > are also per vq.
> > > >
> > > > Another question is whether it is more useful to count buffers
> > > > instead of descriptors. E.g. how are indirect descriptors counted?
> > > >
> > >
> > > I think it's descriptors, so indirect descriptors are counted once,
> > > but I need to verify that.
> > >
> > > > > $ vdpa dev show vq
> > > > >
> > > > > But the indices shown are not statistics; they are more the
> > > > > current state of the queue.
> > > > > For example, rollover of the indices won't convey the absolute
> > > > > number of descriptors processed for the queue.
> > > > > And even if we make them u64 (not good), a non-developer end user
> > > > > needs to keep using a side calculator to compute the delta.
> > > >
> > > > How about exposing those raw indices via the protocol and letting the
> > > > vdpa tool calculate for us?
> > > >
> > >
> > > Counters are 16 bit per the virtio spec, so I don't know how you could
> > > handle rollover without losing information.
> >
> > So at most one rollover, I guess. Then it's not hard to detect by
> > comparing the indices. E.g. if last_avail_idx (the device's avail
> > idx) > avail_idx, we know there's been one?
> >
>
> I am no

[PATCH] vsock/virtio: suppress used length validation

2021-11-22 Thread Michael S. Tsirkin
It turns out that vhost vsock violates the virtio spec
by supplying the out buffer length in the used length
(should just be the in length).
As a result, attempts to validate the used length fail with:
vmw_vsock_virtio_transport virtio1: tx: used len 44 is larger than in buflen 0

Since the vsock driver does not use the length for tx and
validates the length before use for rx, it is safe to
suppress the validation in virtio core for this driver.

Reported-by: Halil Pasic 
Fixes: 939779f5152d ("virtio_ring: validate used buffer length")
Cc: "Jason Wang" 
Signed-off-by: Michael S. Tsirkin 
---
 net/vmw_vsock/virtio_transport.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 4f7c99dfd16c..3f82b2f1e6dd 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -731,6 +731,7 @@ static unsigned int features[] = {
 static struct virtio_driver virtio_vsock_driver = {
.feature_table = features,
.feature_table_size = ARRAY_SIZE(features),
+   .suppress_used_validation = true,
.driver.name = KBUILD_MODNAME,
.driver.owner = THIS_MODULE,
.id_table = id_table,
-- 
MST
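[Editorial note] Per the virtio spec, the used length is the total of bytes the device wrote into the buffer, i.e. only into device-writable ("in") descriptors; a tx chain is entirely device-readable, so its correct used length is 0. A small model of that rule (hypothetical helper, not kernel code):

```python
# Model of the spec rule "len is the total of bytes written into the buffer".
# A descriptor chain is a list of (length, device_writable) pairs; the device
# may only write into device-writable ("in") descriptors.

def spec_used_len(chain, bytes_written):
    """Value a spec-conforming device should report as the used length.

    Clamps the written-byte count to the chain's in-buffer capacity, which
    is 0 when the chain has no writable descriptors (the vsock tx case).
    """
    in_capacity = sum(length for length, writable in chain if writable)
    return min(bytes_written, in_capacity)

# vsock tx: hdr + payload, all device-readable -> correct used len is 0,
# whereas vhost-vsock mistakenly reported hdr + payload size (e.g. 44).
tx_chain = [(44, False)]
assert spec_used_len(tx_chain, 44) == 0

# vsock rx: device-writable buffer; used len is the bytes actually written.
rx_chain = [(4096, True)]
assert spec_used_len(rx_chain, 60) == 60
```

This is exactly why the driver-side validation trips: vhost-vsock reported a nonzero used length on a chain whose in-buffer capacity is 0.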



Re: [PATCH 6/6] vhost_net: use RCU callbacks instead of synchronize_rcu()

2021-11-22 Thread Michael S. Tsirkin
On Fri, Nov 19, 2021 at 02:32:05PM +0300, Andrey Ryabinin wrote:
> 
> 
> On 11/16/21 8:00 AM, Jason Wang wrote:
> > On Mon, Nov 15, 2021 at 11:32 PM Andrey Ryabinin  
> > wrote:
> >>
> >> Currently vhost_net_release() uses synchronize_rcu() to synchronize
> >> freeing with vhost_zerocopy_callback(). However, synchronize_rcu()
> >> is quite a costly operation. It takes more than 10 seconds
> >> to shut down a qemu launched with a couple of net devices like this:
> >> -netdev tap,id=tap0,..,vhost=on,queues=80
> >> because we end up calling synchronize_rcu() netdev_count*queues times.
> >>
> >> Free vhost net structures in rcu callback instead of using
> >> synchronize_rcu() to fix the problem.
> > 
> > I admit the release code is somewhat hard to understand. But I wonder
> > if the following case can still happen with this:
> > 
> > CPU 0 (vhost_dev_cleanup)   CPU1
> > (vhost_net_zerocopy_callback()->vhost_work_queue())
> > if (!dev->worker)
> > dev->worker = NULL
> > 
> > wake_up_process(dev->worker)
> > 
> > If this is true. It seems the fix is to move RCU synchronization stuff
> > in vhost_net_ubuf_put_and_wait()?
> > 
> 
> It all depends on whether vhost_zerocopy_callback() can be called
> outside of vhost thread context or not. If it can run after the vhost
> thread has stopped, then the race you describe seems possible and the
> fix in commit b0c057ca7e83 ("vhost: fix a theoretical race in device
> cleanup") wasn't complete. I would fix it by calling synchronize_rcu()
> after vhost_net_flush() and before vhost_dev_cleanup().
> 
> As for the performance problem, it can be solved by replacing 
> synchronize_rcu() with synchronize_rcu_expedited().

expedited triggers IPIs on all CPUs though, so it's problematic to
do it upon a userspace syscall.

> But now I'm not sure that this race actually exists and that
> synchronize_rcu() is needed at all.
> I did a bit of testing and I only see callback being called from vhost thread:
> 
> vhost-3724  3733 [002]  2701.768731: probe:vhost_zerocopy_callback: 
> (81af8c10)
> 81af8c11 vhost_zerocopy_callback+0x1 ([kernel.kallsyms])
> 81bb34f6 skb_copy_ubufs+0x256 ([kernel.kallsyms])
> 81bce621 __netif_receive_skb_core.constprop.0+0xac1 
> ([kernel.kallsyms])
> 81bd062d __netif_receive_skb_one_core+0x3d ([kernel.kallsyms])
> 81bd0748 netif_receive_skb+0x38 ([kernel.kallsyms])
> 819a2a1e tun_get_user+0xdce ([kernel.kallsyms])
> 819a2cf4 tun_sendmsg+0xa4 ([kernel.kallsyms])
> 81af9229 handle_tx_zerocopy+0x149 ([kernel.kallsyms])
> 81afaf05 handle_tx+0xc5 ([kernel.kallsyms])
> 81afce86 vhost_worker+0x76 ([kernel.kallsyms])
> 811581e9 kthread+0x169 ([kernel.kallsyms])
> 810018cf ret_from_fork+0x1f ([kernel.kallsyms])
>0 [unknown] ([unknown])
> 
> This means that the callback can't run after kthread_stop() in
> vhost_dev_cleanup(), and no synchronize_rcu() is needed.
> 
> I'm not confident that my quite limited testing covers all possible
> vhost_zerocopy_callback() callstacks.



RE: [PATCH 1/2] vdpa: Add support for querying statistics

2021-11-22 Thread Parav Pandit via Virtualization



> From: Jason Wang 
> Sent: Monday, November 22, 2021 8:00 AM
> 
> On Fri, Nov 19, 2021 at 11:09 AM Parav Pandit  wrote:
> >
> >
> >
> > > From: Jason Wang 
> > > Sent: Friday, November 19, 2021 8:12 AM
> > >
> > > On Thu, Nov 18, 2021 at 1:58 PM Eli Cohen  wrote:
> > > >
> > > > Add support for querying virtqueue statistics. Supported statistics are:
> > > >
> > > > received_desc - number of descriptors received for the virtqueue
> > > > completed_desc - number of descriptors completed for the virtqueue
> > > >
> > > > A new callback was added to vdpa_config_ops which provides the
> > > > means for the vdpa driver to return statistics results.
> > > >
> > > > The interface allows for reading all the supported virtqueues,
> > > > including the control virtqueue if it exists, by returning the
> > > > next queue index to query.
> > > >
> > > > Examples:
> > > > 1. Read statistics for the virtqueue at index 1:
> > > > $ vdpa dev stats show vdpa-a index 1
> > > > vdpa-a:
> > > > index 1 tx received_desc 21 completed_desc 21
> > > >
> > > > 2. Read statistics for all the virtqueues:
> > > > $ vdpa dev stats show vdpa-a
> > > > vdpa-a:
> > > > index 0 rx received_desc 256 completed_desc 12
> > > > index 1 tx received_desc 21 completed_desc 21
> > > > index 2 ctrl received_desc 0 completed_desc 0
> > >
> > > Adding Adrian and Laurent.
> > >
> > > It's quite useful, but I think these are vendor-specific statistics.
> > These are vdpa device statistics specific to Linux,
> > and are generic VQ counters for all device types.
> 
> The question is what happens if the parent doesn't implement those statistics.
Only the statistics that the vendor driver implements are filled in.
If the parent doesn't implement any of them, the usual EOPNOTSUPP is returned.


> 
> >
> > > I wonder if it's better
> > > to use "vendor" prefix in the protocol, then we use this instead:
> > >
> > > vdpa dev vendor-stats show vdpa-a
> > >
> > Maybe. Let's evaluate whether the stats in this patch are generic enough or not.
> >
> > > Or, if we want to make it standard, is exposing the virtio indices better?
> > >
> > > qid 0 last_avail_idx X avail_idx Y last_used_idx M used_idx N
> > >
> > I did consider this option a while back. Showing indices is equally useful.
> > I think we should show that as vq info, along with other VQ attributes
> > (addr, len).
> 
> That may work, but it looks to me that received_desc/completed_desc are
> also per vq.
> 
It is per VQ, but the statistics of a VQ and the VQ config are two
different things, mostly consumed by different entities.
For example, Prometheus and other similar entities will use VQ statistics.
Such statistics collectors often query frequently, and a VQ address is of
no use to them.

VQ addr, indices, etc. will be used more by developers/debuggers.

And it is better not to mix these different kinds of info under one
netlink command.
 
> > $ vdpa dev show vq
> >
> > But the indices shown are not statistics; they are more the current
> > state of the queue.
> > For example, rollover of the indices won't convey the absolute number
> > of descriptors processed for the queue.
> > And even if we make them u64 (not good), a non-developer end user needs
> > to keep using a side calculator to compute the delta.
> 
> How about exposing those raw indices via the protocol and letting the
> vdpa tool calculate for us?
With wraparounds it won't be able to calculate it.


RE: [PATCH 1/2] vdpa: Add support for querying statistics

2021-11-22 Thread Parav Pandit via Virtualization



> From: Jason Wang 
> Sent: Monday, November 22, 2021 3:02 PM
> 
> > If we go with vendor stats, how can we communicate the information to
> > userspace? Currently we use defined netlink attributes to pass this
> > information.
> 
> It can be done exactly as in this patch; we can document it as vendor
> stats.
> 
Yes, attribute to have VENDOR_ prefix in it.
> 
> Ok, I think I get you. So I wonder if it's more useful to use device-specific
> counters. For networking, it could be packets sent/received etc.

Yes, I discussed this with Eli previously as well, as it's more meaningful
for end users.
We just return the device id along with the queue number, which helps to
show tx and rx.
For the ctrl q, it is just ctrl commands and ctrl completions.


Re: [PATCH] vsock/virtio: suppress used length validation

2021-11-22 Thread Stefano Garzarella

On Mon, Nov 22, 2021 at 04:32:01AM -0500, Michael S. Tsirkin wrote:

It turns out that vhost vsock violates the virtio spec
by supplying the out buffer length in the used length
(should just be the in length).
As a result, attempts to validate the used length fail with:
vmw_vsock_virtio_transport virtio1: tx: used len 44 is larger than in buflen 0

Since the vsock driver does not use the length for tx and
validates the length before use for rx, it is safe to
suppress the validation in virtio core for this driver.

Reported-by: Halil Pasic 
Fixes: 939779f5152d ("virtio_ring: validate used buffer length")
Cc: "Jason Wang" 
Signed-off-by: Michael S. Tsirkin 
---
net/vmw_vsock/virtio_transport.c | 1 +
1 file changed, 1 insertion(+)


Thanks for this fix

Reviewed-by: Stefano Garzarella 


I think we should also fix the vhost-vsock violation (in stable branches
too).

@Halil do you plan to send a fix? Otherwise I can do it ;-)

Thanks,
Stefano



Re: [PATCH V5 1/4] virtio_ring: validate used buffer length

2021-11-22 Thread Stefano Garzarella

On Mon, Nov 22, 2021 at 08:55:24AM +0100, Stefano Garzarella wrote:

On Mon, Nov 22, 2021 at 02:25:26PM +0800, Jason Wang wrote:

On Mon, Nov 22, 2021 at 1:49 PM Halil Pasic  wrote:


On Mon, 22 Nov 2021 06:35:18 +0100
Halil Pasic  wrote:


> I think it should be a common issue, looking at
> vhost_vsock_handle_tx_kick(), it did:
>
> len += sizeof(pkt->hdr);
> vhost_add_used(vq, head, len);
>
> which looks like a violation of the spec since it's TX.

I'm not sure the lines above look like a violation of the spec. If you
examine vhost_vsock_alloc_pkt() I believe that you will agree that:
len == pkt->len == pkt->hdr.len
which makes sense since according to the spec both tx and rx messages
are hdr+payload. And I believe hdr.len is the size of the payload,
although that does not seem to be properly documented by the spec.


Sorry for being unclear; what I meant is that we probably should use
zero here. TX doesn't actually use an in buffer.

According to the spec, 0 should be the used length:

"and len the total of bytes written into the buffer."



On the other hand tx messages are stated to be device read-only (in the
spec) so if the device writes stuff, that is certainly wrong.



Yes.


If that is what happens.

Looking at virtqueue_get_buf_ctx_split() I'm not sure that is what
happens. My hypothesis is that we just assume the last descriptor is an
'in' type descriptor (i.e. a device-writable one). For tx that
assumption would be wrong.

I will have another look at this today and send a fix patch if my
suspicion is confirmed.


If my suspicion is right something like:

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 00f64f2f8b72..efb57898920b 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -764,6 +764,7 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue 
*_vq,
   struct vring_virtqueue *vq = to_vvq(_vq);
   void *ret;
   unsigned int i;
+   bool has_in;
   u16 last_used;

   START_USE(vq);
@@ -787,6 +788,9 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue 
*_vq,
   vq->split.vring.used->ring[last_used].id);
   *len = virtio32_to_cpu(_vq->vdev,
   vq->split.vring.used->ring[last_used].len);
+   has_in = virtio16_to_cpu(_vq->vdev,
+   vq->split.vring.used->ring[last_used].flags)
+   & VRING_DESC_F_WRITE;


Did you mean vring.desc, actually? If yes, it's better not to depend on
the descriptor ring, which can be modified by the device. We've stored
the flags in desc_extra[].



   if (unlikely(i >= vq->split.vring.num)) {
   BAD_RING(vq, "id %u out of range\n", i);
@@ -796,7 +800,7 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue 
*_vq,
   BAD_RING(vq, "id %u is not a head!\n", i);
   return NULL;
   }
-   if (vq->buflen && unlikely(*len > vq->buflen[i])) {
+   if (has_in && vq->buflen && unlikely(*len > vq->buflen[i])) {
   BAD_RING(vq, "used len %d is larger than in buflen %u\n",
   *len, vq->buflen[i]);
   return NULL;

would fix the problem for split. I will try that out and let you know
later.


I'm not sure I get this; in virtqueue_add_split(), buflen[i] only
contains the in buffer length.

I think the fixes are:

1) fixing the vhost vsock


Yep, in vhost_vsock_handle_tx_kick() we should have vhost_add_used(vq, 
head, 0) since the device doesn't write anything.



2) use suppress_used_validation=true to let vsock driver to validate
the in buffer length
3) probably a new feature so the driver can only enable the validation
when the feature is enabled.


I fully agree with these steps.


Michael sent a patch to suppress the validation, so I think we should 
just fix vhost-vsock. I mean something like this:


diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 938aefbc75ec..4e3b95af7ee4 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -554,7 +554,7 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work 
*work)
virtio_transport_free_pkt(pkt);

len += sizeof(pkt->hdr);
-   vhost_add_used(vq, head, len);
+   vhost_add_used(vq, head, 0);
total_len += len;
added = true;
} while(likely(!vhost_exceeds_weight(vq, ++pkts, total_len)));

I checked and the problem is there from the first commit, so we should 
add:


Fixes: 433fc58e6bf2 ("VSOCK: Introduce vhost_vsock.ko")

I tested this patch and it works even without suppressing validation in 
the virtio core.  But for backwards compatibility we have to suppress it 
for sure as Michael did.


Maybe we can have a patch just with this change to backport it easily 
and one after to clean up a bit the code that was added after (len, 
total_len).


@Halil Let me know if you want to do it, otherwise I can do it.

Stefano


Re: [PATCH V5 1/4] virtio_ring: validate used buffer length

2021-11-22 Thread Halil Pasic
On Mon, 22 Nov 2021 14:25:26 +0800
Jason Wang  wrote:

> On Mon, Nov 22, 2021 at 1:49 PM Halil Pasic  wrote:
> >
> > On Mon, 22 Nov 2021 06:35:18 +0100
> > Halil Pasic  wrote:
> >  
> > > > I think it should be a common issue, looking at
> > > > vhost_vsock_handle_tx_kick(), it did:
> > > >
> > > > len += sizeof(pkt->hdr);
> > > > vhost_add_used(vq, head, len);
> > > >
> > > > which looks like a violation of the spec since it's TX.  
> > >
> > > I'm not sure the lines above look like a violation of the spec. If you
> > > examine vhost_vsock_alloc_pkt() I believe that you will agree that:
> > > len == pkt->len == pkt->hdr.len
> > > which makes sense since according to the spec both tx and rx messages
> > > are hdr+payload. And I believe hdr.len is the size of the payload,
> > > although that does not seem to be properly documented by the spec.  
> 
> Sorry for being unclear; what I meant is that we probably should use
> zero here. TX doesn't actually use an in buffer.
> 
> According to the spec, 0 should be the used length:
> 
> "and len the total of bytes written into the buffer."

Right, I was wrong. I somehow assumed this is the total length and not
just the number of bytes written.

> 
> > >
> > > On the other hand tx messages are stated to be device read-only (in the
> > > spec) so if the device writes stuff, that is certainly wrong.
> > >  
> 
> Yes.
> 
> > > If that is what happens.
> > >
> > > Looking at virtqueue_get_buf_ctx_split() I'm not sure that is what
> > > happens. My hypothesis is that we just a last descriptor is an 'in'
> > > type descriptor (i.e. a device writable one). For tx that assumption
> > > would be wrong.
> > >
> > > I will have another look at this today and send a fix patch if my
> > > suspicion is confirmed.

Yeah, I didn't remember the semantics of
vq->split.vring.used->ring[last_used].len
correctly, and in fact also how exactly the rings work. So your
objection is correct.

Maybe updating some stuff would make it easier to not make this mistake.

For example the spec and also the linux header says:

/* le32 is used here for ids for padding reasons. */ 
struct virtq_used_elem { 
/* Index of start of used descriptor chain. */ 
le32 id; 
/* Total length of the descriptor chain which was used (written to) */ 
le32 len; 
};

I think that comment isn't as clear as it could be. I would prefer:
/* The number of bytes written into the device writable portion of the
buffer described by the descriptor chain. */

I believe "the descriptor chain which was used" includes both the
descriptors that map the device-read-only and the device-write-only
portions of the buffer described by the descriptor chain. And the
total length of that descriptor chain may be defined either as the
number of descriptors that form the chain or as the length of the
buffer.

One has to use the descriptor chain even if the whole buffer is device
read-only. So "used" == "written to" does not make any sense to me.

Also something like
int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int 
bytes_written)
instead of
int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int len)
would make it easier to read the code correctly.
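[Editorial note] The virtio-core validation discussed in this thread compares the device-reported used length against the in-buffer (device-writable) capacity the driver recorded at add time. A minimal sketch of that check (names are illustrative, not the kernel's):

```python
# Sketch of the used-length validation: at add time the driver records the
# total device-writable ("in") capacity of each buffer; at get-buf time the
# device-reported used length must not exceed it.

def validate_used_len(in_buflen: int, used_len: int) -> bool:
    """Return True if the device-reported used length is plausible."""
    return used_len <= in_buflen

# rx buffer with 4 KiB of writable space: 60 written bytes is fine.
assert validate_used_len(4096, 60)
# tx buffer is all device-readable (in_buflen == 0): reporting hdr+payload
# (44 bytes) trips the check -- the "used len 44 is larger than in buflen 0"
# symptom from the vsock log.
assert not validate_used_len(0, 44)
# The fixed vhost-vsock reports 0 for tx, which passes.
assert validate_used_len(0, 0)
```

With Halil's suggested bytes_written naming, it is immediately clear why 0 is the only valid report for an all-readable tx chain.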

> >
> > If my suspicion is right something like:
> >
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index 00f64f2f8b72..efb57898920b 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -764,6 +764,7 @@ static void *virtqueue_get_buf_ctx_split(struct 
> > virtqueue *_vq,
> > struct vring_virtqueue *vq = to_vvq(_vq);
> > void *ret;
> > unsigned int i;
> > +   bool has_in;
> > u16 last_used;
> >
> > START_USE(vq);
> > @@ -787,6 +788,9 @@ static void *virtqueue_get_buf_ctx_split(struct 
> > virtqueue *_vq,
> > vq->split.vring.used->ring[last_used].id);
> > *len = virtio32_to_cpu(_vq->vdev,
> > vq->split.vring.used->ring[last_used].len);
> > +   has_in = virtio16_to_cpu(_vq->vdev,
> > +   vq->split.vring.used->ring[last_used].flags)
> > +   & VRING_DESC_F_WRITE;  
> 
> Did you mean vring.desc, actually? If yes, it's better not to depend on
> the descriptor ring, which can be modified by the device. We've stored
> the flags in desc_extra[].
> 
> >
> > if (unlikely(i >= vq->split.vring.num)) {
> > BAD_RING(vq, "id %u out of range\n", i);
> > @@ -796,7 +800,7 @@ static void *virtqueue_get_buf_ctx_split(struct 
> > virtqueue *_vq,
> > BAD_RING(vq, "id %u is not a head!\n", i);
> > return NULL;
> > }
> > -   if (vq->buflen && unlikely(*len > vq->buflen[i])) {
> > +   if (has_in && vq->buflen && unlikely(*len > vq->buflen[i])) {
> > BAD_RING(vq, "used len %d is larger than in buflen %u\n",
> > *len, vq->buflen[i]);
> >

Re: [PATCH V5 07/10] io_uring: switch to kernel_worker

2021-11-22 Thread Jens Axboe
On 11/22/21 3:02 AM, Christian Brauner wrote:
> On Sun, Nov 21, 2021 at 11:17:11AM -0700, Jens Axboe wrote:
>> On 11/21/21 10:49 AM, Mike Christie wrote:
>>> Convert io_uring and io-wq to use kernel_worker.
>>
>> I don't like the kernel_worker name, that implies it's always giving you
>> a kernel thread or kthread. That's not the io_uring use case, it's
>> really just a thread off the original task that just happens to never
>> exit to userspace.
>>
>> Can we do a better name? At least io_thread doesn't imply that.
> 
> Yeah, I had thought about that as well and at first had kernel_uworker()
> locally but wasn't convinced. Maybe we should just make it
> create_user_worker()?

That's better, or maybe even create_user_inkernel_thread() or something?
Pretty long, though... I'd be fine with create_user_worker().

-- 
Jens Axboe



Re: [PATCH V5 1/4] virtio_ring: validate used buffer length

2021-11-22 Thread Halil Pasic
On Mon, 22 Nov 2021 12:08:22 +0100
Stefano Garzarella  wrote:

> On Mon, Nov 22, 2021 at 08:55:24AM +0100, Stefano Garzarella wrote:
> >On Mon, Nov 22, 2021 at 02:25:26PM +0800, Jason Wang wrote:  
> >>On Mon, Nov 22, 2021 at 1:49 PM Halil Pasic  wrote:  
> >>>
> >>>On Mon, 22 Nov 2021 06:35:18 +0100
> >>>Halil Pasic  wrote:
> >>>  
>  > I think it should be a common issue, looking at
>  > vhost_vsock_handle_tx_kick(), it did:
>  >
>  > len += sizeof(pkt->hdr);
>  > vhost_add_used(vq, head, len);
>  >
>  > which looks like a violation of the spec since it's TX.  
> 
>  I'm not sure the lines above look like a violation of the spec. If you
>  examine vhost_vsock_alloc_pkt() I believe that you will agree that:
>  len == pkt->len == pkt->hdr.len
>  which makes sense since according to the spec both tx and rx messages
>  are hdr+payload. And I believe hdr.len is the size of the payload,
>  although that does not seem to be properly documented by the spec.  
> >>
> >>Sorry for being unclear, what I meant is that we probably should use
> >>zero here. TX doesn't use in buffer actually.
> >>
> >>According to the spec, 0 should be the used length:
> >>
> >>"and len the total of bytes written into the buffer."
> >>  
> 
>  On the other hand tx messages are stated to be device read-only (in the
>  spec) so if the device writes stuff, that is certainly wrong.
>   
> >>
> >>Yes.
> >>  
>  If that is what happens.
> 
>  Looking at virtqueue_get_buf_ctx_split() I'm not sure that is what
>  happens. My hypothesis is that we just assume the last descriptor is an
>  'in' type descriptor (i.e. a device writable one). For tx that assumption
>  would be wrong.
> 
>  I will have another look at this today and send a fix patch if my
>  suspicion is confirmed.  
> >>>
> >>>If my suspicion is right something like:
> >>>
> >>>diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> >>>index 00f64f2f8b72..efb57898920b 100644
> >>>--- a/drivers/virtio/virtio_ring.c
> >>>+++ b/drivers/virtio/virtio_ring.c
> >>>@@ -764,6 +764,7 @@ static void *virtqueue_get_buf_ctx_split(struct 
> >>>virtqueue *_vq,
> >>>struct vring_virtqueue *vq = to_vvq(_vq);
> >>>void *ret;
> >>>unsigned int i;
> >>>+   bool has_in;
> >>>u16 last_used;
> >>>
> >>>START_USE(vq);
> >>>@@ -787,6 +788,9 @@ static void *virtqueue_get_buf_ctx_split(struct 
> >>>virtqueue *_vq,
> >>>vq->split.vring.used->ring[last_used].id);
> >>>*len = virtio32_to_cpu(_vq->vdev,
> >>>vq->split.vring.used->ring[last_used].len);
> >>>+   has_in = virtio16_to_cpu(_vq->vdev,
> >>>+   vq->split.vring.used->ring[last_used].flags)
> >>>+   & VRING_DESC_F_WRITE;  
> >>
> >>Did you mean vring.desc actually? If yes, it's better not depend on
> >>the descriptor ring which can be modified by the device. We've stored
> >>the flags in desc_extra[].
> >>  
> >>>
> >>>if (unlikely(i >= vq->split.vring.num)) {
> >>>BAD_RING(vq, "id %u out of range\n", i);
> >>>@@ -796,7 +800,7 @@ static void *virtqueue_get_buf_ctx_split(struct 
> >>>virtqueue *_vq,
> >>>BAD_RING(vq, "id %u is not a head!\n", i);
> >>>return NULL;
> >>>}
> >>>-   if (vq->buflen && unlikely(*len > vq->buflen[i])) {
> >>>+   if (has_in && vq->buflen && unlikely(*len > vq->buflen[i])) {
> >>>BAD_RING(vq, "used len %d is larger than in buflen %u\n",
> >>>*len, vq->buflen[i]);
> >>>return NULL;
> >>>
> >>>would fix the problem for split. I will try that out and let you know
> >>>later.  
> >>
> >>I'm not sure I get this, in virtqueue_add_split, the buflen[i] only
> >>contains the in buffer length.
> >>
> >>I think the fixes are:
> >>
> >>1) fixing the vhost vsock  
> >
> >Yep, in vhost_vsock_handle_tx_kick() we should have vhost_add_used(vq, 
> >head, 0) since the device doesn't write anything.
> >  
> >>2) use suppress_used_validation=true to let vsock driver to validate
> >>the in buffer length
> >>3) probably a new feature so the driver can only enable the validation
> >>when the feature is enabled.  
> >
> >I fully agree with these steps.  
> 
> Michael sent a patch to suppress the validation, so I think we should 
> just fix vhost-vsock. I mean something like this:
> 
> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> index 938aefbc75ec..4e3b95af7ee4 100644
> --- a/drivers/vhost/vsock.c
> +++ b/drivers/vhost/vsock.c
> @@ -554,7 +554,7 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work 
> *work)
>  virtio_transport_free_pkt(pkt);
> 
>  len += sizeof(pkt->hdr);
> -   vhost_add_used(vq, head, len);
> +   vhost_add_used(vq, head, 0);
>  total_len += len;

RE: [PATCH 1/2] vdpa: Add support for querying statistics

2021-11-22 Thread Parav Pandit via Virtualization



> From: Eli Cohen 
> Sent: Monday, November 22, 2021 8:37 PM
> 
> On Mon, Nov 22, 2021 at 12:15:36PM +0200, Parav Pandit wrote:
> >
> >
> > > From: Jason Wang 
> > > Sent: Monday, November 22, 2021 3:02 PM
> > >
> > > > If we go with vendor stats, how can we communicate the information
> > > > > to userspace? Currently we use netlink attributes defined to pass
> > > > this information.
> > >
> > > It can be done exactly as what has been done in the patch, we can
> > > document it as vendor stats.
> > >
> > Yes, attribute to have VENDOR_ prefix in it.
> > >
> > > Ok, I think I get you. So I wonder if it's more useful to use device
> > > > specific counters. For networking, it could be packets sent/received etc.
> >
> > Yes, I equally discussed this previously with Eli as it's more meaningful 
> > for end
> users.
> > We just return the device id of it along with queue number that helps to 
> > show
> tx and rx.
> > For ctrl q, it is just ctrl commands and ctrl completions.
> 
> I don't think we should mix send/receive packets for descriptor statistics. 
> The
> hardware could process a descriptor and still not transmit any packet.
> 
> We can add packets send/recv but descriptor statistics have their own value.
> 
Oh right. I read Jason's comment of _packets_ too fast. I meant to say 
send/receive descriptors.
I guess you already named them as tx and rx. Didn't review the patches in this 
series yet.

> To summarize, I can add the VENDOR_ prefix to the attributes and re-send or is
> there anything else you think should change?
VENDOR_ prefix and "vstats" as the iproute2 command look fine to me.



Re: [PATCH V5 1/4] virtio_ring: validate used buffer length

2021-11-22 Thread Stefano Garzarella

On Mon, Nov 22, 2021 at 03:24:32PM +0100, Halil Pasic wrote:

On Mon, 22 Nov 2021 12:08:22 +0100
Stefano Garzarella  wrote:


On Mon, Nov 22, 2021 at 08:55:24AM +0100, Stefano Garzarella wrote:
>On Mon, Nov 22, 2021 at 02:25:26PM +0800, Jason Wang wrote:
>>On Mon, Nov 22, 2021 at 1:49 PM Halil Pasic  wrote:
>>>
>>>On Mon, 22 Nov 2021 06:35:18 +0100
>>>Halil Pasic  wrote:
>>>
 > I think it should be a common issue, looking at
 > vhost_vsock_handle_tx_kick(), it did:
 >
 > len += sizeof(pkt->hdr);
 > vhost_add_used(vq, head, len);
 >
 > which looks like a violation of the spec since it's TX.

 I'm not sure the lines above look like a violation of the spec. If you
 examine vhost_vsock_alloc_pkt() I believe that you will agree that:
 len == pkt->len == pkt->hdr.len
 which makes sense since according to the spec both tx and rx messages
 are hdr+payload. And I believe hdr.len is the size of the payload,
 although that does not seem to be properly documented by the spec.
>>
>>Sorry for being unclear, what I meant is that we probably should use
>>zero here. TX doesn't use in buffer actually.
>>
>>According to the spec, 0 should be the used length:
>>
>>"and len the total of bytes written into the buffer."
>>

 On the other hand tx messages are stated to be device read-only (in the
 spec) so if the device writes stuff, that is certainly wrong.

>>
>>Yes.
>>
 If that is what happens.

 Looking at virtqueue_get_buf_ctx_split() I'm not sure that is what
 happens. My hypothesis is that we just assume the last descriptor is an
 'in' type descriptor (i.e. a device writable one). For tx that assumption
 would be wrong.

 I will have another look at this today and send a fix patch if my
 suspicion is confirmed.
>>>
>>>If my suspicion is right something like:
>>>
>>>diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>>index 00f64f2f8b72..efb57898920b 100644
>>>--- a/drivers/virtio/virtio_ring.c
>>>+++ b/drivers/virtio/virtio_ring.c
>>>@@ -764,6 +764,7 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
>>>struct vring_virtqueue *vq = to_vvq(_vq);
>>>void *ret;
>>>unsigned int i;
>>>+   bool has_in;
>>>u16 last_used;
>>>
>>>START_USE(vq);
>>>@@ -787,6 +788,9 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
>>>vq->split.vring.used->ring[last_used].id);
>>>*len = virtio32_to_cpu(_vq->vdev,
>>>vq->split.vring.used->ring[last_used].len);
>>>+   has_in = virtio16_to_cpu(_vq->vdev,
>>>+   vq->split.vring.used->ring[last_used].flags)
>>>+   & VRING_DESC_F_WRITE;
>>
>>Did you mean vring.desc actually? If yes, it's better not depend on
>>the descriptor ring which can be modified by the device. We've stored
>>the flags in desc_extra[].
>>
>>>
>>>if (unlikely(i >= vq->split.vring.num)) {
>>>BAD_RING(vq, "id %u out of range\n", i);
>>>@@ -796,7 +800,7 @@ static void *virtqueue_get_buf_ctx_split(struct virtqueue *_vq,
>>>BAD_RING(vq, "id %u is not a head!\n", i);
>>>return NULL;
>>>}
>>>-   if (vq->buflen && unlikely(*len > vq->buflen[i])) {
>>>+   if (has_in && vq->buflen && unlikely(*len > vq->buflen[i])) {
>>>BAD_RING(vq, "used len %d is larger than in buflen %u\n",
>>>*len, vq->buflen[i]);
>>>return NULL;
>>>
>>>would fix the problem for split. I will try that out and let you know
>>>later.
>>
>>I'm not sure I get this, in virtqueue_add_split, the buflen[i] only
>>contains the in buffer length.
>>
>>I think the fixes are:
>>
>>1) fixing the vhost vsock
>
>Yep, in vhost_vsock_handle_tx_kick() we should have vhost_add_used(vq,
>head, 0) since the device doesn't write anything.
>
>>2) use suppress_used_validation=true to let vsock driver to validate
>>the in buffer length
>>3) probably a new feature so the driver can only enable the validation
>>when the feature is enabled.
>
>I fully agree with these steps.

Michael sent a patch to suppress the validation, so I think we should
just fix vhost-vsock. I mean something like this:

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 938aefbc75ec..4e3b95af7ee4 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -554,7 +554,7 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work 
*work)
 virtio_transport_free_pkt(pkt);

 len += sizeof(pkt->hdr);
-   vhost_add_used(vq, head, len);
+   vhost_add_used(vq, head, 0);
 total_len += len;
 added = true;
 } while(likely(!vhost_exceeds_weight(vq, ++pkts, total_len)));

I checked and the problem is there from the first commit, so we should
add:

Fixes: 433fc58e6bf2 ("VSOCK: Introduce vhost_vsock.ko")

[PATCH 0/2] vhost/vsock: fix used length and cleanup in vhost_vsock_handle_tx_kick()

2021-11-22 Thread Stefano Garzarella
This is a follow-up to Michael's patch [1] and the discussion with Halil and
Jason [2].

I made two patches, one to fix the problem and one for cleanup. This should
simplify the backport of the fix because we've had the problem since
vhost-vsock was introduced (v4.8) and that part has been touched a bit
recently.

Thanks,
Stefano

[1] 
https://lore.kernel.org/virtualization/20211122105822.onarsa4sydzxqynu@steredhat/T/#t
[2] 
https://lore.kernel.org/virtualization/20211027022107.14357-1-jasow...@redhat.com/T/#t

Stefano Garzarella (2):
  vhost/vsock: fix incorrect used length reported to the guest
  vhost/vsock: cleanup removing `len` variable

 drivers/vhost/vsock.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

-- 
2.31.1



[PATCH 1/2] vhost/vsock: fix incorrect used length reported to the guest

2021-11-22 Thread Stefano Garzarella
The "used length" reported by calling vhost_add_used() must be the
number of bytes written by the device (using "in" buffers).

In vhost_vsock_handle_tx_kick() the device only reads the guest
buffers (they are all "out" buffers), without writing anything,
so we must pass 0 as "used length" to comply with the virtio spec.

Fixes: 433fc58e6bf2 ("VSOCK: Introduce vhost_vsock.ko")
Cc: sta...@vger.kernel.org
Reported-by: Halil Pasic 
Suggested-by: Jason Wang 
Signed-off-by: Stefano Garzarella 
---
 drivers/vhost/vsock.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 938aefbc75ec..4e3b95af7ee4 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -554,7 +554,7 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work 
*work)
virtio_transport_free_pkt(pkt);
 
len += sizeof(pkt->hdr);
-   vhost_add_used(vq, head, len);
+   vhost_add_used(vq, head, 0);
total_len += len;
added = true;
} while(likely(!vhost_exceeds_weight(vq, ++pkts, total_len)));
-- 
2.31.1



[PATCH 2/2] vhost/vsock: cleanup removing `len` variable

2021-11-22 Thread Stefano Garzarella
We can increment `total_len` directly and remove `len` since it
is no longer used for vhost_add_used().

Signed-off-by: Stefano Garzarella 
---
 drivers/vhost/vsock.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 4e3b95af7ee4..d6ca1c7ad513 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -511,8 +511,6 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work 
*work)
 
vhost_disable_notify(&vsock->dev, vq);
do {
-   u32 len;
-
if (!vhost_vsock_more_replies(vsock)) {
/* Stop tx until the device processes already
 * pending replies.  Leave tx virtqueue
@@ -540,7 +538,7 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work 
*work)
continue;
}
 
-   len = pkt->len;
+   total_len += sizeof(pkt->hdr) + pkt->len;
 
/* Deliver to monitoring devices all received packets */
virtio_transport_deliver_tap_pkt(pkt);
@@ -553,9 +551,7 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work 
*work)
else
virtio_transport_free_pkt(pkt);
 
-   len += sizeof(pkt->hdr);
vhost_add_used(vq, head, 0);
-   total_len += len;
added = true;
} while(likely(!vhost_exceeds_weight(vq, ++pkts, total_len)));
 
-- 
2.31.1



Re: [PATCH V5 07/10] io_uring: switch to kernel_worker

2021-11-22 Thread michael . christie
On 11/22/21 8:20 AM, Jens Axboe wrote:
> On 11/22/21 3:02 AM, Christian Brauner wrote:
>> On Sun, Nov 21, 2021 at 11:17:11AM -0700, Jens Axboe wrote:
>>> On 11/21/21 10:49 AM, Mike Christie wrote:
 Convert io_uring and io-wq to use kernel_worker.
>>>
>>> I don't like the kernel_worker name, that implies it's always giving you
>>> a kernel thread or kthread. That's not the io_uring use case, it's
>>> really just a thread off the original task that just happens to never
>>> exit to userspace.
>>>
>>> Can we do a better name? At least io_thread doesn't imply that.
>>
>> Yeah, I had thought about that as well and at first had kernel_uworker()
>> locally but wasn't convinced. Maybe we should just make it
>> create_user_worker()?
> 
> That's better, or maybe even create_user_inkernel_thread() or something?
> Pretty long, though... I'd be fine with create_user_worker().
> 

Ok, I'll do:

create_user_worker()
start_user_worker()

since you guys agree. It will also match the PF flag naming.

I'll also add more details to the commit message you requested.


Re: [PATCH V5 1/4] virtio_ring: validate used buffer length

2021-11-22 Thread Halil Pasic
On Mon, 22 Nov 2021 14:25:26 +0800
Jason Wang  wrote:

> I think the fixes are:
> 
> 1) fixing the vhost vsock
> 2) use suppress_used_validation=true to let vsock driver to validate
> the in buffer length
> 3) probably a new feature so the driver can only enable the validation
> when the feature is enabled.

I'm not sure I would consider a F_DEV_Y_FIXED_BUG_X a perfectly good
feature. Frankly the set of such bugs is device implementation
specific and it makes little sense to specify a feature bit
that says the device implementation claims to adhere to some
aspect of the specification. Also what would be the semantic
of not negotiating F_DEV_Y_FIXED_BUG_X?

On the other hand I see no other way to keep the validation
permanently enabled for fixed implementations, and get around the problem
with broken implementations. So we could have something like
VHOST_USED_LEN_STRICT.

Maybe, we can also think of 'warn and don't alter behavior' instead of
'warn' and alter behavior. Or maybe even not having such checks on in
production, but only when testing.

Regards,
Halil


Re: [PATCH] vdpa_sim: avoid putting an uninitialized iova_domain

2021-11-22 Thread Jason Wang
On Mon, Nov 22, 2021 at 8:22 PM Longpeng(Mike)  wrote:
>
> From: Longpeng 
>
> The system will crash if we put an uninitialized iova_domain, this
> could happen when an error occurs before initializing the iova_domain
> in vdpasim_create().
>
> BUG: kernel NULL pointer dereference, address: 
> ...
> RIP: 0010:__cpuhp_state_remove_instance+0x96/0x1c0
> ...
> Call Trace:
>  
>  put_iova_domain+0x29/0x220
>  vdpasim_free+0xd1/0x120 [vdpa_sim]
>  vdpa_release_dev+0x21/0x40 [vdpa]
>  device_release+0x33/0x90
>  kobject_release+0x63/0x160
>  vdpasim_create+0x127/0x2a0 [vdpa_sim]
>  vdpasim_net_dev_add+0x7d/0xfe [vdpa_sim_net]
>  vdpa_nl_cmd_dev_add_set_doit+0xe1/0x1a0 [vdpa]
>  genl_family_rcv_msg_doit+0x112/0x140
>  genl_rcv_msg+0xdf/0x1d0
>  ...
>
> So we must make sure the iova_domain is already initialized before
> put it.
>
> In addition, we may get the following warning in this case:
> WARNING: ... drivers/iommu/iova.c:344 iova_cache_put+0x58/0x70
>
> So we must make sure the iova_cache_put() is invoked only if the
> iova_cache_get() is already invoked. Let's fix it together.
>
> Signed-off-by: Longpeng 

Acked-by: Jason Wang 

> ---
>  drivers/vdpa/vdpa_sim/vdpa_sim.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c 
> b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> index 5f484fff8dbe..41b0cd17fcba 100644
> --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
> +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> @@ -591,8 +591,11 @@ static void vdpasim_free(struct vdpa_device *vdpa)
> vringh_kiov_cleanup(&vdpasim->vqs[i].in_iov);
> }
>
> -   put_iova_domain(&vdpasim->iova);
> -   iova_cache_put();
> +   if (vdpa_get_dma_dev(vdpa)) {
> +   put_iova_domain(&vdpasim->iova);
> +   iova_cache_put();
> +   }
> +
> kvfree(vdpasim->buffer);
> if (vdpasim->iommu)
> vhost_iotlb_free(vdpasim->iommu);
> --
> 2.27.0
>



Re: [PATCH V5 1/4] virtio_ring: validate used buffer length

2021-11-22 Thread Jason Wang
On Tue, Nov 23, 2021 at 4:24 AM Halil Pasic  wrote:
>
> On Mon, 22 Nov 2021 14:25:26 +0800
> Jason Wang  wrote:
>
> > I think the fixes are:
> >
> > 1) fixing the vhost vsock
> > 2) use suppress_used_validation=true to let vsock driver to validate
> > the in buffer length
> > 3) probably a new feature so the driver can only enable the validation
> > when the feature is enabled.
>
> I'm not sure I would consider a F_DEV_Y_FIXED_BUG_X a perfectly good
> feature. Frankly the set of such bugs is device implementation
> specific and it makes little sense to specify a feature bit
> that says the device implementation claims to adhere to some
> aspect of the specification. Also what would be the semantic
> of not negotiating F_DEV_Y_FIXED_BUG_X?

Yes, I agree. Rethinking the feature bit, it seems unnecessary,
especially considering the driver should not care about the used
length for tx.

>
> On the other hand I see no other way to keep the validation
> permanently enabled for fixed implementations, and get around the problem
> with broken implementations. So we could have something like
> VHOST_USED_LEN_STRICT.

It's more about a choice of the driver's knowledge. For vsock TX it
should be fine. If we introduce a parameter and disable it by default,
it won't be very useful.

>
> Maybe, we can also think of 'warn and don't alter behavior' instead of
> 'warn' and alter behavior. Or maybe even not having such checks on in
> production, but only when testing.

I think there's an agreement that virtio drivers need more hardening,
that's why a lot of patches were merged. Especially considering the
new requirements came from confidential computing, smart NIC and
VDUSE. For virtio drivers, enabling the validation may help to

1) protect the driver from the buggy and malicious device
2) uncover the bugs of the devices (as vsock did, and probably rpmsg)
3) force having a smart driver that can do the validation itself,
then we can finally remove the validation in the core

So I'd like to keep it enabled.

Thanks

>
> Regards,
> Halil
>



Re: [PATCH 1/2] vdpa: Add support for querying statistics

2021-11-22 Thread Jason Wang
On Mon, Nov 22, 2021 at 11:56 PM Parav Pandit  wrote:
>
>
>
> > From: Eli Cohen 
> > Sent: Monday, November 22, 2021 8:37 PM
> >
> > On Mon, Nov 22, 2021 at 12:15:36PM +0200, Parav Pandit wrote:
> > >
> > >
> > > > From: Jason Wang 
> > > > Sent: Monday, November 22, 2021 3:02 PM
> > > >
> > > > > If we go with vendor stats, how can we communicate the information
> > > > > > to userspace? Currently we use netlink attributes defined to pass
> > > > > this information.
> > > >
> > > > It can be done exactly as what has been done in the patch, we can
> > > > document it as vendor stats.
> > > >
> > > Yes, attribute to have VENDOR_ prefix in it.
> > > >
> > > > Ok, I think I get you. So I wonder if it's more useful to use device
> > > > specific counters. For networking, it could be packets sent/received 
> > > > etc.
> > >
> > > Yes, I equally discussed this previously with Eli as it's more meaningful 
> > > for end
> > users.
> > > We just return the device id of it along with queue number that helps to 
> > > show
> > tx and rx.
> > > For ctrl q, it is just ctrl commands and ctrl completions.
> >
> > I don't think we should mix send/receive packets for descriptor 
> > statistics. The
> > hardware could process a descriptor and still not transmit any packet.
> >
> > We can add packets send/recv but descriptor statistics have their own value.
> >
> Oh right. I read Jason's comment of _packets_ too fast. I meant to say 
> send/receive descriptors.
> I guess you already named them as tx and rx. Didn't review the patches in 
> this series yet.
>
> > To summarize, I can add the VENDOR_ prefix to the attributes and re-send or 
> > is
> > there anything else you think should change?
> VENDOR_ prefix and "vstats" as the iproute2 command look fine to me.

Ack, but we need to figure out:

1) use descriptors or buffers.
2) if we use descriptors, for indirect descriptors and descriptor
chains how are they counted?

Thanks

>



Re: [PATCH V5 1/4] virtio_ring: validate used buffer length

2021-11-22 Thread Jason Wang
On Mon, Nov 22, 2021 at 9:50 PM Halil Pasic  wrote:
>
> On Mon, 22 Nov 2021 14:25:26 +0800
> Jason Wang  wrote:
>
> > On Mon, Nov 22, 2021 at 1:49 PM Halil Pasic  wrote:
> > >
> > > On Mon, 22 Nov 2021 06:35:18 +0100
> > > Halil Pasic  wrote:
> > >
> > > > > I think it should be a common issue, looking at
> > > > > vhost_vsock_handle_tx_kick(), it did:
> > > > >
> > > > > len += sizeof(pkt->hdr);
> > > > > vhost_add_used(vq, head, len);
> > > > >
> > > > > which looks like a violation of the spec since it's TX.
> > > >
> > > > I'm not sure the lines above look like a violation of the spec. If you
> > > > examine vhost_vsock_alloc_pkt() I believe that you will agree that:
> > > > len == pkt->len == pkt->hdr.len
> > > > which makes sense since according to the spec both tx and rx messages
> > > > are hdr+payload. And I believe hdr.len is the size of the payload,
> > > > although that does not seem to be properly documented by the spec.
> >
> > Sorry for being unclear, what I meant is that we probably should use
> > zero here. TX doesn't use in buffer actually.
> >
> > According to the spec, 0 should be the used length:
> >
> > "and len the total of bytes written into the buffer."
>
> Right, I was wrong. I somehow assumed this is the total length and not
> just the number of bytes written.
>
> >
> > > >
> > > > On the other hand tx messages are stated to be device read-only (in the
> > > > spec) so if the device writes stuff, that is certainly wrong.
> > > >
> >
> > Yes.
> >
> > > > If that is what happens.
> > > >
> > > > Looking at virtqueue_get_buf_ctx_split() I'm not sure that is what
> > > > happens. My hypothesis is that we just assume the last descriptor is an
> > > > 'in' type descriptor (i.e. a device writable one). For tx that assumption
> > > > would be wrong.
> > > >
> > > > I will have another look at this today and send a fix patch if my
> > > > suspicion is confirmed.
>
> Yeah, I didn't remember the semantic of
> vq->split.vring.used->ring[last_used].len
> correctly, and in fact also how exactly the rings work. So your objection
> is correct.
>
> Maybe updating some stuff would make it easier to not make this mistake.
>
> For example the spec and also the linux header says:
>
> /* le32 is used here for ids for padding reasons. */
> struct virtq_used_elem {
> /* Index of start of used descriptor chain. */
> le32 id;
> /* Total length of the descriptor chain which was used (written to) */
> le32 len;
> };
>
> I think that comment isn't as clear as it could be. I would prefer:
> /* The number of bytes written into the device writable portion of the
> buffer described by the descriptor chain. */
>
> I believe "the descriptor chain which was used" includes both the
> descriptors that map the device read only and the device write
> only portions of the buffer described by the descriptor chain. And the
> total length of that descriptor chain may be defined either as a number
> of the descriptors that form the chain, or the length of the buffer.
>
> One has to use the descriptor chain even if the whole buffer is device
> read only. So "used" == "written to" does not make any sense to me.

Not a native speaker but if others are fine I'm ok with this tweak on
the comment.

>
> Also something like
> int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int 
> bytes_written)
> instead of
> int vhost_add_used(struct vhost_virtqueue *vq, unsigned int head, int len)
> would make it easier to read the code correctly.

Or maybe a comment to explain the len.

Thanks

>
> > >
> > > If my suspicion is right something like:
> > >
> > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > index 00f64f2f8b72..efb57898920b 100644
> > > --- a/drivers/virtio/virtio_ring.c
> > > +++ b/drivers/virtio/virtio_ring.c
> > > @@ -764,6 +764,7 @@ static void *virtqueue_get_buf_ctx_split(struct 
> > > virtqueue *_vq,
> > > struct vring_virtqueue *vq = to_vvq(_vq);
> > > void *ret;
> > > unsigned int i;
> > > +   bool has_in;
> > > u16 last_used;
> > >
> > > START_USE(vq);
> > > @@ -787,6 +788,9 @@ static void *virtqueue_get_buf_ctx_split(struct 
> > > virtqueue *_vq,
> > > vq->split.vring.used->ring[last_used].id);
> > > *len = virtio32_to_cpu(_vq->vdev,
> > > vq->split.vring.used->ring[last_used].len);
> > > +   has_in = virtio16_to_cpu(_vq->vdev,
> > > +   vq->split.vring.used->ring[last_used].flags)
> > > +   & VRING_DESC_F_WRITE;
> >
> > Did you mean vring.desc actually? If yes, it's better not depend on
> > the descriptor ring which can be modified by the device. We've stored
> > the flags in desc_extra[].
> >
> > >
> > > if (unlikely(i >= vq->split.vring.num)) {
> > > BAD_RING(vq, "id %u out of range\n", i);
> > > @@ -796,7 +800,7 @@ static void *virtqueue_get_buf_ctx_split(struct 
> > > vir

Re: [PATCH 02/29] dm: make the DAX support dependend on CONFIG_FS_DAX

2021-11-22 Thread Dan Williams
On Thu, Nov 18, 2021 at 10:55 PM Christoph Hellwig  wrote:
>
> On Wed, Nov 17, 2021 at 09:23:44AM -0800, Dan Williams wrote:
> > Applied, fixed the spelling of 'dependent' in the subject and picked
> > up Mike's Ack from the previous send:
> >
> > https://lore.kernel.org/r/yyasbvuorceds...@redhat.com
> >
> > Christoph, any particular reason you did not pick up the tags from the
> > last posting?
>
> I thought I did, but apparently I've missed some.

I'll reply with the ones I see missing that need carrying over and add
my own reviewed-by then you can send me a pull request when ready,
deal?


Re: [PATCH 03/29] dax: remove CONFIG_DAX_DRIVER

2021-11-22 Thread Dan Williams
On Wed, Nov 17, 2021 at 9:43 AM Dan Williams  wrote:
>
> On Tue, Nov 9, 2021 at 12:33 AM Christoph Hellwig  wrote:
> >
> > CONFIG_DAX_DRIVER only selects CONFIG_DAX now, so remove it.
>
> Applied.

Unapplied,

Reviewed-by: Dan Williams 


RE: [PATCH] vdpa_sim: avoid putting an uninitialized iova_domain

2021-11-22 Thread Parav Pandit via Virtualization



> From: Longpeng(Mike) 
> Sent: Monday, November 22, 2021 5:52 PM
> 
> From: Longpeng 
> 
> The system will crash if we put an uninitialized iova_domain, this could
> happen when an error occurs before initializing the iova_domain in
> vdpasim_create().
> 
> BUG: kernel NULL pointer dereference, address:  ...
> RIP: 0010:__cpuhp_state_remove_instance+0x96/0x1c0
> ...
> Call Trace:
>  
>  put_iova_domain+0x29/0x220
>  vdpasim_free+0xd1/0x120 [vdpa_sim]
>  vdpa_release_dev+0x21/0x40 [vdpa]
>  device_release+0x33/0x90
>  kobject_release+0x63/0x160
>  vdpasim_create+0x127/0x2a0 [vdpa_sim]
>  vdpasim_net_dev_add+0x7d/0xfe [vdpa_sim_net]
>  vdpa_nl_cmd_dev_add_set_doit+0xe1/0x1a0 [vdpa]
>  genl_family_rcv_msg_doit+0x112/0x140
>  genl_rcv_msg+0xdf/0x1d0
>  ...
> 
> So we must make sure the iova_domain is already initialized before put it.
> 
> In addition, we may get the following warning in this case:
> WARNING: ... drivers/iommu/iova.c:344 iova_cache_put+0x58/0x70
> 
> So we must make sure the iova_cache_put() is invoked only if the
> iova_cache_get() is already invoked. Let's fix it together.
> 
> Signed-off-by: Longpeng 

Can you please add the fixes tag here so that older kernels can take this fix?


Re: [PATCH 04/29] dax: simplify the dax_device <-> gendisk association

2021-11-22 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:33 AM Christoph Hellwig  wrote:
>
> Replace the dax_host_hash with an xarray indexed by the pointer value
> of the gendisk, and require explicitly calls from the block drivers that
> want to associate their gendisk with a dax_device.
>
> Signed-off-by: Christoph Hellwig 
> Acked-by: Mike Snitzer 
> ---
>  drivers/dax/bus.c|   6 +-
>  drivers/dax/super.c  | 106 +--
>  drivers/md/dm.c  |   6 +-
>  drivers/nvdimm/pmem.c|   8 ++-
>  drivers/s390/block/dcssblk.c |  11 +++-
>  fs/fuse/virtio_fs.c  |   2 +-
>  include/linux/dax.h  |  19 +--
>  7 files changed, 62 insertions(+), 96 deletions(-)
>
> diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
> index 6cc4da4c713d9..bd7af2f7c5b0a 100644
> --- a/drivers/dax/bus.c
> +++ b/drivers/dax/bus.c
> @@ -1323,10 +1323,10 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
> }
>
> /*
> -* No 'host' or dax_operations since there is no access to this
> -* device outside of mmap of the resulting character device.
> +* No dax_operations since there is no access to this device outside of
> +* mmap of the resulting character device.
>  */
> -   dax_dev = alloc_dax(dev_dax, NULL, NULL, DAXDEV_F_SYNC);
> +   dax_dev = alloc_dax(dev_dax, NULL, DAXDEV_F_SYNC);
> if (IS_ERR(dax_dev)) {
> rc = PTR_ERR(dax_dev);
> goto err_alloc_dax;
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index e20d0cef10a18..9383c11b21853 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -7,10 +7,8 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include 
> @@ -26,10 +24,8 @@
>   * @flags: state and boolean properties
>   */
>  struct dax_device {
> -   struct hlist_node list;
> struct inode inode;
> struct cdev cdev;
> -   const char *host;
> void *private;
> unsigned long flags;
> const struct dax_operations *ops;
> @@ -42,10 +38,6 @@ static DEFINE_IDA(dax_minor_ida);
>  static struct kmem_cache *dax_cache __read_mostly;
>  static struct super_block *dax_superblock __read_mostly;
>
> -#define DAX_HASH_SIZE (PAGE_SIZE / sizeof(struct hlist_head))
> -static struct hlist_head dax_host_list[DAX_HASH_SIZE];
> -static DEFINE_SPINLOCK(dax_host_lock);
> -
>  int dax_read_lock(void)
>  {
> return srcu_read_lock(&dax_srcu);
> @@ -58,13 +50,22 @@ void dax_read_unlock(int id)
>  }
>  EXPORT_SYMBOL_GPL(dax_read_unlock);
>
> -static int dax_host_hash(const char *host)
> +#if defined(CONFIG_BLOCK) && defined(CONFIG_FS_DAX)
> +#include 
> +
> +static DEFINE_XARRAY(dax_hosts);
> +
> +int dax_add_host(struct dax_device *dax_dev, struct gendisk *disk)
>  {
> -   return hashlen_hash(hashlen_string("DAX", host)) % DAX_HASH_SIZE;
> +   return xa_insert(&dax_hosts, (unsigned long)disk, dax_dev, GFP_KERNEL);
>  }
> +EXPORT_SYMBOL_GPL(dax_add_host);

Is it time to add a "DAX" symbol namespace?

>
> -#if defined(CONFIG_BLOCK) && defined(CONFIG_FS_DAX)
> -#include 
> +void dax_remove_host(struct gendisk *disk)
> +{
> +   xa_erase(&dax_hosts, (unsigned long)disk);
> +}
> +EXPORT_SYMBOL_GPL(dax_remove_host);
>
>  int bdev_dax_pgoff(struct block_device *bdev, sector_t sector, size_t size,
> pgoff_t *pgoff)
> @@ -82,40 +83,23 @@ EXPORT_SYMBOL(bdev_dax_pgoff);
>
>  /**
>   * dax_get_by_host() - temporary lookup mechanism for filesystem-dax
> - * @host: alternate name for the device registered by a dax driver
> + * @bdev: block device to find a dax_device for
>   */
> -static struct dax_device *dax_get_by_host(const char *host)
> +struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
>  {
> -   struct dax_device *dax_dev, *found = NULL;
> -   int hash, id;
> +   struct dax_device *dax_dev;
> +   int id;
>
> -   if (!host)
> +   if (!blk_queue_dax(bdev->bd_disk->queue))
> return NULL;
>
> -   hash = dax_host_hash(host);
> -
> id = dax_read_lock();
> -   spin_lock(&dax_host_lock);
> -   hlist_for_each_entry(dax_dev, &dax_host_list[hash], list) {
> -   if (!dax_alive(dax_dev)
> -   || strcmp(host, dax_dev->host) != 0)
> -   continue;
> -
> -   if (igrab(&dax_dev->inode))
> -   found = dax_dev;
> -   break;
> -   }
> -   spin_unlock(&dax_host_lock);
> +   dax_dev = xa_load(&dax_hosts, (unsigned long)bdev->bd_disk);
> +   if (!dax_dev || !dax_alive(dax_dev) || !igrab(&dax_dev->inode))
> +   dax_dev = NULL;
> dax_read_unlock(id);
>
> -   return found;
> -}
> -
> -struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
> -{
> -   if (!blk_queue_dax(bdev->bd_disk->queue))
> -   return 

Re: [PATCH 05/29] dax: remove the pgmap sanity checks in generic_fsdax_supported

2021-11-22 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:33 AM Christoph Hellwig  wrote:
>
> Drivers that register a dax_dev should make sure it works, no need
> to double check from the file system.

Reviewed-by: Dan Williams 

...with a self-reminder to migrate this validation to a unit test to
backstop any future refactoring of the memmap reservation code.


Re: [PATCH 06/29] dax: move the partition alignment check into fs_dax_get_by_bdev

2021-11-22 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:33 AM Christoph Hellwig  wrote:
>
> fs_dax_get_by_bdev is the primary interface to find a dax device for a
> block device, so move the partition alignment check there instead of
> wiring it up through ->dax_supported.
>

Reviewed-by: Dan Williams 


Re: [PATCH 07/29] xfs: factor out a xfs_setup_dax_always helper

2021-11-22 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:33 AM Christoph Hellwig  wrote:
>
> Factor out another DAX setup helper to simplify future changes.  Also
> move the experimental warning after the checks to not clutter the log
> too much if the setup failed.
>
> Signed-off-by: Christoph Hellwig 

Reviewed-by: Dan Williams 


Re: [PATCH 08/29] dax: remove dax_capable

2021-11-22 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:33 AM Christoph Hellwig  wrote:
>
> Just open code the block size and dax_dev == NULL checks in the callers.
>
> Signed-off-by: Christoph Hellwig 
> Acked-by: Mike Snitzer 

You dropped Gao Xiang's reviewed-by:

https://lore.kernel.org/r/YW1nrxmaw5CfFXBb@B-P7TQMD6M-0146.local

...and Darrick's

https://lore.kernel.org/r/20211019154447.GL24282@magnolia

...which had a few more review comments below, otherwise you can also add:

Reviewed-by: Dan Williams 


> ---
>  drivers/dax/super.c  | 36 
>  drivers/md/dm-table.c| 22 +++---
>  drivers/md/dm.c  | 21 -
>  drivers/md/dm.h  |  4 
>  drivers/nvdimm/pmem.c|  1 -
>  drivers/s390/block/dcssblk.c |  1 -
>  fs/erofs/super.c | 11 +++
>  fs/ext2/super.c  |  6 --
>  fs/ext4/super.c  |  9 ++---
>  fs/xfs/xfs_super.c   | 21 -
>  include/linux/dax.h  | 14 --
>  11 files changed, 36 insertions(+), 110 deletions(-)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index 482fe775324a4..803942586d1b6 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -108,42 +108,6 @@ struct dax_device *fs_dax_get_by_bdev(struct block_device *bdev)
> return dax_dev;
>  }
>  EXPORT_SYMBOL_GPL(fs_dax_get_by_bdev);
> -
> -bool generic_fsdax_supported(struct dax_device *dax_dev,
> -   struct block_device *bdev, int blocksize, sector_t start,
> -   sector_t sectors)
> -{
> -   if (blocksize != PAGE_SIZE) {
> -   pr_info("%pg: error: unsupported blocksize for dax\n", bdev);
> -   return false;
> -   }
> -
> -   if (!dax_dev) {
> -   pr_debug("%pg: error: dax unsupported by block device\n", bdev);
> -   return false;
> -   }
> -
> -   return true;
> -}
> -EXPORT_SYMBOL_GPL(generic_fsdax_supported);
> -
> -bool dax_supported(struct dax_device *dax_dev, struct block_device *bdev,
> -   int blocksize, sector_t start, sector_t len)
> -{
> -   bool ret = false;
> -   int id;
> -
> -   if (!dax_dev)
> -   return false;
> -
> -   id = dax_read_lock();
> -   if (dax_alive(dax_dev) && dax_dev->ops->dax_supported)
> -   ret = dax_dev->ops->dax_supported(dax_dev, bdev, blocksize,
> - start, len);
> -   dax_read_unlock(id);
> -   return ret;
> -}
> -EXPORT_SYMBOL_GPL(dax_supported);
>  #endif /* CONFIG_BLOCK && CONFIG_FS_DAX */
>
>  enum dax_device_flags {
> diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
> index bcddc5effd155..f4915a7d5dc84 100644
> --- a/drivers/md/dm-table.c
> +++ b/drivers/md/dm-table.c
> @@ -806,12 +806,14 @@ void dm_table_set_type(struct dm_table *t, enum dm_queue_mode type)
>  EXPORT_SYMBOL_GPL(dm_table_set_type);
>
>  /* validate the dax capability of the target device span */
> -int device_not_dax_capable(struct dm_target *ti, struct dm_dev *dev,
> +static int device_not_dax_capable(struct dm_target *ti, struct dm_dev *dev,
> sector_t start, sector_t len, void *data)
>  {
> -   int blocksize = *(int *) data;
> +   if (dev->dax_dev)
> +   return false;
>
> -   return !dax_supported(dev->dax_dev, dev->bdev, blocksize, start, len);
> +   DMDEBUG("%pg: error: dax unsupported by block device", dev->bdev);
> +   return true;
>  }
>
>  /* Check devices support synchronous DAX */
> @@ -821,8 +823,8 @@ static int device_not_dax_synchronous_capable(struct dm_target *ti, struct dm_de
> return !dev->dax_dev || !dax_synchronous(dev->dax_dev);
>  }
>
> -bool dm_table_supports_dax(struct dm_table *t,
> -  iterate_devices_callout_fn iterate_fn, int *blocksize)
> +static bool dm_table_supports_dax(struct dm_table *t,
> +  iterate_devices_callout_fn iterate_fn)
>  {
> struct dm_target *ti;
> unsigned i;
> @@ -835,7 +837,7 @@ bool dm_table_supports_dax(struct dm_table *t,
> return false;
>
> if (!ti->type->iterate_devices ||
> -   ti->type->iterate_devices(ti, iterate_fn, blocksize))
> +   ti->type->iterate_devices(ti, iterate_fn, NULL))
> return false;
> }
>
> @@ -862,7 +864,6 @@ static int dm_table_determine_type(struct dm_table *t)
> struct dm_target *tgt;
> struct list_head *devices = dm_table_get_devices(t);
> enum dm_queue_mode live_md_type = dm_get_md_type(t->md);
> -   int page_size = PAGE_SIZE;
>
> if (t->type != DM_TYPE_NONE) {
> /* target already set the table's type */
> @@ -906,7 +907,7 @@ static int dm_table_determine_type(struct dm_table *t)
>  verify_bio_based:
> /* We must use this tab

Re: [PATCH 09/29] dm-linear: add a linear_dax_pgoff helper

2021-11-22 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:33 AM Christoph Hellwig  wrote:
>
> Add a helper to perform the entire remapping for DAX accesses.  This
> helper open codes bdev_dax_pgoff given that the alignment checks have
> already been done by the submitting file system and don't need to be
> repeated.
>
> Signed-off-by: Christoph Hellwig 
> Acked-by: Mike Snitzer 

Reviewed-by: Dan Williams 


Re: [PATCH 10/29] dm-log-writes: add a log_writes_dax_pgoff helper

2021-11-22 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:33 AM Christoph Hellwig  wrote:
>
> Add a helper to perform the entire remapping for DAX accesses.  This
> helper open codes bdev_dax_pgoff given that the alignment checks have
> already been done by the submitting file system and don't need to be
> repeated.
>
> Signed-off-by: Christoph Hellwig 
> Acked-by: Mike Snitzer 

Reviewed-by: Dan Williams 


Re: [PATCH] vdpa_sim: avoid putting an uninitialized iova_domain

2021-11-22 Thread Jason Wang
On Tue, Nov 23, 2021 at 11:12 AM Parav Pandit  wrote:
>
>
>
> > From: Longpeng(Mike) 
> > Sent: Monday, November 22, 2021 5:52 PM
> >
> > From: Longpeng 
> >
> > The system will crash if we put an uninitialized iova_domain, this could
> > happen when an error occurs before initializing the iova_domain in
> > vdpasim_create().
> >
> > BUG: kernel NULL pointer dereference, address:  ...
> > RIP: 0010:__cpuhp_state_remove_instance+0x96/0x1c0
> > ...
> > Call Trace:
> >  
> >  put_iova_domain+0x29/0x220
> >  vdpasim_free+0xd1/0x120 [vdpa_sim]
> >  vdpa_release_dev+0x21/0x40 [vdpa]
> >  device_release+0x33/0x90
> >  kobject_release+0x63/0x160
> >  vdpasim_create+0x127/0x2a0 [vdpa_sim]
> >  vdpasim_net_dev_add+0x7d/0xfe [vdpa_sim_net]
> >  vdpa_nl_cmd_dev_add_set_doit+0xe1/0x1a0 [vdpa]
> >  genl_family_rcv_msg_doit+0x112/0x140
> >  genl_rcv_msg+0xdf/0x1d0
> >  ...
> >
> > So we must make sure the iova_domain is already initialized before putting it.
> >
> > In addition, we may get the following warning in this case:
> > WARNING: ... drivers/iommu/iova.c:344 iova_cache_put+0x58/0x70
> >
> > So we must make sure iova_cache_put() is invoked only after
> > iova_cache_get() has been invoked. Let's fix it together.
> >
> > Signed-off-by: Longpeng 
>
> Can you please add the fixes tag here so that older kernels can take this fix?
>

I guess it's 4080fc106750 ("vdpa_sim: use iova module to allocate IOVA
addresses")

Thanks



Re: [PATCH 11/29] dm-stripe: add a stripe_dax_pgoff helper

2021-11-22 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:33 AM Christoph Hellwig  wrote:
>
> Add a helper to perform the entire remapping for DAX accesses.  This
> helper open codes bdev_dax_pgoff given that the alignment checks have
> already been done by the submitting file system and don't need to be
> repeated.
>
> Signed-off-by: Christoph Hellwig 
> Acked-by: Mike Snitzer 

Reviewed-by: Dan Williams 


Re: [PATCH 12/29] fsdax: remove a pointless __force cast in copy_cow_page_dax

2021-11-22 Thread Dan Williams
On Tue, Nov 9, 2021 at 12:34 AM Christoph Hellwig  wrote:
>
> Despite its name, copy_user_page expects kernel addresses, which is what
> we already have.

Yup,

Reviewed-by: Dan Williams 


Re: [PATCH 02/29] dm: make the DAX support dependend on CONFIG_FS_DAX

2021-11-22 Thread Christoph Hellwig
On Mon, Nov 22, 2021 at 06:54:09PM -0800, Dan Williams wrote:
> On Thu, Nov 18, 2021 at 10:55 PM Christoph Hellwig  wrote:
> >
> > On Wed, Nov 17, 2021 at 09:23:44AM -0800, Dan Williams wrote:
> > > Applied, fixed the spelling of 'dependent' in the subject and picked
> > > up Mike's Ack from the previous send:
> > >
> > > https://lore.kernel.org/r/yyasbvuorceds...@redhat.com
> > >
> > > Christoph, any particular reason you did not pick up the tags from the
> > > last posting?
> >
> > I thought I did, but apparently I've missed some.
> 
> I'll reply with the ones I see missing that need carrying over and add
> my own reviewed-by then you can send me a pull request when ready,
> deal?

Ok.


Re: [PATCH 04/29] dax: simplify the dax_device <-> gendisk association

2021-11-22 Thread Christoph Hellwig
On Mon, Nov 22, 2021 at 07:33:06PM -0800, Dan Williams wrote:
> Is it time to add a "DAX" symbol namespace?

What would be the benefit?


Re: [RFC PATCH v5 23/26] util: Add iova_tree_alloc

2021-11-22 Thread Peter Xu
Hi, Eugenio,

Sorry for the super late response.

On Fri, Oct 29, 2021 at 08:35:22PM +0200, Eugenio Pérez wrote:

[...]

> +int iova_tree_alloc(IOVATree *tree, DMAMap *map, hwaddr iova_begin,
> +hwaddr iova_last)
> +{
> +struct IOVATreeAllocArgs args = {
> +.new_size = map->size,
> +.iova_begin = iova_begin,
> +.iova_last = iova_last,
> +};
> +
> +if (iova_begin == 0) {
> +/* Some devices do not like addr 0 */
> +iova_begin += qemu_real_host_page_size;
> +}

Any explanation of why zero is not welcomed?

It would be great if we could move this out of iova-tree.c, because that
doesn't look like a good place to reference qemu_real_host_page_size anyway.
The caller can simply pass qemu_real_host_page_size as iova_begin when
needed (and I highly doubt it'll be a must for all iova-tree users).

> +
> +assert(iova_begin < iova_last);
> +
> +/*
> + * Find a valid hole for the mapping
> + *
> + * Assuming low iova_begin, so no need to do a binary search to
> + * locate the first node.
> + *
> + * TODO: We can improve the search speed if we save the beginning and the
> + * end of holes, so we don't iterate over the previous saved ones.
> + *
> + * TODO: Replace all this with g_tree_node_first/next/last when available
> + * (from glib since 2.68). To do it with g_tree_foreach complicates the
> + * code a lot.
> + *
> + */
> +g_tree_foreach(tree->tree, iova_tree_alloc_traverse, &args);
> +if (!iova_tree_alloc_map_in_hole(&args)) {
> +/*
> + * 2nd try: Last iteration left args->right as the last DMAMap. But
> + * (right, end) hole needs to be checked too
> + */
> +iova_tree_alloc_args_iterate(&args, NULL);
> +if (!iova_tree_alloc_map_in_hole(&args)) {
> +return IOVA_ERR_NOMEM;
> +}
> +}
> +
> +map->iova = MAX(iova_begin,
> +args.hole_left ?
> +args.hole_left->iova + args.hole_left->size + 1 : 0);
> +return iova_tree_insert(tree, map);
> +}

Re the algorithm - I totally agree Jason's version is much better.

Thanks,

-- 
Peter Xu


Re: Kvm virtual machine uses virtio graphics card, the rotating screen is stuck

2021-11-22 Thread Jason Wang
On Tue, Nov 23, 2021 at 3:00 PM 苟浩  wrote:
>
> Hello,
>
> I use `xrandr -o left` to rotate the screen in the kvm virtual machine.
> When configured with a Virtio graphics card, the screen freezes after
> rotating, and the keyboard and mouse stop responding.
> When configured with a VGA graphics card, everything works normally
> after rotating the screen.
>
> Does the Virtio graphics card not support rotation?

Adding list and Gerd for the answer.

Thanks

>
>
> Thanks!
