Re: [PATCH] tools/virtio: Use the __GFP_ZERO flag of kmalloc to complete the memory initialization.

2024-06-05 Thread Jason Wang
On Wed, Jun 5, 2024 at 9:56 PM cuitao  wrote:
>
> Use the __GFP_ZERO flag of kmalloc to initialize memory while allocating it,
> without the need for an additional memset call.
>
> Signed-off-by: cuitao 
> ---
>  tools/virtio/linux/kernel.h | 5 +
>  1 file changed, 1 insertion(+), 4 deletions(-)
>
> diff --git a/tools/virtio/linux/kernel.h b/tools/virtio/linux/kernel.h
> index 6702008f7f5c..9e401fb7c215 100644
> --- a/tools/virtio/linux/kernel.h
> +++ b/tools/virtio/linux/kernel.h
> @@ -66,10 +66,7 @@ static inline void *kmalloc_array(unsigned n, size_t s, 
> gfp_t gfp)
>
>  static inline void *kzalloc(size_t s, gfp_t gfp)
>  {
> -   void *p = kmalloc(s, gfp);
> -
> -   memset(p, 0, s);
> -   return p;
> +   return kmalloc(s, gfp | __GFP_ZERO);
>  }
>
>  static inline void *alloc_pages_exact(size_t s, gfp_t gfp)
> --
> 2.25.1
>

Does this really work?

extern void *__kmalloc_fake, *__kfree_ignore_start, *__kfree_ignore_end;
static inline void *kmalloc(size_t s, gfp_t gfp)
{
  if (__kmalloc_fake)
return __kmalloc_fake;
return malloc(s);
}

Thanks




Re: [PATCH net-next V2] virtio-net: synchronize operstate with admin state on up/down

2024-06-05 Thread Jason Wang
On Fri, May 31, 2024 at 8:18 AM Jason Wang  wrote:
>
> On Thu, May 30, 2024 at 9:09 PM Michael S. Tsirkin  wrote:
> >
> > On Thu, May 30, 2024 at 06:29:51PM +0800, Jason Wang wrote:
> > > On Thu, May 30, 2024 at 2:10 PM Michael S. Tsirkin  
> > > wrote:
> > > >
> > > > On Thu, May 30, 2024 at 11:20:55AM +0800, Jason Wang wrote:
> > > > > This patch synchronize operstate with admin state per RFC2863.
> > > > >
> > > > > This is done by trying to toggle the carrier upon open/close and
> > > > > synchronize with the config change work. This allows propagate status
> > > > > correctly to stacked devices like:
> > > > >
> > > > > ip link add link enp0s3 macvlan0 type macvlan
> > > > > ip link set link enp0s3 down
> > > > > ip link show
> > > > >
> > > > > Before this patch:
> > > > >
> > > > > 3: enp0s3:  mtu 1500 qdisc pfifo_fast state DOWN 
> > > > > mode DEFAULT group default qlen 1000
> > > > > link/ether 00:00:05:00:00:09 brd ff:ff:ff:ff:ff:ff
> > > > > ..
> > > > > 5: macvlan0@enp0s3:  mtu 1500 
> > > > > qdisc noqueue state UP mode DEFAULT group default qlen 1000
> > > > > link/ether b2:a9:c5:04:da:53 brd ff:ff:ff:ff:ff:ff
> > > > >
> > > > > After this patch:
> > > > >
> > > > > 3: enp0s3:  mtu 1500 qdisc pfifo_fast state DOWN 
> > > > > mode DEFAULT group default qlen 1000
> > > > > link/ether 00:00:05:00:00:09 brd ff:ff:ff:ff:ff:ff
> > > > > ...
> > > > > 5: macvlan0@enp0s3:  mtu 
> > > > > 1500 qdisc noqueue state LOWERLAYERDOWN mode DEFAULT group default 
> > > > > qlen 1000
> > > > > link/ether b2:a9:c5:04:da:53 brd ff:ff:ff:ff:ff:ff
> > > > >
> > > > > Cc: Venkat Venkatsubra 
> > > > > Cc: Gia-Khanh Nguyen 
> > > > > Reviewed-by: Xuan Zhuo 
> > > > > Acked-by: Michael S. Tsirkin 
> > > > > Signed-off-by: Jason Wang 
> > > > > ---
> > > > > Changes since V1:
> > > > > - rebase
> > > > > - add ack/review tags
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > > ---
> > > > >  drivers/net/virtio_net.c | 94 
> > > > > +++-
> > > > >  1 file changed, 63 insertions(+), 31 deletions(-)
> > > > >
> > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > > index 4a802c0ea2cb..69e4ae353c51 100644
> > > > > --- a/drivers/net/virtio_net.c
> > > > > +++ b/drivers/net/virtio_net.c
> > > > > @@ -433,6 +433,12 @@ struct virtnet_info {
> > > > >   /* The lock to synchronize the access to refill_enabled */
> > > > >   spinlock_t refill_lock;
> > > > >
> > > > > + /* Is config change enabled? */
> > > > > + bool config_change_enabled;
> > > > > +
> > > > > + /* The lock to synchronize the access to config_change_enabled 
> > > > > */
> > > > > + spinlock_t config_change_lock;
> > > > > +
> > > > >   /* Work struct for config space updates */
> > > > >   struct work_struct config_work;
> > > > >
> > > >
> > > >
> > > > But we already have dev->config_lock and dev->config_enabled.
> > > >
> > > > And it actually works better - instead of discarding config
> > > > change events it defers them until enabled.
> > > >
> > >
> > > Yes but then both virtio-net driver and virtio core can ask to enable
> > > and disable and then we need some kind of synchronization which is
> > > non-trivial.
> >
> > Well for core it happens on bring up path before driver works
> > and later on tear down after it is gone.
> > So I do not think they ever do it at the same time.
>
> For example, there could be a suspend/resume when the admin state is down.
>
> >
> >
> > > And device enabling on the core is different from bringing the device
> > > up in the networking subsystem. Here we just delay to deal with the
> > > config change interrupt on ndo_open(). (E.g try to ack announce is
> > > meaningless when the device is down).
> > >
> > > Thanks
> >
> > another thing is that it is better not to re-read all config
> > on link up if there was no config interrupt - less vm exits.
>
> Yes, but it should not matter much as it's done in the ndo_open().

Michael, any more comments on this?

Please confirm whether this patch is OK. If you prefer to reuse
config_disable(), I can change it from a boolean to a counter that
allows nesting.

Thanks

>
> Thanks
>
> >
> > --
> > MST
> >




Re: [PATCH net-next V2] virtio-net: synchronize operstate with admin state on up/down

2024-05-30 Thread Jason Wang
On Thu, May 30, 2024 at 9:09 PM Michael S. Tsirkin  wrote:
>
> On Thu, May 30, 2024 at 06:29:51PM +0800, Jason Wang wrote:
> > On Thu, May 30, 2024 at 2:10 PM Michael S. Tsirkin  wrote:
> > >
> > > On Thu, May 30, 2024 at 11:20:55AM +0800, Jason Wang wrote:
> > > > This patch synchronize operstate with admin state per RFC2863.
> > > >
> > > > This is done by trying to toggle the carrier upon open/close and
> > > > synchronize with the config change work. This allows propagate status
> > > > correctly to stacked devices like:
> > > >
> > > > ip link add link enp0s3 macvlan0 type macvlan
> > > > ip link set link enp0s3 down
> > > > ip link show
> > > >
> > > > Before this patch:
> > > >
> > > > 3: enp0s3:  mtu 1500 qdisc pfifo_fast state DOWN 
> > > > mode DEFAULT group default qlen 1000
> > > > link/ether 00:00:05:00:00:09 brd ff:ff:ff:ff:ff:ff
> > > > ..
> > > > 5: macvlan0@enp0s3:  mtu 1500 
> > > > qdisc noqueue state UP mode DEFAULT group default qlen 1000
> > > > link/ether b2:a9:c5:04:da:53 brd ff:ff:ff:ff:ff:ff
> > > >
> > > > After this patch:
> > > >
> > > > 3: enp0s3:  mtu 1500 qdisc pfifo_fast state DOWN 
> > > > mode DEFAULT group default qlen 1000
> > > > link/ether 00:00:05:00:00:09 brd ff:ff:ff:ff:ff:ff
> > > > ...
> > > > 5: macvlan0@enp0s3:  mtu 1500 
> > > > qdisc noqueue state LOWERLAYERDOWN mode DEFAULT group default qlen 1000
> > > > link/ether b2:a9:c5:04:da:53 brd ff:ff:ff:ff:ff:ff
> > > >
> > > > Cc: Venkat Venkatsubra 
> > > > Cc: Gia-Khanh Nguyen 
> > > > Reviewed-by: Xuan Zhuo 
> > > > Acked-by: Michael S. Tsirkin 
> > > > Signed-off-by: Jason Wang 
> > > > ---
> > > > Changes since V1:
> > > > - rebase
> > > > - add ack/review tags
> > >
> > >
> > >
> > >
> > >
> > > > ---
> > > >  drivers/net/virtio_net.c | 94 +++-
> > > >  1 file changed, 63 insertions(+), 31 deletions(-)
> > > >
> > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > index 4a802c0ea2cb..69e4ae353c51 100644
> > > > --- a/drivers/net/virtio_net.c
> > > > +++ b/drivers/net/virtio_net.c
> > > > @@ -433,6 +433,12 @@ struct virtnet_info {
> > > >   /* The lock to synchronize the access to refill_enabled */
> > > >   spinlock_t refill_lock;
> > > >
> > > > + /* Is config change enabled? */
> > > > + bool config_change_enabled;
> > > > +
> > > > + /* The lock to synchronize the access to config_change_enabled */
> > > > + spinlock_t config_change_lock;
> > > > +
> > > >   /* Work struct for config space updates */
> > > >   struct work_struct config_work;
> > > >
> > >
> > >
> > > But we already have dev->config_lock and dev->config_enabled.
> > >
> > > And it actually works better - instead of discarding config
> > > change events it defers them until enabled.
> > >
> >
> > Yes but then both virtio-net driver and virtio core can ask to enable
> > and disable and then we need some kind of synchronization which is
> > non-trivial.
>
> Well for core it happens on bring up path before driver works
> and later on tear down after it is gone.
> So I do not think they ever do it at the same time.

For example, there could be a suspend/resume when the admin state is down.

>
>
> > And device enabling on the core is different from bringing the device
> > up in the networking subsystem. Here we just delay to deal with the
> > config change interrupt on ndo_open(). (E.g try to ack announce is
> > meaningless when the device is down).
> >
> > Thanks
>
> another thing is that it is better not to re-read all config
> on link up if there was no config interrupt - less vm exits.

Yes, but it should not matter much as it's done in the ndo_open().

Thanks

>
> --
> MST
>




Re: [PATCH net-next V2] virtio-net: synchronize operstate with admin state on up/down

2024-05-30 Thread Jason Wang
On Thu, May 30, 2024 at 6:29 PM Jason Wang  wrote:
>
> On Thu, May 30, 2024 at 2:10 PM Michael S. Tsirkin  wrote:
> >
> > On Thu, May 30, 2024 at 11:20:55AM +0800, Jason Wang wrote:
> > > This patch synchronize operstate with admin state per RFC2863.
> > >
> > > This is done by trying to toggle the carrier upon open/close and
> > > synchronize with the config change work. This allows propagate status
> > > correctly to stacked devices like:
> > >
> > > ip link add link enp0s3 macvlan0 type macvlan
> > > ip link set link enp0s3 down
> > > ip link show
> > >
> > > Before this patch:
> > >
> > > 3: enp0s3:  mtu 1500 qdisc pfifo_fast state DOWN 
> > > mode DEFAULT group default qlen 1000
> > > link/ether 00:00:05:00:00:09 brd ff:ff:ff:ff:ff:ff
> > > ..
> > > 5: macvlan0@enp0s3:  mtu 1500 
> > > qdisc noqueue state UP mode DEFAULT group default qlen 1000
> > > link/ether b2:a9:c5:04:da:53 brd ff:ff:ff:ff:ff:ff
> > >
> > > After this patch:
> > >
> > > 3: enp0s3:  mtu 1500 qdisc pfifo_fast state DOWN 
> > > mode DEFAULT group default qlen 1000
> > > link/ether 00:00:05:00:00:09 brd ff:ff:ff:ff:ff:ff
> > > ...
> > > 5: macvlan0@enp0s3:  mtu 1500 
> > > qdisc noqueue state LOWERLAYERDOWN mode DEFAULT group default qlen 1000
> > > link/ether b2:a9:c5:04:da:53 brd ff:ff:ff:ff:ff:ff
> > >
> > > Cc: Venkat Venkatsubra 
> > > Cc: Gia-Khanh Nguyen 
> > > Reviewed-by: Xuan Zhuo 
> > > Acked-by: Michael S. Tsirkin 
> > > Signed-off-by: Jason Wang 
> > > ---
> > > Changes since V1:
> > > - rebase
> > > - add ack/review tags
> >
> >
> >
> >
> >
> > > ---
> > >  drivers/net/virtio_net.c | 94 +++-
> > >  1 file changed, 63 insertions(+), 31 deletions(-)
> > >
> > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > index 4a802c0ea2cb..69e4ae353c51 100644
> > > --- a/drivers/net/virtio_net.c
> > > +++ b/drivers/net/virtio_net.c
> > > @@ -433,6 +433,12 @@ struct virtnet_info {
> > >   /* The lock to synchronize the access to refill_enabled */
> > >   spinlock_t refill_lock;
> > >
> > > + /* Is config change enabled? */
> > > + bool config_change_enabled;
> > > +
> > > + /* The lock to synchronize the access to config_change_enabled */
> > > + spinlock_t config_change_lock;
> > > +
> > >   /* Work struct for config space updates */
> > >   struct work_struct config_work;
> > >
> >
> >
> > But we already have dev->config_lock and dev->config_enabled.
> >
> > And it actually works better - instead of discarding config
> > change events it defers them until enabled.
> >
>
> Yes but then both virtio-net driver and virtio core can ask to enable
> and disable and then we need some kind of synchronization which is
> non-trivial.
>
> And device enabling on the core is different from bringing the device
> up in the networking subsystem. Here we just delay to deal with the
> config change interrupt on ndo_open(). (E.g try to ack announce is
> meaningless when the device is down).

Or maybe you meant to make config_enabled a nested counter?

Thanks

>
> Thanks




Re: [PATCH net-next V2] virtio-net: synchronize operstate with admin state on up/down

2024-05-30 Thread Jason Wang
On Thu, May 30, 2024 at 2:10 PM Michael S. Tsirkin  wrote:
>
> On Thu, May 30, 2024 at 11:20:55AM +0800, Jason Wang wrote:
> > This patch synchronize operstate with admin state per RFC2863.
> >
> > This is done by trying to toggle the carrier upon open/close and
> > synchronize with the config change work. This allows propagate status
> > correctly to stacked devices like:
> >
> > ip link add link enp0s3 macvlan0 type macvlan
> > ip link set link enp0s3 down
> > ip link show
> >
> > Before this patch:
> >
> > 3: enp0s3:  mtu 1500 qdisc pfifo_fast state DOWN mode 
> > DEFAULT group default qlen 1000
> > link/ether 00:00:05:00:00:09 brd ff:ff:ff:ff:ff:ff
> > ..
> > 5: macvlan0@enp0s3:  mtu 1500 qdisc 
> > noqueue state UP mode DEFAULT group default qlen 1000
> > link/ether b2:a9:c5:04:da:53 brd ff:ff:ff:ff:ff:ff
> >
> > After this patch:
> >
> > 3: enp0s3:  mtu 1500 qdisc pfifo_fast state DOWN mode 
> > DEFAULT group default qlen 1000
> > link/ether 00:00:05:00:00:09 brd ff:ff:ff:ff:ff:ff
> > ...
> > 5: macvlan0@enp0s3:  mtu 1500 
> > qdisc noqueue state LOWERLAYERDOWN mode DEFAULT group default qlen 1000
> > link/ether b2:a9:c5:04:da:53 brd ff:ff:ff:ff:ff:ff
> >
> > Cc: Venkat Venkatsubra 
> > Cc: Gia-Khanh Nguyen 
> > Reviewed-by: Xuan Zhuo 
> > Acked-by: Michael S. Tsirkin 
> > Signed-off-by: Jason Wang 
> > ---
> > Changes since V1:
> > - rebase
> > - add ack/review tags
>
>
>
>
>
> > ---
> >  drivers/net/virtio_net.c | 94 +++-
> >  1 file changed, 63 insertions(+), 31 deletions(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 4a802c0ea2cb..69e4ae353c51 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -433,6 +433,12 @@ struct virtnet_info {
> >   /* The lock to synchronize the access to refill_enabled */
> >   spinlock_t refill_lock;
> >
> > + /* Is config change enabled? */
> > + bool config_change_enabled;
> > +
> > + /* The lock to synchronize the access to config_change_enabled */
> > + spinlock_t config_change_lock;
> > +
> >   /* Work struct for config space updates */
> >   struct work_struct config_work;
> >
>
>
> But we already have dev->config_lock and dev->config_enabled.
>
> And it actually works better - instead of discarding config
> change events it defers them until enabled.
>

Yes but then both virtio-net driver and virtio core can ask to enable
and disable and then we need some kind of synchronization which is
non-trivial.

And device enabling on the core is different from bringing the device
up in the networking subsystem. Here we just delay to deal with the
config change interrupt on ndo_open(). (E.g try to ack announce is
meaningless when the device is down).

Thanks




[PATCH net-next V2] virtio-net: synchronize operstate with admin state on up/down

2024-05-29 Thread Jason Wang
This patch synchronizes operstate with admin state per RFC2863.

This is done by trying to toggle the carrier upon open/close and
synchronizing with the config change work. This allows status to be
propagated correctly to stacked devices like:

ip link add link enp0s3 macvlan0 type macvlan
ip link set link enp0s3 down
ip link show

Before this patch:

3: enp0s3:  mtu 1500 qdisc pfifo_fast state DOWN mode 
DEFAULT group default qlen 1000
link/ether 00:00:05:00:00:09 brd ff:ff:ff:ff:ff:ff
..
5: macvlan0@enp0s3:  mtu 1500 qdisc 
noqueue state UP mode DEFAULT group default qlen 1000
link/ether b2:a9:c5:04:da:53 brd ff:ff:ff:ff:ff:ff

After this patch:

3: enp0s3:  mtu 1500 qdisc pfifo_fast state DOWN mode 
DEFAULT group default qlen 1000
link/ether 00:00:05:00:00:09 brd ff:ff:ff:ff:ff:ff
...
5: macvlan0@enp0s3:  mtu 1500 qdisc 
noqueue state LOWERLAYERDOWN mode DEFAULT group default qlen 1000
link/ether b2:a9:c5:04:da:53 brd ff:ff:ff:ff:ff:ff

Cc: Venkat Venkatsubra 
Cc: Gia-Khanh Nguyen 
Reviewed-by: Xuan Zhuo 
Acked-by: Michael S. Tsirkin 
Signed-off-by: Jason Wang 
---
Changes since V1:
- rebase
- add ack/review tags
---
 drivers/net/virtio_net.c | 94 +++-
 1 file changed, 63 insertions(+), 31 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 4a802c0ea2cb..69e4ae353c51 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -433,6 +433,12 @@ struct virtnet_info {
/* The lock to synchronize the access to refill_enabled */
spinlock_t refill_lock;
 
+   /* Is config change enabled? */
+   bool config_change_enabled;
+
+   /* The lock to synchronize the access to config_change_enabled */
+   spinlock_t config_change_lock;
+
/* Work struct for config space updates */
struct work_struct config_work;
 
@@ -623,6 +629,20 @@ static void disable_delayed_refill(struct virtnet_info *vi)
spin_unlock_bh(>refill_lock);
 }
 
+static void enable_config_change(struct virtnet_info *vi)
+{
+   spin_lock_irq(>config_change_lock);
+   vi->config_change_enabled = true;
+   spin_unlock_irq(>config_change_lock);
+}
+
+static void disable_config_change(struct virtnet_info *vi)
+{
+   spin_lock_irq(>config_change_lock);
+   vi->config_change_enabled = false;
+   spin_unlock_irq(>config_change_lock);
+}
+
 static void enable_rx_mode_work(struct virtnet_info *vi)
 {
rtnl_lock();
@@ -2421,6 +2441,25 @@ static int virtnet_enable_queue_pair(struct virtnet_info 
*vi, int qp_index)
return err;
 }
 
+static void virtnet_update_settings(struct virtnet_info *vi)
+{
+   u32 speed;
+   u8 duplex;
+
+   if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_SPEED_DUPLEX))
+   return;
+
+   virtio_cread_le(vi->vdev, struct virtio_net_config, speed, );
+
+   if (ethtool_validate_speed(speed))
+   vi->speed = speed;
+
+   virtio_cread_le(vi->vdev, struct virtio_net_config, duplex, );
+
+   if (ethtool_validate_duplex(duplex))
+   vi->duplex = duplex;
+}
+
 static int virtnet_open(struct net_device *dev)
 {
struct virtnet_info *vi = netdev_priv(dev);
@@ -2439,6 +2478,18 @@ static int virtnet_open(struct net_device *dev)
goto err_enable_qp;
}
 
+   /* Assume link up if device can't report link status,
+  otherwise get link status from config. */
+   netif_carrier_off(dev);
+   if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_STATUS)) {
+   enable_config_change(vi);
+   schedule_work(>config_work);
+   } else {
+   vi->status = VIRTIO_NET_S_LINK_UP;
+   virtnet_update_settings(vi);
+   netif_carrier_on(dev);
+   }
+
return 0;
 
 err_enable_qp:
@@ -2875,12 +2926,19 @@ static int virtnet_close(struct net_device *dev)
disable_delayed_refill(vi);
/* Make sure refill_work doesn't re-enable napi! */
cancel_delayed_work_sync(>refill);
+   /* Make sure config notification doesn't schedule config work */
+   disable_config_change(vi);
+   /* Make sure status updating is cancelled */
+   cancel_work_sync(>config_work);
 
for (i = 0; i < vi->max_queue_pairs; i++) {
virtnet_disable_queue_pair(vi, i);
cancel_work_sync(>rq[i].dim.work);
}
 
+   vi->status &= ~VIRTIO_NET_S_LINK_UP;
+   netif_carrier_off(dev);
+
return 0;
 }
 
@@ -4583,25 +4641,6 @@ static void virtnet_init_settings(struct net_device *dev)
vi->duplex = DUPLEX_UNKNOWN;
 }
 
-static void virtnet_update_settings(struct virtnet_info *vi)
-{
-   u32 speed;
-   u8 duplex;
-
-   if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_SPEED_DUPLEX))
-   return;
-
-   virtio_cread_le(vi->vdev, struct virtio_net_config, speed, );
-

Re: [PATCH V3 1/3] vhost-vdpa: flush workers on suspend

2024-05-21 Thread Jason Wang
On Tue, May 21, 2024 at 9:39 PM Steven Sistare
 wrote:
>
> On 5/20/2024 10:28 PM, Jason Wang wrote:
> > On Mon, May 20, 2024 at 11:21 PM Steve Sistare
> >  wrote:
> >>
> >> Flush to guarantee no workers are running when suspend returns.
> >>
> >> Fixes: f345a0143b4d ("vhost-vdpa: uAPI to suspend the device")
> >> Signed-off-by: Steve Sistare 
> >> Acked-by: Eugenio Pérez 
> >> ---
> >>   drivers/vhost/vdpa.c | 3 +++
> >>   1 file changed, 3 insertions(+)
> >>
> >> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> >> index ba52d128aeb7..189596caaec9 100644
> >> --- a/drivers/vhost/vdpa.c
> >> +++ b/drivers/vhost/vdpa.c
> >> @@ -594,6 +594,7 @@ static long vhost_vdpa_suspend(struct vhost_vdpa *v)
> >>  struct vdpa_device *vdpa = v->vdpa;
> >>  const struct vdpa_config_ops *ops = vdpa->config;
> >>  int ret;
> >> +   struct vhost_dev *vdev = >vdev;
> >>
> >>  if (!(ops->get_status(vdpa) & VIRTIO_CONFIG_S_DRIVER_OK))
> >>  return 0;
> >> @@ -601,6 +602,8 @@ static long vhost_vdpa_suspend(struct vhost_vdpa *v)
> >>  if (!ops->suspend)
> >>  return -EOPNOTSUPP;
> >>
> >> +   vhost_dev_flush(vdev);
> >
> > vhost-vDPA doesn't use workers, see:
> >
> >  vhost_dev_init(dev, vqs, nvqs, 0, 0, 0, false,
> > vhost_vdpa_process_iotlb_msg);
> >
> > So I wonder if this is a must.
>
> True, but I am adding this to be future proof.  I could instead log a warning
> or an error message if vhost_vdpa_suspend is called and 
> v->vdev.use_worker=true,
> but IMO we should just fix it, given that the fix is trivial.

I meant we need to know if it fixes any actual issue or not.

Thanks

>
> - Steve
>
>
>




Re: [PATCH V3 2/3] vduse: suspend

2024-05-21 Thread Jason Wang
On Tue, May 21, 2024 at 9:39 PM Steven Sistare
 wrote:
>
> On 5/20/2024 10:30 PM, Jason Wang wrote:
> > On Mon, May 20, 2024 at 11:21 PM Steve Sistare
> >  wrote:
> >>
> >> Support the suspend operation.  There is little to do, except flush to
> >> guarantee no workers are running when suspend returns.
> >>
> >> Signed-off-by: Steve Sistare 
> >> ---
> >>   drivers/vdpa/vdpa_user/vduse_dev.c | 24 
> >>   1 file changed, 24 insertions(+)
> >>
> >> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> >> b/drivers/vdpa/vdpa_user/vduse_dev.c
> >> index 73c89701fc9d..7dc46f771f12 100644
> >> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> >> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> >> @@ -472,6 +472,18 @@ static void vduse_dev_reset(struct vduse_dev *dev)
> >>  up_write(>rwsem);
> >>   }
> >>
> >> +static void vduse_flush_work(struct vduse_dev *dev)
> >> +{
> >> +   flush_work(>inject);
> >> +
> >> +   for (int i = 0; i < dev->vq_num; i++) {
> >> +   struct vduse_virtqueue *vq = dev->vqs[i];
> >> +
> >> +   flush_work(>inject);
> >> +   flush_work(>kick);
> >> +   }
> >> +}
> >> +
> >>   static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
> >>  u64 desc_area, u64 driver_area,
> >>  u64 device_area)
> >> @@ -724,6 +736,17 @@ static int vduse_vdpa_reset(struct vdpa_device *vdpa)
> >>  return ret;
> >>   }
> >>
> >> +static int vduse_vdpa_suspend(struct vdpa_device *vdpa)
> >> +{
> >> +   struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >> +
> >> +   down_write(>rwsem);
> >> +   vduse_flush_work(dev);
> >> +   up_write(>rwsem);
> >
> > Can this forbid the new work to be scheduled?
>
> Are you suggesting I return an error below if the dev is suspended?
> I can do that.

I mean the irq injection work can still be scheduled after vduse_vdpa_suspend().

>
> However, I now suspect this implementation of vduse_vdpa_suspend is not
> complete in other ways, so I withdraw this patch pending future work.
> Thanks for looking at it.

Ok.

Thanks

>
> - Steve
>
> > static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
> >  struct work_struct *irq_work,
> >  int irq_effective_cpu)
> > {
> >  int ret = -EINVAL;
> >
> >  down_read(>rwsem);
> >  if (!(dev->status & VIRTIO_CONFIG_S_DRIVER_OK))
> >  goto unlock;
> >
> >  ret = 0;
> >  if (irq_effective_cpu == IRQ_UNBOUND)
> >  queue_work(vduse_irq_wq, irq_work);
> >  else
> >  queue_work_on(irq_effective_cpu,
> >vduse_irq_bound_wq, irq_work);
> > unlock:
> >  up_read(>rwsem);
> >
> >  return ret;
> > }
> >
> > Thanks
> >
> >> +
> >> +   return 0;
> >> +}
> >> +
> >>   static u32 vduse_vdpa_get_generation(struct vdpa_device *vdpa)
> >>   {
> >>  struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> >> @@ -806,6 +829,7 @@ static const struct vdpa_config_ops 
> >> vduse_vdpa_config_ops = {
> >>  .set_vq_affinity= vduse_vdpa_set_vq_affinity,
> >>  .get_vq_affinity= vduse_vdpa_get_vq_affinity,
> >>  .reset  = vduse_vdpa_reset,
> >> +   .suspend= vduse_vdpa_suspend,
> >>  .set_map= vduse_vdpa_set_map,
> >>  .free   = vduse_vdpa_free,
> >>   };
> >> --
> >> 2.39.3
> >>
> >
>




Re: [PATCH V3 3/3] vdpa_sim: flush workers on suspend

2024-05-21 Thread Jason Wang
On Tue, May 21, 2024 at 9:39 PM Steven Sistare
 wrote:
>
> On 5/20/2024 10:32 PM, Jason Wang wrote:
> > On Mon, May 20, 2024 at 11:21 PM Steve Sistare
> >  wrote:
> >>
> >> Flush to guarantee no workers are running when suspend returns.
> >> Add a lock to enforce ordering between clearing running, flushing,
> >> and posting new work in vdpasim_kick_vq.  It must be a spin lock
> >> because vdpasim_kick_vq may be reached va eventfd_write.
> >>
> >> Signed-off-by: Steve Sistare 
> >> ---
> >>   drivers/vdpa/vdpa_sim/vdpa_sim.c | 16 ++--
> >>   drivers/vdpa/vdpa_sim/vdpa_sim.h |  1 +
> >>   2 files changed, 15 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c 
> >> b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> >> index 8ffea8430f95..67ed49d95bf0 100644
> >> --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
> >> +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> >> @@ -322,7 +322,7 @@ static u16 vdpasim_get_vq_size(struct vdpa_device 
> >> *vdpa, u16 idx)
> >>  return VDPASIM_QUEUE_MAX;
> >>   }
> >>
> >> -static void vdpasim_kick_vq(struct vdpa_device *vdpa, u16 idx)
> >> +static void vdpasim_do_kick_vq(struct vdpa_device *vdpa, u16 idx)
> >>   {
> >>  struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> >>  struct vdpasim_virtqueue *vq = >vqs[idx];
> >> @@ -337,6 +337,15 @@ static void vdpasim_kick_vq(struct vdpa_device *vdpa, 
> >> u16 idx)
> >>  vdpasim_schedule_work(vdpasim);
> >>   }
> >>
> >> +static void vdpasim_kick_vq(struct vdpa_device *vdpa, u16 idx)
> >> +{
> >> +   struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> >> +
> >> +   spin_lock(>kick_lock);
> >> +   vdpasim_do_kick_vq(vdpa, idx);
> >> +   spin_unlock(>kick_lock);
> >> +}
> >> +
> >>   static void vdpasim_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
> >>struct vdpa_callback *cb)
> >>   {
> >> @@ -520,8 +529,11 @@ static int vdpasim_suspend(struct vdpa_device *vdpa)
> >>  struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> >>
> >>  mutex_lock(>mutex);
> >> +   spin_lock(>kick_lock);
> >>  vdpasim->running = false;
> >> +   spin_unlock(>kick_lock);
> >>  mutex_unlock(>mutex);
> >> +   kthread_flush_work(>work);
> >>
> >>  return 0;
> >>   }
> >> @@ -537,7 +549,7 @@ static int vdpasim_resume(struct vdpa_device *vdpa)
> >>  if (vdpasim->pending_kick) {
> >>  /* Process pending descriptors */
> >>  for (i = 0; i < vdpasim->dev_attr.nvqs; ++i)
> >> -   vdpasim_kick_vq(vdpa, i);
> >> +   vdpasim_do_kick_vq(vdpa, i);
> >>
> >>  vdpasim->pending_kick = false;
> >>  }
> >> diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.h 
> >> b/drivers/vdpa/vdpa_sim/vdpa_sim.h
> >> index bb137e479763..5eb6ca9c5ec5 100644
> >> --- a/drivers/vdpa/vdpa_sim/vdpa_sim.h
> >> +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.h
> >> @@ -75,6 +75,7 @@ struct vdpasim {
> >>  bool pending_kick;
> >>  /* spinlock to synchronize iommu table */
> >>  spinlock_t iommu_lock;
> >> +   spinlock_t kick_lock;
> >
> > It looks to me this is not initialized?
>
> Yup, I lost that line while fiddling with different locking schemes.
> Thanks, will fix in V4.
>
> @@ -236,6 +236,7 @@ struct vdpasim *vdpasim_create(struct vdpasim_dev_attr
> *dev_attr,
>
>  mutex_init(>mutex);
>  spin_lock_init(>iommu_lock);
> +   spin_lock_init(>kick_lock);
>
> With that fix, does this patch earn your RB?

Yes.

Thanks

>
> - Steve
>
> >>   };
> >>
> >>   struct vdpasim *vdpasim_create(struct vdpasim_dev_attr *attr,
> >> --
> >> 2.39.3
> >>
> >
>




Re: [PATCH V3 3/3] vdpa_sim: flush workers on suspend

2024-05-20 Thread Jason Wang
On Mon, May 20, 2024 at 11:21 PM Steve Sistare
 wrote:
>
> Flush to guarantee no workers are running when suspend returns.
> Add a lock to enforce ordering between clearing running, flushing,
> and posting new work in vdpasim_kick_vq.  It must be a spin lock
> because vdpasim_kick_vq may be reached va eventfd_write.
>
> Signed-off-by: Steve Sistare 
> ---
>  drivers/vdpa/vdpa_sim/vdpa_sim.c | 16 ++--
>  drivers/vdpa/vdpa_sim/vdpa_sim.h |  1 +
>  2 files changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.c 
> b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> index 8ffea8430f95..67ed49d95bf0 100644
> --- a/drivers/vdpa/vdpa_sim/vdpa_sim.c
> +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.c
> @@ -322,7 +322,7 @@ static u16 vdpasim_get_vq_size(struct vdpa_device *vdpa, 
> u16 idx)
> return VDPASIM_QUEUE_MAX;
>  }
>
> -static void vdpasim_kick_vq(struct vdpa_device *vdpa, u16 idx)
> +static void vdpasim_do_kick_vq(struct vdpa_device *vdpa, u16 idx)
>  {
> struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> struct vdpasim_virtqueue *vq = >vqs[idx];
> @@ -337,6 +337,15 @@ static void vdpasim_kick_vq(struct vdpa_device *vdpa, 
> u16 idx)
> vdpasim_schedule_work(vdpasim);
>  }
>
> +static void vdpasim_kick_vq(struct vdpa_device *vdpa, u16 idx)
> +{
> +   struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> +
> +   spin_lock(>kick_lock);
> +   vdpasim_do_kick_vq(vdpa, idx);
> +   spin_unlock(>kick_lock);
> +}
> +
>  static void vdpasim_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
>   struct vdpa_callback *cb)
>  {
> @@ -520,8 +529,11 @@ static int vdpasim_suspend(struct vdpa_device *vdpa)
> struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
>
> mutex_lock(>mutex);
> +   spin_lock(>kick_lock);
> vdpasim->running = false;
> +   spin_unlock(>kick_lock);
> mutex_unlock(>mutex);
> +   kthread_flush_work(>work);
>
> return 0;
>  }
> @@ -537,7 +549,7 @@ static int vdpasim_resume(struct vdpa_device *vdpa)
> if (vdpasim->pending_kick) {
> /* Process pending descriptors */
> for (i = 0; i < vdpasim->dev_attr.nvqs; ++i)
> -   vdpasim_kick_vq(vdpa, i);
> +   vdpasim_do_kick_vq(vdpa, i);
>
> vdpasim->pending_kick = false;
> }
> diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.h 
> b/drivers/vdpa/vdpa_sim/vdpa_sim.h
> index bb137e479763..5eb6ca9c5ec5 100644
> --- a/drivers/vdpa/vdpa_sim/vdpa_sim.h
> +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.h
> @@ -75,6 +75,7 @@ struct vdpasim {
> bool pending_kick;
> /* spinlock to synchronize iommu table */
> spinlock_t iommu_lock;
> +   spinlock_t kick_lock;

It looks to me this is not initialized?

Thanks

>  };
>
>  struct vdpasim *vdpasim_create(struct vdpasim_dev_attr *attr,
> --
> 2.39.3
>




Re: [PATCH V3 2/3] vduse: suspend

2024-05-20 Thread Jason Wang
On Mon, May 20, 2024 at 11:21 PM Steve Sistare
 wrote:
>
> Support the suspend operation.  There is little to do, except flush to
> guarantee no workers are running when suspend returns.
>
> Signed-off-by: Steve Sistare 
> ---
>  drivers/vdpa/vdpa_user/vduse_dev.c | 24 
>  1 file changed, 24 insertions(+)
>
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> b/drivers/vdpa/vdpa_user/vduse_dev.c
> index 73c89701fc9d..7dc46f771f12 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -472,6 +472,18 @@ static void vduse_dev_reset(struct vduse_dev *dev)
> up_write(>rwsem);
>  }
>
> +static void vduse_flush_work(struct vduse_dev *dev)
> +{
> +   flush_work(>inject);
> +
> +   for (int i = 0; i < dev->vq_num; i++) {
> +   struct vduse_virtqueue *vq = dev->vqs[i];
> +
> +   flush_work(>inject);
> +   flush_work(>kick);
> +   }
> +}
> +
>  static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
> u64 desc_area, u64 driver_area,
> u64 device_area)
> @@ -724,6 +736,17 @@ static int vduse_vdpa_reset(struct vdpa_device *vdpa)
> return ret;
>  }
>
> +static int vduse_vdpa_suspend(struct vdpa_device *vdpa)
> +{
> +   struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +   down_write(>rwsem);
> +   vduse_flush_work(dev);
> +   up_write(>rwsem);

Can this forbid the new work to be scheduled?

static int vduse_dev_queue_irq_work(struct vduse_dev *dev,
struct work_struct *irq_work,
int irq_effective_cpu)
{
int ret = -EINVAL;

down_read(&dev->rwsem);
if (!(dev->status & VIRTIO_CONFIG_S_DRIVER_OK))
goto unlock;

ret = 0;
if (irq_effective_cpu == IRQ_UNBOUND)
queue_work(vduse_irq_wq, irq_work);
else
queue_work_on(irq_effective_cpu,
  vduse_irq_bound_wq, irq_work);
unlock:
up_read(&dev->rwsem);

return ret;
}

Thanks

> +
> +   return 0;
> +}
> +
>  static u32 vduse_vdpa_get_generation(struct vdpa_device *vdpa)
>  {
> struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> @@ -806,6 +829,7 @@ static const struct vdpa_config_ops vduse_vdpa_config_ops 
> = {
> .set_vq_affinity= vduse_vdpa_set_vq_affinity,
> .get_vq_affinity= vduse_vdpa_get_vq_affinity,
> .reset  = vduse_vdpa_reset,
> +   .suspend= vduse_vdpa_suspend,
> .set_map= vduse_vdpa_set_map,
> .free   = vduse_vdpa_free,
>  };
> --
> 2.39.3
>




Re: [PATCH V3 1/3] vhost-vdpa: flush workers on suspend

2024-05-20 Thread Jason Wang
On Mon, May 20, 2024 at 11:21 PM Steve Sistare
 wrote:
>
> Flush to guarantee no workers are running when suspend returns.
>
> Fixes: f345a0143b4d ("vhost-vdpa: uAPI to suspend the device")
> Signed-off-by: Steve Sistare 
> Acked-by: Eugenio Pérez 
> ---
>  drivers/vhost/vdpa.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index ba52d128aeb7..189596caaec9 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -594,6 +594,7 @@ static long vhost_vdpa_suspend(struct vhost_vdpa *v)
> struct vdpa_device *vdpa = v->vdpa;
> const struct vdpa_config_ops *ops = vdpa->config;
> int ret;
> +   struct vhost_dev *vdev = &v->vdev;
>
> if (!(ops->get_status(vdpa) & VIRTIO_CONFIG_S_DRIVER_OK))
> return 0;
> @@ -601,6 +602,8 @@ static long vhost_vdpa_suspend(struct vhost_vdpa *v)
> if (!ops->suspend)
> return -EOPNOTSUPP;
>
> +   vhost_dev_flush(vdev);

vhost-vDPA doesn't use workers, see:

vhost_dev_init(dev, vqs, nvqs, 0, 0, 0, false,
   vhost_vdpa_process_iotlb_msg);

So I wonder if this is a must.

Thanks

> +
> ret = ops->suspend(vdpa);
> if (!ret)
> v->suspended = true;
> --
> 2.39.3
>




Re: [REGRESSION][v6.8-rc1] virtio-pci: Introduce admin virtqueue

2024-05-16 Thread Jason Wang
-for-6.9-rc7-fixes' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab)
>>
>>
>> I add some debug message on the latest kernel, and do above steps to
>> trigger "suspen/resume". Everything of VM is OK, VM could suspend/resume
>> successfully.
>> Follwing is the kernel log:
>> 
>> 
>> May  6 15:59:52 feliu-vm kernel: [   43.446737] PM: suspend entry (deep)
>> May  6 16:00:04 feliu-vm kernel: [   43.467640] Filesystems sync: 0.020
>> seconds
>> May  6 16:00:04 feliu-vm kernel: [   43.467923] Freezing user space
>> processes
>> May  6 16:00:04 feliu-vm kernel: [   43.470294] Freezing user space
>> processes completed (elapsed 0.002 seconds)
>> May  6 16:00:04 feliu-vm kernel: [   43.470299] OOM killer disabled.
>> May  6 16:00:04 feliu-vm kernel: [   43.470301] Freezing remaining
>> freezable tasks
>> May  6 16:00:04 feliu-vm kernel: [   43.471482] Freezing remaining
>> freezable tasks completed (elapsed 0.001 seconds)
>> May  6 16:00:04 feliu-vm kernel: [   43.471495] printk: Suspending
>> console(s) (use no_console_suspend to debug)
>> May  6 16:00:04 feliu-vm kernel: [   43.474034] virtio_net virtio0:
>> godeng virtio device freeze
>> May  6 16:00:04 feliu-vm kernel: [   43.475714] virtio_net virtio0 ens3:
>> godfeng virtnet_freeze done
>> May  6 16:00:04 feliu-vm kernel: [   43.475717] virtio_net virtio0:
>> godfeng VIRTIO_F_ADMIN_VQ not enabled
>> May  6 16:00:04 feliu-vm kernel: [   43.475719] virtio_net virtio0:
>> godeng virtio device freeze done
>> 
>> May  6 16:00:04 feliu-vm kernel: [   43.535382] smpboot: CPU 1 is now
>> offline
>> May  6 16:00:04 feliu-vm kernel: [   43.537283] IRQ fixup: irq 1 move in
>> progress, old vector 32
>> May  6 16:00:04 feliu-vm kernel: [   43.538504] smpboot: CPU 2 is now
>> offline
>> May  6 16:00:04 feliu-vm kernel: [   43.541392] smpboot: CPU 3 is now
>> offline
>>
>> ..
>>
>> May  6 16:00:04 feliu-vm kernel: [   54.973285] smpboot: Booting Node 0
>> Processor 15 APIC 0xf
>> May  6 16:00:04 feliu-vm kernel: [   54.975190] CPU15 is up
>> May  6 16:00:04 feliu-vm kernel: [   54.976011] ACPI: PM: Waking up from
>> system sleep state S3
>> May  6 16:00:04 feliu-vm kernel: [   54.986071] virtio_net virtio0:
>> godeng virtio device restore
>> May  6 16:00:04 feliu-vm kernel: [   54.987563] virtio_net virtio0 ens3:
>> godfeng virtnet_restore done
>> May  6 16:00:04 feliu-vm kernel: [   54.987635] virtio_net virtio0:
>> godfeng: virtio device restore done
>> ..
>> May  6 16:00:04 feliu-vm kernel: [   55.307221] ata8: SATA link down
>> (SStatus 0 SControl 300)
>> May  6 16:00:04 feliu-vm kernel: [   55.442048] OOM killer enabled.
>> May  6 16:00:04 feliu-vm kernel: [   55.442051] Restarting tasks ... done.
>> May  6 16:00:04 feliu-vm kernel: [   55.443576] random: crng reseeded on
>> system resumption
>> May  6 16:00:04 feliu-vm kernel: [   55.443582] PM: suspend exit
>>
>> 
>>
>> Attachment is the full kernel log. I think maybe it is some configration
>> error.
>>
>>
>> Thanks
>> Feng
>>
>>
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: sd 0:0:1:0: [sda] Synchronizing SCSI cache
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: PM: Some devices failed to suspend, or early wake event detected
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: OOM killer enabled.
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: Restarting tasks ... done.
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: random: crng reseeded on system resumption
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: PM: suspend exit
>> > May 08 11:08:42 kernel-test-202405080702.c.ubuntu-catred.internal
>> > kernel: PM: suspend entry (s2idle)
>> > -- Boot 61828bc938b44fc68a8aeedc16a23a9d --
>> > May 08 11:09:03 localhost kernel: Linux version 6.8.0-1007-gcp
>> > (buildd@lcy02-amd64-079) (x86_64-linux-gnu-gcc-13 (Ubuntu
>> > 13.2.0-23ubuntu4) 13.2.0, GNU ld (GNU Binutils for Ubuntu) 2.42)
>> > #7-Ubuntu SMP Sat Apr 20 00:58:31 UTC 2024 (Ubuntu 6.8.0-1007.7-gcp 6.8.1)
>> > May 08 11:09:03 localhost kernel: Command line:

Re: [PATCH] vhost/vsock: always initialize seqpacket_allow

2024-05-15 Thread Jason Wang
On Wed, May 15, 2024 at 11:05 PM Michael S. Tsirkin  wrote:
>
> There are two issues around seqpacket_allow:
> 1. seqpacket_allow is not initialized when socket is
>created. Thus if features are never set, it will be
>read uninitialized.
> 2. if VIRTIO_VSOCK_F_SEQPACKET is set and then cleared,
>then seqpacket_allow will not be cleared appropriately
>(existing apps I know about don't usually do this but
> it's legal and there's no way to be sure no one relies
> on this).
>
> To fix:
> - initialize seqpacket_allow after allocation
> - set it unconditionally in set_features
>
> Reported-by: syzbot+6c21aeb59d0e82eb2...@syzkaller.appspotmail.com
> Reported-by: Jeongjun Park 
> Fixes: ced7b713711f ("vhost/vsock: support SEQPACKET for transport").
> Cc: Arseny Krasnov 
> Cc: David S. Miller 
> Cc: Stefan Hajnoczi 
> Signed-off-by: Michael S. Tsirkin 
> Acked-by: Arseniy Krasnov 
> Tested-by: Arseniy Krasnov 
>

Acked-by: Jason Wang 

Thanks




Re: [REGRESSION][v6.8-rc1] virtio-pci: Introduce admin virtqueue

2024-05-07 Thread Jason Wang
On Sat, May 4, 2024 at 2:10 AM Joseph Salisbury
 wrote:
>
> Hi Feng,
>
> During testing, a kernel bug was identified with the suspend/resume
> functionality on instances running in a public cloud [0].  This bug is a
> regression introduced in v6.8-rc1.  After a kernel bisect, the following
> commit was identified as the cause of the regression:
>
> fd27ef6b44be  ("virtio-pci: Introduce admin virtqueue")

Have a quick glance at the patch it seems it should not damage the
freeze/restore as it should behave as in the past.

But I found something interesting:

1) assumes 1 admin vq which is not what spec said
2) special function for admin virtqueue during freeze/restore, but it
doesn't do anything special than del_vq()
3) lack real users but I guess e.g the destroy_avq() needs to be
synchronized with the one that is using admin virtqueue

>
> I was hoping to get your feedback, since you are the patch author. Do
> you think gathering any additional data will help diagnose this issue?

Yes, please show us

1) the kernel log here.
2) the features that the device has like
/sys/bus/virtio/devices/virtio0/features

> This commit is depended upon by other virtio commits, so a revert test
> is not really straight forward without reverting all the dependencies.
> Any ideas you have would be greatly appreciated.

Thanks

>
>
> Thanks,
>
> Joe
>
> http://pad.lv/2063315
>




Re: [PATCH] virtio_net: Warn if insufficient queue length for transmitting

2024-05-05 Thread Jason Wang
On Wed, May 1, 2024 at 4:07 AM Michael S. Tsirkin  wrote:
>
> On Tue, Apr 30, 2024 at 03:35:09PM -0400, Darius Rad wrote:
> > The transmit queue is stopped when the number of free queue entries is less
> > than 2+MAX_SKB_FRAGS, in start_xmit().  If the queue length (QUEUE_NUM_MAX)
> > is less than then this, transmission will immediately trigger a netdev
> > watchdog timeout.  Report this condition earlier and more directly.
> >
> > Signed-off-by: Darius Rad 
> > ---
> >  drivers/net/virtio_net.c | 3 +++
> >  1 file changed, 3 insertions(+)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 115c3c5414f2..72ee8473b61c 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -4917,6 +4917,9 @@ static int virtnet_probe(struct virtio_device *vdev)
> >   set_bit(guest_offloads[i], >guest_offloads);
> >   vi->guest_offloads_capable = vi->guest_offloads;
> >
> > + if (virtqueue_get_vring_size(vi->sq->vq) < 2 + MAX_SKB_FRAGS)
> > + netdev_warn_once(dev, "not enough queue entries, expect xmit 
> > timeout\n");
> > +
>
> How about actually fixing it though? E.g. by linearizing...

Actually, the linearing is only needed for the case when the indirect
descriptor is not supported.

>
> It also bothers me that there's practically
> /proc/sys/net/core/max_skb_frags
> and if that's low then things could actually work.

Probably not as it won't exceed MAX_SKB_FRAGS.

>
> Finally, while originally it was just 17 typically, now it's
> configurable. So it's possible that you change the config to make big
> tcp

Note that virtio-net doesn't fully support big TCP.

> work better and device stops working while it worked fine
> previously.

For this patch, I guess not as we had:

if (sq->vq->num_free < 2+MAX_SKB_FRAGS)

in the tx path. So it won't even work before this patch.

Thanks

>
>
> >   pr_debug("virtnet: registered device %s with %d RX and TX vq's\n",
> >dev->name, max_queue_pairs);
> >
> > --
> > 2.39.2
>




Re: [PATCH v5 3/5] vduse: Add function to get/free the pages for reconnection

2024-04-24 Thread Jason Wang
On Wed, Apr 24, 2024 at 5:51 PM Michael S. Tsirkin  wrote:
>
> On Wed, Apr 24, 2024 at 08:44:10AM +0800, Jason Wang wrote:
> > On Tue, Apr 23, 2024 at 4:42 PM Michael S. Tsirkin  wrote:
> > >
> > > On Tue, Apr 23, 2024 at 11:09:59AM +0800, Jason Wang wrote:
> > > > On Tue, Apr 23, 2024 at 4:05 AM Michael S. Tsirkin  
> > > > wrote:
> > > > >
> > > > > On Thu, Apr 18, 2024 at 08:57:51AM +0800, Jason Wang wrote:
> > > > > > On Wed, Apr 17, 2024 at 5:29 PM Michael S. Tsirkin 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Fri, Apr 12, 2024 at 09:28:23PM +0800, Cindy Lu wrote:
> > > > > > > > Add the function vduse_alloc_reconnnect_info_mem
> > > > > > > > and vduse_alloc_reconnnect_info_mem
> > > > > > > > These functions allow vduse to allocate and free memory for 
> > > > > > > > reconnection
> > > > > > > > information. The amount of memory allocated is vq_num pages.
> > > > > > > > Each VQS will map its own page where the reconnection 
> > > > > > > > information will be saved
> > > > > > > >
> > > > > > > > Signed-off-by: Cindy Lu 
> > > > > > > > ---
> > > > > > > >  drivers/vdpa/vdpa_user/vduse_dev.c | 40 
> > > > > > > > ++
> > > > > > > >  1 file changed, 40 insertions(+)
> > > > > > > >
> > > > > > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> > > > > > > > b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > > index ef3c9681941e..2da659d5f4a8 100644
> > > > > > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > > @@ -65,6 +65,7 @@ struct vduse_virtqueue {
> > > > > > > >   int irq_effective_cpu;
> > > > > > > >   struct cpumask irq_affinity;
> > > > > > > >   struct kobject kobj;
> > > > > > > > + unsigned long vdpa_reconnect_vaddr;
> > > > > > > >  };
> > > > > > > >
> > > > > > > >  struct vduse_dev;
> > > > > > > > @@ -1105,6 +1106,38 @@ static void 
> > > > > > > > vduse_vq_update_effective_cpu(struct vduse_virtqueue *vq)
> > > > > > > >
> > > > > > > >   vq->irq_effective_cpu = curr_cpu;
> > > > > > > >  }
> > > > > > > > +static int vduse_alloc_reconnnect_info_mem(struct vduse_dev 
> > > > > > > > *dev)
> > > > > > > > +{
> > > > > > > > + unsigned long vaddr = 0;
> > > > > > > > + struct vduse_virtqueue *vq;
> > > > > > > > +
> > > > > > > > + for (int i = 0; i < dev->vq_num; i++) {
> > > > > > > > + /*page 0~ vq_num save the reconnect info for vq*/
> > > > > > > > + vq = dev->vqs[i];
> > > > > > > > + vaddr = get_zeroed_page(GFP_KERNEL);
> > > > > > >
> > > > > > >
> > > > > > > I don't get why you insist on stealing kernel memory for something
> > > > > > > that is just used by userspace to store data for its own use.
> > > > > > > Userspace does not lack ways to persist data, for example,
> > > > > > > create a regular file anywhere in the filesystem.
> > > > > >
> > > > > > Good point. So the motivation here is to:
> > > > > >
> > > > > > 1) be self contained, no dependency for high speed persist data
> > > > > > storage like tmpfs
> > > > >
> > > > > No idea what this means.
> > > >
> > > > I mean a regular file may slow down the datapath performance, so
> > > > usually the application will try to use tmpfs and other which is a
> > > > dependency for implementing the reconnection.
> > >
> > > Are we worried about systems without tmpfs now?
> >
> > Yes.
>
> Why? Who ships these?

Not sure, but it could be disabled or unmounted. I'm not sure make
VDUSE depends on TMPFS is a good idea.

Thanks




Re: [PATCH v5 3/5] vduse: Add function to get/free the pages for reconnection

2024-04-23 Thread Jason Wang
On Wed, Apr 24, 2024 at 11:15 AM Cindy Lu  wrote:
>
> On Wed, Apr 24, 2024 at 8:44 AM Jason Wang  wrote:
> >
> > On Tue, Apr 23, 2024 at 4:42 PM Michael S. Tsirkin  wrote:
> > >
> > > On Tue, Apr 23, 2024 at 11:09:59AM +0800, Jason Wang wrote:
> > > > On Tue, Apr 23, 2024 at 4:05 AM Michael S. Tsirkin  
> > > > wrote:
> > > > >
> > > > > On Thu, Apr 18, 2024 at 08:57:51AM +0800, Jason Wang wrote:
> > > > > > On Wed, Apr 17, 2024 at 5:29 PM Michael S. Tsirkin 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Fri, Apr 12, 2024 at 09:28:23PM +0800, Cindy Lu wrote:
> > > > > > > > Add the function vduse_alloc_reconnnect_info_mem
> > > > > > > > and vduse_alloc_reconnnect_info_mem
> > > > > > > > These functions allow vduse to allocate and free memory for 
> > > > > > > > reconnection
> > > > > > > > information. The amount of memory allocated is vq_num pages.
> > > > > > > > Each VQS will map its own page where the reconnection 
> > > > > > > > information will be saved
> > > > > > > >
> > > > > > > > Signed-off-by: Cindy Lu 
> > > > > > > > ---
> > > > > > > >  drivers/vdpa/vdpa_user/vduse_dev.c | 40 
> > > > > > > > ++
> > > > > > > >  1 file changed, 40 insertions(+)
> > > > > > > >
> > > > > > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> > > > > > > > b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > > index ef3c9681941e..2da659d5f4a8 100644
> > > > > > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > > > @@ -65,6 +65,7 @@ struct vduse_virtqueue {
> > > > > > > >   int irq_effective_cpu;
> > > > > > > >   struct cpumask irq_affinity;
> > > > > > > >   struct kobject kobj;
> > > > > > > > + unsigned long vdpa_reconnect_vaddr;
> > > > > > > >  };
> > > > > > > >
> > > > > > > >  struct vduse_dev;
> > > > > > > > @@ -1105,6 +1106,38 @@ static void 
> > > > > > > > vduse_vq_update_effective_cpu(struct vduse_virtqueue *vq)
> > > > > > > >
> > > > > > > >   vq->irq_effective_cpu = curr_cpu;
> > > > > > > >  }
> > > > > > > > +static int vduse_alloc_reconnnect_info_mem(struct vduse_dev 
> > > > > > > > *dev)
> > > > > > > > +{
> > > > > > > > + unsigned long vaddr = 0;
> > > > > > > > + struct vduse_virtqueue *vq;
> > > > > > > > +
> > > > > > > > + for (int i = 0; i < dev->vq_num; i++) {
> > > > > > > > + /*page 0~ vq_num save the reconnect info for vq*/
> > > > > > > > + vq = dev->vqs[i];
> > > > > > > > + vaddr = get_zeroed_page(GFP_KERNEL);
> > > > > > >
> > > > > > >
> > > > > > > I don't get why you insist on stealing kernel memory for something
> > > > > > > that is just used by userspace to store data for its own use.
> > > > > > > Userspace does not lack ways to persist data, for example,
> > > > > > > create a regular file anywhere in the filesystem.
> > > > > >
> > > > > > Good point. So the motivation here is to:
> > > > > >
> > > > > > 1) be self contained, no dependency for high speed persist data
> > > > > > storage like tmpfs
> > > > >
> > > > > No idea what this means.
> > > >
> > > > I mean a regular file may slow down the datapath performance, so
> > > > usually the application will try to use tmpfs and other which is a
> > > > dependency for implementing the reconnection.
> > >
> > > Are we worried about systems without tmpfs now?
> >
> > Yes.
> >
> > >
> > >
> > > > >
> > > > > > 

Re: [PATCH v5 3/5] vduse: Add function to get/free the pages for reconnection

2024-04-23 Thread Jason Wang
On Tue, Apr 23, 2024 at 4:42 PM Michael S. Tsirkin  wrote:
>
> On Tue, Apr 23, 2024 at 11:09:59AM +0800, Jason Wang wrote:
> > On Tue, Apr 23, 2024 at 4:05 AM Michael S. Tsirkin  wrote:
> > >
> > > On Thu, Apr 18, 2024 at 08:57:51AM +0800, Jason Wang wrote:
> > > > On Wed, Apr 17, 2024 at 5:29 PM Michael S. Tsirkin  
> > > > wrote:
> > > > >
> > > > > On Fri, Apr 12, 2024 at 09:28:23PM +0800, Cindy Lu wrote:
> > > > > > Add the function vduse_alloc_reconnnect_info_mem
> > > > > > and vduse_alloc_reconnnect_info_mem
> > > > > > These functions allow vduse to allocate and free memory for 
> > > > > > reconnection
> > > > > > information. The amount of memory allocated is vq_num pages.
> > > > > > Each VQS will map its own page where the reconnection information 
> > > > > > will be saved
> > > > > >
> > > > > > Signed-off-by: Cindy Lu 
> > > > > > ---
> > > > > >  drivers/vdpa/vdpa_user/vduse_dev.c | 40 
> > > > > > ++
> > > > > >  1 file changed, 40 insertions(+)
> > > > > >
> > > > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> > > > > > b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > index ef3c9681941e..2da659d5f4a8 100644
> > > > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > > > @@ -65,6 +65,7 @@ struct vduse_virtqueue {
> > > > > >   int irq_effective_cpu;
> > > > > >   struct cpumask irq_affinity;
> > > > > >   struct kobject kobj;
> > > > > > + unsigned long vdpa_reconnect_vaddr;
> > > > > >  };
> > > > > >
> > > > > >  struct vduse_dev;
> > > > > > @@ -1105,6 +1106,38 @@ static void 
> > > > > > vduse_vq_update_effective_cpu(struct vduse_virtqueue *vq)
> > > > > >
> > > > > >   vq->irq_effective_cpu = curr_cpu;
> > > > > >  }
> > > > > > +static int vduse_alloc_reconnnect_info_mem(struct vduse_dev *dev)
> > > > > > +{
> > > > > > + unsigned long vaddr = 0;
> > > > > > + struct vduse_virtqueue *vq;
> > > > > > +
> > > > > > + for (int i = 0; i < dev->vq_num; i++) {
> > > > > > + /*page 0~ vq_num save the reconnect info for vq*/
> > > > > > + vq = dev->vqs[i];
> > > > > > + vaddr = get_zeroed_page(GFP_KERNEL);
> > > > >
> > > > >
> > > > > I don't get why you insist on stealing kernel memory for something
> > > > > that is just used by userspace to store data for its own use.
> > > > > Userspace does not lack ways to persist data, for example,
> > > > > create a regular file anywhere in the filesystem.
> > > >
> > > > Good point. So the motivation here is to:
> > > >
> > > > 1) be self contained, no dependency for high speed persist data
> > > > storage like tmpfs
> > >
> > > No idea what this means.
> >
> > I mean a regular file may slow down the datapath performance, so
> > usually the application will try to use tmpfs and other which is a
> > dependency for implementing the reconnection.
>
> Are we worried about systems without tmpfs now?

Yes.

>
>
> > >
> > > > 2) standardize the format in uAPI which allows reconnection from
> > > > arbitrary userspace, unfortunately, such effort was removed in new
> > > > versions
> > >
> > > And I don't see why that has to live in the kernel tree either.
> >
> > I can't find a better place, any idea?
> >
> > Thanks
>
>
> Well anywhere on github really. with libvhost-user maybe?
> It's harmless enough in Documentation
> if you like but ties you to the kernel release cycle in a way that
> is completely unnecessary.

Ok.

Thanks

>
> > >
> > > > If the above doesn't make sense, we don't need to offer those pages by 
> > > > VDUSE.
> > > >
> > > > Thanks
> > > >
> > > >
> > > > >
> > > > >
> > > > >
> > > > > > + if (vad

Re: [PATCH v5 3/5] vduse: Add function to get/free the pages for reconnection

2024-04-22 Thread Jason Wang
On Tue, Apr 23, 2024 at 4:05 AM Michael S. Tsirkin  wrote:
>
> On Thu, Apr 18, 2024 at 08:57:51AM +0800, Jason Wang wrote:
> > On Wed, Apr 17, 2024 at 5:29 PM Michael S. Tsirkin  wrote:
> > >
> > > On Fri, Apr 12, 2024 at 09:28:23PM +0800, Cindy Lu wrote:
> > > > Add the function vduse_alloc_reconnnect_info_mem
> > > > and vduse_alloc_reconnnect_info_mem
> > > > These functions allow vduse to allocate and free memory for reconnection
> > > > information. The amount of memory allocated is vq_num pages.
> > > > Each VQS will map its own page where the reconnection information will 
> > > > be saved
> > > >
> > > > Signed-off-by: Cindy Lu 
> > > > ---
> > > >  drivers/vdpa/vdpa_user/vduse_dev.c | 40 ++
> > > >  1 file changed, 40 insertions(+)
> > > >
> > > > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> > > > b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > index ef3c9681941e..2da659d5f4a8 100644
> > > > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > > > @@ -65,6 +65,7 @@ struct vduse_virtqueue {
> > > >   int irq_effective_cpu;
> > > >   struct cpumask irq_affinity;
> > > >   struct kobject kobj;
> > > > + unsigned long vdpa_reconnect_vaddr;
> > > >  };
> > > >
> > > >  struct vduse_dev;
> > > > @@ -1105,6 +1106,38 @@ static void vduse_vq_update_effective_cpu(struct 
> > > > vduse_virtqueue *vq)
> > > >
> > > >   vq->irq_effective_cpu = curr_cpu;
> > > >  }
> > > > +static int vduse_alloc_reconnnect_info_mem(struct vduse_dev *dev)
> > > > +{
> > > > + unsigned long vaddr = 0;
> > > > + struct vduse_virtqueue *vq;
> > > > +
> > > > + for (int i = 0; i < dev->vq_num; i++) {
> > > > + /*page 0~ vq_num save the reconnect info for vq*/
> > > > + vq = dev->vqs[i];
> > > > + vaddr = get_zeroed_page(GFP_KERNEL);
> > >
> > >
> > > I don't get why you insist on stealing kernel memory for something
> > > that is just used by userspace to store data for its own use.
> > > Userspace does not lack ways to persist data, for example,
> > > create a regular file anywhere in the filesystem.
> >
> > Good point. So the motivation here is to:
> >
> > 1) be self contained, no dependency for high speed persist data
> > storage like tmpfs
>
> No idea what this means.

I mean a regular file may slow down the datapath performance, so
usually the application will try to use tmpfs and other which is a
dependency for implementing the reconnection.

>
> > 2) standardize the format in uAPI which allows reconnection from
> > arbitrary userspace, unfortunately, such effort was removed in new
> > versions
>
> And I don't see why that has to live in the kernel tree either.

I can't find a better place, any idea?

Thanks

>
> > If the above doesn't make sense, we don't need to offer those pages by 
> > VDUSE.
> >
> > Thanks
> >
> >
> > >
> > >
> > >
> > > > + if (vaddr == 0)
> > > > + return -ENOMEM;
> > > > +
> > > > + vq->vdpa_reconnect_vaddr = vaddr;
> > > > + }
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > +static int vduse_free_reconnnect_info_mem(struct vduse_dev *dev)
> > > > +{
> > > > + struct vduse_virtqueue *vq;
> > > > +
> > > > + for (int i = 0; i < dev->vq_num; i++) {
> > > > + vq = dev->vqs[i];
> > > > +
> > > > + if (vq->vdpa_reconnect_vaddr)
> > > > + free_page(vq->vdpa_reconnect_vaddr);
> > > > + vq->vdpa_reconnect_vaddr = 0;
> > > > + }
> > > > +
> > > > + return 0;
> > > > +}
> > > >
> > > >  static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> > > >   unsigned long arg)
> > > > @@ -1672,6 +1705,8 @@ static int vduse_destroy_dev(char *name)
> > > >   mutex_unlock(&dev->lock);
> > > >   return -EBUSY;
> > > >   }
> > > > + vduse_free_reconnnect_info_mem(dev);
> > > > +
> > > >   dev->connected = true;
> > > >   mutex_unlock(&dev->lock);
> > > >
> > > > @@ -1855,12 +1890,17 @@ static int vduse_create_dev(struct 
> > > > vduse_dev_config *config,
> > > >   ret = vduse_dev_init_vqs(dev, config->vq_align, config->vq_num);
> > > >   if (ret)
> > > >   goto err_vqs;
> > > > + ret = vduse_alloc_reconnnect_info_mem(dev);
> > > > + if (ret < 0)
> > > > + goto err_mem;
> > > >
> > > >   __module_get(THIS_MODULE);
> > > >
> > > >   return 0;
> > > >  err_vqs:
> > > >   device_destroy(&vduse_class, MKDEV(MAJOR(vduse_major), 
> > > > dev->minor));
> > > > +err_mem:
> > > > + vduse_free_reconnnect_info_mem(dev);
> > > >  err_dev:
> > > >   idr_remove(&vduse_idr, dev->minor);
> > > >  err_idr:
> > > > --
> > > > 2.43.0
> > >
>




Re: [PATCH v5 3/5] vduse: Add function to get/free the pages for reconnection

2024-04-17 Thread Jason Wang
On Wed, Apr 17, 2024 at 5:29 PM Michael S. Tsirkin  wrote:
>
> On Fri, Apr 12, 2024 at 09:28:23PM +0800, Cindy Lu wrote:
> > Add the function vduse_alloc_reconnnect_info_mem
> > and vduse_alloc_reconnnect_info_mem
> > These functions allow vduse to allocate and free memory for reconnection
> > information. The amount of memory allocated is vq_num pages.
> > Each VQS will map its own page where the reconnection information will be 
> > saved
> >
> > Signed-off-by: Cindy Lu 
> > ---
> >  drivers/vdpa/vdpa_user/vduse_dev.c | 40 ++
> >  1 file changed, 40 insertions(+)
> >
> > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> > b/drivers/vdpa/vdpa_user/vduse_dev.c
> > index ef3c9681941e..2da659d5f4a8 100644
> > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> > @@ -65,6 +65,7 @@ struct vduse_virtqueue {
> >   int irq_effective_cpu;
> >   struct cpumask irq_affinity;
> >   struct kobject kobj;
> > + unsigned long vdpa_reconnect_vaddr;
> >  };
> >
> >  struct vduse_dev;
> > @@ -1105,6 +1106,38 @@ static void vduse_vq_update_effective_cpu(struct 
> > vduse_virtqueue *vq)
> >
> >   vq->irq_effective_cpu = curr_cpu;
> >  }
> > +static int vduse_alloc_reconnnect_info_mem(struct vduse_dev *dev)
> > +{
> > + unsigned long vaddr = 0;
> > + struct vduse_virtqueue *vq;
> > +
> > + for (int i = 0; i < dev->vq_num; i++) {
> > + /*page 0~ vq_num save the reconnect info for vq*/
> > + vq = dev->vqs[i];
> > + vaddr = get_zeroed_page(GFP_KERNEL);
>
>
> I don't get why you insist on stealing kernel memory for something
> that is just used by userspace to store data for its own use.
> Userspace does not lack ways to persist data, for example,
> create a regular file anywhere in the filesystem.

Good point. So the motivation here is to:

1) be self contained, no dependency for high speed persist data
storage like tmpfs
2) standardize the format in uAPI which allows reconnection from
arbitrary userspace, unfortunately, such effort was removed in new
versions

If the above doesn't make sense, we don't need to offer those pages by VDUSE.

Thanks


>
>
>
> > + if (vaddr == 0)
> > + return -ENOMEM;
> > +
> > + vq->vdpa_reconnect_vaddr = vaddr;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +static int vduse_free_reconnnect_info_mem(struct vduse_dev *dev)
> > +{
> > + struct vduse_virtqueue *vq;
> > +
> > + for (int i = 0; i < dev->vq_num; i++) {
> > + vq = dev->vqs[i];
> > +
> > + if (vq->vdpa_reconnect_vaddr)
> > + free_page(vq->vdpa_reconnect_vaddr);
> > + vq->vdpa_reconnect_vaddr = 0;
> > + }
> > +
> > + return 0;
> > +}
> >
> >  static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> >   unsigned long arg)
> > @@ -1672,6 +1705,8 @@ static int vduse_destroy_dev(char *name)
> >   mutex_unlock(&dev->lock);
> >   return -EBUSY;
> >   }
> > + vduse_free_reconnnect_info_mem(dev);
> > +
> >   dev->connected = true;
> >   mutex_unlock(&dev->lock);
> >
> > @@ -1855,12 +1890,17 @@ static int vduse_create_dev(struct vduse_dev_config 
> > *config,
> >   ret = vduse_dev_init_vqs(dev, config->vq_align, config->vq_num);
> >   if (ret)
> >   goto err_vqs;
> > + ret = vduse_alloc_reconnnect_info_mem(dev);
> > + if (ret < 0)
> > + goto err_mem;
> >
> >   __module_get(THIS_MODULE);
> >
> >   return 0;
> >  err_vqs:
> >   device_destroy(&vduse_class, MKDEV(MAJOR(vduse_major), dev->minor));
> > +err_mem:
> > + vduse_free_reconnnect_info_mem(dev);
> >  err_dev:
> >   idr_remove(&vduse_idr, dev->minor);
> >  err_idr:
> > --
> > 2.43.0
>




Re: [PATCH net-next v9] virtio_net: Support RX hash XDP hint

2024-04-17 Thread Jason Wang
On Wed, Apr 17, 2024 at 3:20 PM Liang Chen  wrote:
>
> The RSS hash report is a feature that's part of the virtio specification.
> Currently, virtio backends like qemu, vdpa (mlx5), and potentially vhost
> (still a work in progress as per [1]) support this feature. While the
> capability to obtain the RSS hash has been enabled in the normal path,
> it's currently missing in the XDP path. Therefore, we are introducing
> XDP hints through kfuncs to allow XDP programs to access the RSS hash.
>
> 1.
> https://lore.kernel.org/all/20231015141644.260646-1-akihiko.od...@daynix.com/#r
>
> Signed-off-by: Liang Chen 
> ---

Acked-by: Jason Wang 

Thanks




Re: [PATCH net-next v8] virtio_net: Support RX hash XDP hint

2024-04-16 Thread Jason Wang
On Tue, Apr 16, 2024 at 2:20 PM Liang Chen  wrote:
>
> The RSS hash report is a feature that's part of the virtio specification.
> Currently, virtio backends like qemu, vdpa (mlx5), and potentially vhost
> (still a work in progress as per [1]) support this feature. While the
> capability to obtain the RSS hash has been enabled in the normal path,
> it's currently missing in the XDP path. Therefore, we are introducing
> XDP hints through kfuncs to allow XDP programs to access the RSS hash.
>
> 1.
> https://lore.kernel.org/all/20231015141644.260646-1-akihiko.od...@daynix.com/#r
>
> Signed-off-by: Liang Chen 
> ---
>   Changes from v7:
> - use table lookup for rss hash type
>   Changes from v6:
> - fix a coding style issue
>   Changes from v5:
> - Preservation of the hash value has been dropped, following the conclusion
>   from discussions in V3 reviews. The virtio_net driver doesn't
>   accessing/using the virtio_net_hdr after the XDP program execution, so
>   nothing tragic should happen. As to the xdp program, if it smashes the
>   entry in virtio header, it is likely buggy anyways. Additionally, looking
>   up the Intel IGC driver,  it also does not bother with this particular
>   aspect.
> ---
>  drivers/net/virtio_net.c| 42 +
>  include/uapi/linux/virtio_net.h |  1 +
>  2 files changed, 43 insertions(+)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index c22d1118a133..1d750009f615 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -4621,6 +4621,47 @@ static void virtnet_set_big_packets(struct 
> virtnet_info *vi, const int mtu)
> }
>  }
>
> +static enum xdp_rss_hash_type
> +virtnet_xdp_rss_type[VIRTIO_NET_HASH_REPORT_MAX_TABLE] = {
> +   [VIRTIO_NET_HASH_REPORT_NONE] = XDP_RSS_TYPE_NONE,
> +   [VIRTIO_NET_HASH_REPORT_IPv4] = XDP_RSS_TYPE_L3_IPV4,
> +   [VIRTIO_NET_HASH_REPORT_TCPv4] = XDP_RSS_TYPE_L4_IPV4_TCP,
> +   [VIRTIO_NET_HASH_REPORT_UDPv4] = XDP_RSS_TYPE_L4_IPV4_UDP,
> +   [VIRTIO_NET_HASH_REPORT_IPv6] = XDP_RSS_TYPE_L3_IPV6,
> +   [VIRTIO_NET_HASH_REPORT_TCPv6] = XDP_RSS_TYPE_L4_IPV6_TCP,
> +   [VIRTIO_NET_HASH_REPORT_UDPv6] = XDP_RSS_TYPE_L4_IPV6_UDP,
> +   [VIRTIO_NET_HASH_REPORT_IPv6_EX] = XDP_RSS_TYPE_L3_IPV6_EX,
> +   [VIRTIO_NET_HASH_REPORT_TCPv6_EX] = XDP_RSS_TYPE_L4_IPV6_TCP_EX,
> +   [VIRTIO_NET_HASH_REPORT_UDPv6_EX] = XDP_RSS_TYPE_L4_IPV6_UDP_EX
> +};
> +
> +static int virtnet_xdp_rx_hash(const struct xdp_md *_ctx, u32 *hash,
> +  enum xdp_rss_hash_type *rss_type)
> +{
> +   const struct xdp_buff *xdp = (void *)_ctx;
> +   struct virtio_net_hdr_v1_hash *hdr_hash;
> +   struct virtnet_info *vi;
> +   u16 hash_report;
> +
> +   if (!(xdp->rxq->dev->features & NETIF_F_RXHASH))
> +   return -ENODATA;
> +
> +   vi = netdev_priv(xdp->rxq->dev);
> +   hdr_hash = (struct virtio_net_hdr_v1_hash *)(xdp->data - vi->hdr_len);
> +   hash_report = __le16_to_cpu(hdr_hash->hash_report);
> +
> +   if (hash_report >= VIRTIO_NET_HASH_REPORT_MAX_TABLE)
> +   hash_report = VIRTIO_NET_HASH_REPORT_NONE;
> +
> +   *rss_type = virtnet_xdp_rss_type[hash_report];
> +   *hash = __le32_to_cpu(hdr_hash->hash_value);
> +   return 0;
> +}
> +
> +static const struct xdp_metadata_ops virtnet_xdp_metadata_ops = {
> +   .xmo_rx_hash= virtnet_xdp_rx_hash,
> +};
> +
>  static int virtnet_probe(struct virtio_device *vdev)
>  {
> int i, err = -ENOMEM;
> @@ -4747,6 +4788,7 @@ static int virtnet_probe(struct virtio_device *vdev)
>   VIRTIO_NET_RSS_HASH_TYPE_UDP_EX);
>
> dev->hw_features |= NETIF_F_RXHASH;
> +   dev->xdp_metadata_ops = &virtnet_xdp_metadata_ops;
> }
>
> if (vi->has_rss_hash_report)
> diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
> index cc65ef0f3c3e..3ee695450096 100644
> --- a/include/uapi/linux/virtio_net.h
> +++ b/include/uapi/linux/virtio_net.h
> @@ -176,6 +176,7 @@ struct virtio_net_hdr_v1_hash {
>  #define VIRTIO_NET_HASH_REPORT_IPv6_EX 7
>  #define VIRTIO_NET_HASH_REPORT_TCPv6_EX8
>  #define VIRTIO_NET_HASH_REPORT_UDPv6_EX9
> +#define VIRTIO_NET_HASH_REPORT_MAX_TABLE  10

This should not be part of uAPI. It may confuse the userspace.

Others look good.

Thanks

> __le16 hash_report;
> __le16 padding;
>  };
> --
> 2.40.1
>




Re: [PATCH v5 5/5] Documentation: Add reconnect process for VDUSE

2024-04-15 Thread Jason Wang
On Fri, Apr 12, 2024 at 9:31 PM Cindy Lu  wrote:
>
> Add a document explaining the reconnect process, including what the
> Userspace App needs to do and how it works with the kernel.
>
> Signed-off-by: Cindy Lu 
> ---
>  Documentation/userspace-api/vduse.rst | 41 +++
>  1 file changed, 41 insertions(+)
>
> diff --git a/Documentation/userspace-api/vduse.rst 
> b/Documentation/userspace-api/vduse.rst
> index bdb880e01132..7faa83462e78 100644
> --- a/Documentation/userspace-api/vduse.rst
> +++ b/Documentation/userspace-api/vduse.rst
> @@ -231,3 +231,44 @@ able to start the dataplane processing as follows:
> after the used ring is filled.
>
>  For more details on the uAPI, please see include/uapi/linux/vduse.h.
> +
> +HOW VDUSE devices reconnection works
> +
> +1. What is reconnection?
> +
> +   When the userspace application loads, it should establish a connection
> +   to the vduse kernel device. Sometimes,the userspace application exists,

I guess you meant "exits"? If yes, it should be better to say "exits
unexpectedly"

> +   and we want to support its restart and connect to the kernel device again
> +
> +2. How can I support reconnection in a userspace application?

Better to say "How reconnection is supported"?

> +
> +2.1 During initialization, the userspace application should first verify the
> +existence of the device "/dev/vduse/vduse_name".
> +If it doesn't exist, it means this is the first-time for connection. 
> goto step 2.2
> +If it exists, it means this is a reconnection, and we should goto step 
> 2.3
> +
> +2.2 Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
> +/dev/vduse/control.
> +When ioctl(VDUSE_CREATE_DEV) is called, kernel allocates memory for
> +the reconnect information. The total memory size is PAGE_SIZE*vq_mumber.

I think we need to mention that this should be part of the previous
"VDUSE devices are created as follows"?

> +
> +2.3 Check if the information is suitable for reconnect
> +If this is reconnection :
> +Before attempting to reconnect, The userspace application needs to use 
> the
> +ioctl(VDUSE_DEV_GET_CONFIG, VDUSE_DEV_GET_STATUS, 
> VDUSE_DEV_GET_FEATURES...)
> +to get the information from kernel.
> +Please review the information and confirm if it is suitable to reconnect.

Need to define "review" here and how to decide if it is not suitable
to reconnect.

> +
> +2.4 Userspace application needs to mmap the memory to userspace
> +The userspace application requires mapping one page for every vq. These 
> pages
> +should be used to save vq-related information during system running.

Not a native speaker, but it looks better with

"should be used by the userspace to store virtqueue specific information".

> Additionally,
> +the application must define its own structure to store information for 
> reconnection.
> +
> +2.5 Completed the initialization and running the application.
> +While the application is running, it is important to store relevant 
> information
> +about reconnections in mapped pages.

I think we need some link/code examples to demonstrate what needs to be stored.

> When calling the ioctl VDUSE_VQ_GET_INFO to
> +get vq information, it's necessary to check whether it's a reconnection.

Better with some examples of codes.

> If it is
> +a reconnection, the vq-related information must be get from the mapped 
> pages.
> +
> +2.6 When the Userspace application exits, it is necessary to unmap all the
> +pages for reconnection

This seems to be unnecessary, for example there could be an unexpected exit.

Thanks

> --
> 2.43.0
>




Re: [PATCH v2] vhost-vdpa: Remove usage of the deprecated ida_simple_xx() API

2024-04-15 Thread Jason Wang
On Sun, Apr 14, 2024 at 6:04 PM Christophe JAILLET
 wrote:
>
> ida_alloc() and ida_free() should be preferred to the deprecated
> ida_simple_get() and ida_simple_remove().
>
> Note that the upper limit of ida_simple_get() is exclusive, but the one of
> ida_alloc_max() is inclusive. So a -1 has been added when needed.
>
> Signed-off-by: Christophe JAILLET 
> Reviewed-by: Simon Horman 

Acked-by: Jason Wang 

Thanks




Re: [PATCH] drivers/virtio: delayed configuration descriptor flags

2024-04-08 Thread Jason Wang
On Tue, Apr 9, 2024 at 1:27 AM ni.liqiang  wrote:
>
> In our testing of the virtio hardware accelerator, we found that
> configuring the flags of the descriptor after addr and len,
> as implemented in DPDK, seems to be more friendly to the hardware.
>
> In our Virtio hardware implementation tests, using the default
> open-source code, the hardware's bulk reads ensure performance
> but correctness is compromised. If we refer to the implementation code
> of DPDK, placing the flags configuration of the descriptor
> after addr and len, virtio backend can function properly based on
> our hardware accelerator.
>
> I am somewhat puzzled by this. From a software process perspective,
> it seems that there should be no difference whether
> the flags configuration of the descriptor is before or after addr and len.
> However, this is not the case according to experimental test results.
> We would like to know if such a change in the configuration order
> is reasonable and acceptable?

Harmless but a hint that there's a bug in your hardware?

More below

>
> Thanks.
>
> Signed-off-by: ni.liqiang 
> Reviewed-by: jin.qi 
> Tested-by: jin.qi 
> Cc: ni.liqiang 
> ---
>  drivers/virtio/virtio_ring.c | 9 +
>  1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 6f7e5010a673..bea2c2fb084e 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -1472,15 +1472,16 @@ static inline int virtqueue_add_packed(struct 
> virtqueue *_vq,
> flags = cpu_to_le16(vq->packed.avail_used_flags |
> (++c == total_sg ? 0 : VRING_DESC_F_NEXT) 
> |
> (n < out_sgs ? 0 : VRING_DESC_F_WRITE));
> -   if (i == head)
> -   head_flags = flags;
> -   else
> -   desc[i].flags = flags;
>
> desc[i].addr = cpu_to_le64(addr);
> desc[i].len = cpu_to_le32(sg->length);
> desc[i].id = cpu_to_le16(id);
>
> +   if (i == head)
> +   head_flags = flags;
> +   else
> +   desc[i].flags = flags;
> +

The head_flags are not updated at this time, so descriptors are not
available, the device should not start to read the chain:

/*
 * A driver MUST NOT make the first descriptor in the list
 * available before all subsequent descriptors comprising
 * the list are made available.
 */
virtio_wmb(vq->weak_barriers);
vq->packed.vring.desc[head].flags = head_flags;
vq->num_added += descs_used;

It looks like your device does speculation reading on the descriptors
that are not available?

Thanks

> if (unlikely(vq->use_dma_api)) {
> vq->packed.desc_extra[curr].addr = addr;
> vq->packed.desc_extra[curr].len = sg->length;
> --
> 2.34.1
>
>




Re: [PATCH v3] vp_vdpa: fix the method of calculating vectors

2024-04-08 Thread Jason Wang
   }
>
> snprintf(vp_vdpa->msix_name, VP_VDPA_NAME_SIZE, 
> "vp-vdpa[%s]-config\n",
> -pci_name(pdev));
> -   irq = pci_irq_vector(pdev, queues);
> +   pci_name(pdev));
> +   irq = pci_irq_vector(pdev, msix_vec);
> ret = devm_request_irq(&pdev->dev, irq, vp_vdpa_config_handler, 0,
>vp_vdpa->msix_name, vp_vdpa);
>     if (ret) {
> dev_err(&pdev->dev,
> -   "vp_vdpa: fail to request irq for vq %d\n", i);
> +   "vp_vdpa: fail to request irq for config\n");
> goto err;
> }
> -   vp_modern_config_vector(mdev, queues);
> +   vp_modern_config_vector(mdev, msix_vec);
> vp_vdpa->config_irq = irq;
> -

Unnecessary changes.

Others look good.

Acked-by: Jason Wang 

Thanks

> return 0;
>  err:
> vp_vdpa_free_irq(vp_vdpa);
> --
> 2.43.0
>




Re: [PATCH net-next v5] virtio_net: Support RX hash XDP hint

2024-04-08 Thread Jason Wang
On Mon, Apr 1, 2024 at 11:38 AM Liang Chen  wrote:
>
> On Thu, Feb 29, 2024 at 4:37 PM Liang Chen  wrote:
> >
> > On Tue, Feb 27, 2024 at 4:42 AM John Fastabend  
> > wrote:
> > >
> > > Jason Wang wrote:
> > > > On Fri, Feb 23, 2024 at 9:42 AM Xuan Zhuo  
> > > > wrote:
> > > > >
> > > > > On Fri, 09 Feb 2024 13:57:25 +0100, Paolo Abeni  
> > > > > wrote:
> > > > > > On Fri, 2024-02-09 at 18:39 +0800, Liang Chen wrote:
> > > > > > > On Wed, Feb 7, 2024 at 10:27 PM Paolo Abeni  
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > On Wed, 2024-02-07 at 10:54 +0800, Liang Chen wrote:
> > > > > > > > > On Tue, Feb 6, 2024 at 6:44 PM Paolo Abeni 
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > On Sat, 2024-02-03 at 10:56 +0800, Liang Chen wrote:
> > > > > > > > > > > On Sat, Feb 3, 2024 at 12:20 AM Jesper Dangaard Brouer 
> > > > > > > > > > >  wrote:
> > > > > > > > > > > > On 02/02/2024 13.11, Liang Chen wrote:
> > > > > > > > > > [...]
> > > > > > > > > > > > > @@ -1033,6 +1039,16 @@ static void 
> > > > > > > > > > > > > put_xdp_frags(struct xdp_buff *xdp)
> > > > > > > > > > > > >   }
> > > > > > > > > > > > >   }
> > > > > > > > > > > > >
> > > > > > > > > > > > > +static void virtnet_xdp_save_rx_hash(struct 
> > > > > > > > > > > > > virtnet_xdp_buff *virtnet_xdp,
> > > > > > > > > > > > > +  struct net_device 
> > > > > > > > > > > > > *dev,
> > > > > > > > > > > > > +  struct 
> > > > > > > > > > > > > virtio_net_hdr_v1_hash *hdr_hash)
> > > > > > > > > > > > > +{
> > > > > > > > > > > > > + if (dev->features & NETIF_F_RXHASH) {
> > > > > > > > > > > > > + virtnet_xdp->hash_value = 
> > > > > > > > > > > > > hdr_hash->hash_value;
> > > > > > > > > > > > > + virtnet_xdp->hash_report = 
> > > > > > > > > > > > > hdr_hash->hash_report;
> > > > > > > > > > > > > + }
> > > > > > > > > > > > > +}
> > > > > > > > > > > > > +
> > > > > > > > > > > >
> > > > > > > > > > > > Would it be possible to store a pointer to hdr_hash in 
> > > > > > > > > > > > virtnet_xdp_buff,
> > > > > > > > > > > > with the purpose of delaying extracting this, until and 
> > > > > > > > > > > > only if XDP
> > > > > > > > > > > > bpf_prog calls the kfunc?
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > That seems to be the way v1 works,
> > > > > > > > > > > https://lore.kernel.org/all/20240122102256.261374-1-liangchen.li...@gmail.com/
> > > > > > > > > > > . But it was pointed out that the inline header may be 
> > > > > > > > > > > overwritten by
> > > > > > > > > > > the xdp prog, so the hash is copied out to maintain its 
> > > > > > > > > > > integrity.
> > > > > > > > > >
> > > > > > > > > > Why? isn't XDP supposed to get write access only to the pkt
> > > > > > > > > > contents/buffer?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Normally, an XDP program accesses only the packet data. 
> > > > > > > > > However,
> > > > > > &

Re: Re: [PATCH v3] vhost/vdpa: Add MSI translation tables to iommu for software-managed MSI

2024-04-08 Thread Jason Wang
On Wed, Apr 3, 2024 at 10:47 AM tab  wrote:
>
> > >
> > > On Fri, Mar 29, 2024 at 11:55:50AM +0800, Jason Wang wrote:
> > > > On Wed, Mar 27, 2024 at 5:08 PM Jason Wang  wrote:
> > > > >
> > > > > On Thu, Mar 21, 2024 at 3:00 PM Michael S. Tsirkin  
> > > > > wrote:
> > > > > >
> > > > > > On Wed, Mar 20, 2024 at 06:19:12PM +0800, Wang Rong wrote:
> > > > > > > From: Rong Wang 
> > > > > > >
> > > > > > > Once enable iommu domain for one device, the MSI
> > > > > > > translation tables have to be there for software-managed MSI.
> > > > > > > Otherwise, platform with software-managed MSI without an
> > > > > > > irq bypass function, can not get a correct memory write event
> > > > > > > from pcie, will not get irqs.
> > > > > > > The solution is to obtain the MSI phy base address from
> > > > > > > iommu reserved region, and set it to iommu MSI cookie,
> > > > > > > then translation tables will be created while request irq.
> > > > > > >
> > > > > > > Change log
> > > > > > > --
> > > > > > >
> > > > > > > v1->v2:
> > > > > > > - add resv iotlb to avoid overlap mapping.
> > > > > > > v2->v3:
> > > > > > > - there is no need to export the iommu symbol anymore.
> > > > > > >
> > > > > > > Signed-off-by: Rong Wang 
> > > > > >
> > > > > > There's in interest to keep extending vhost iotlb -
> > > > > > we should just switch over to iommufd which supports
> > > > > > this already.
> > > > >
> > > > > IOMMUFD is good but VFIO supports this before IOMMUFD. This patch
> > > > > makes vDPA run without a backporting of full IOMMUFD in the production
> > > > > environment. I think it's worth.
> > > > >
> > > > > If you worry about the extension, we can just use the vhost iotlb
> > > > > existing facility to do this.
> > > > >
> > > > > Thanks
> > > >
> > > > Btw, Wang Rong,
> > > >
> > > > It looks that Cindy does have the bandwidth in working for IOMMUFD 
> > > > support.
> > >
> > > I think you mean she does not.
> >
> > Yes, you are right.
> >
> > Thanks
>
> I need to discuss internally, and there may be someone else will do that.
>
> Thanks.

Ok, please let us know if you have a conclusion.

Thanks

>
> >
> > >
> > > > Do you have the will to do that?
> > > >
> > > > Thanks
> > >
>
>
>
>
> --
> 发自我的网易邮箱平板适配版
> 
>
>
> - Original Message -
> From: "Jason Wang" 
> To: "Michael S. Tsirkin" 
> Cc: "Wang Rong" , k...@vger.kernel.org, 
> virtualizat...@lists.linux.dev, net...@vger.kernel.org, 
> linux-kernel@vger.kernel.org, "Cindy Lu" 
> Sent: Fri, 29 Mar 2024 18:39:54 +0800
> Subject: Re: [PATCH v3] vhost/vdpa: Add MSI translation tables to iommu for 
> software-managed MSI
>
> On Fri, Mar 29, 2024 at 5:13 PM Michael S. Tsirkin  wrote:
> >
> > On Fri, Mar 29, 2024 at 11:55:50AM +0800, Jason Wang wrote:
> > > On Wed, Mar 27, 2024 at 5:08 PM Jason Wang  wrote:
> > > >
> > > > On Thu, Mar 21, 2024 at 3:00 PM Michael S. Tsirkin  
> > > > wrote:
> > > > >
> > > > > On Wed, Mar 20, 2024 at 06:19:12PM +0800, Wang Rong wrote:
> > > > > > From: Rong Wang 
> > > > > >
> > > > > > Once enable iommu domain for one device, the MSI
> > > > > > translation tables have to be there for software-managed MSI.
> > > > > > Otherwise, platform with software-managed MSI without an
> > > > > > irq bypass function, can not get a correct memory write event
> > > > > > from pcie, will not get irqs.
> > > > > > The solution is to obtain the MSI phy base address from
> > > > > > iommu reserved region, and set it to iommu MSI cookie,
> > > > > > then translation tables will be created while request irq.
> > > > > >
> > > > > > Change log
> > > > > > --
> > > > > >
> > > > > > v1->v2:
> > > > > > - add resv iotlb to avoid overlap mapping.
> > > > > > v2->v3:
> > > > > > - there is no need to export the iommu symbol anymore.
> > > > > >
> > > > > > Signed-off-by: Rong Wang 
> > > > >
> > > > > There's in interest to keep extending vhost iotlb -
> > > > > we should just switch over to iommufd which supports
> > > > > this already.
> > > >
> > > > IOMMUFD is good but VFIO supports this before IOMMUFD. This patch
> > > > makes vDPA run without a backporting of full IOMMUFD in the production
> > > > environment. I think it's worth.
> > > >
> > > > If you worry about the extension, we can just use the vhost iotlb
> > > > existing facility to do this.
> > > >
> > > > Thanks
> > >
> > > Btw, Wang Rong,
> > >
> > > It looks that Cindy does have the bandwidth in working for IOMMUFD 
> > > support.
> >
> > I think you mean she does not.
>
> Yes, you are right.
>
> Thanks
>
> >
> > > Do you have the will to do that?
> > >
> > > Thanks
> >




Re: [PATCH v3] vhost/vdpa: Add MSI translation tables to iommu for software-managed MSI

2024-04-06 Thread Jason Wang
On Fri, Mar 29, 2024 at 6:42 PM Michael S. Tsirkin  wrote:
>
> On Fri, Mar 29, 2024 at 06:39:33PM +0800, Jason Wang wrote:
> > On Fri, Mar 29, 2024 at 5:13 PM Michael S. Tsirkin  wrote:
> > >
> > > On Wed, Mar 27, 2024 at 05:08:57PM +0800, Jason Wang wrote:
> > > > On Thu, Mar 21, 2024 at 3:00 PM Michael S. Tsirkin  
> > > > wrote:
> > > > >
> > > > > On Wed, Mar 20, 2024 at 06:19:12PM +0800, Wang Rong wrote:
> > > > > > From: Rong Wang 
> > > > > >
> > > > > > Once enable iommu domain for one device, the MSI
> > > > > > translation tables have to be there for software-managed MSI.
> > > > > > Otherwise, platform with software-managed MSI without an
> > > > > > irq bypass function, can not get a correct memory write event
> > > > > > from pcie, will not get irqs.
> > > > > > The solution is to obtain the MSI phy base address from
> > > > > > iommu reserved region, and set it to iommu MSI cookie,
> > > > > > then translation tables will be created while request irq.
> > > > > >
> > > > > > Change log
> > > > > > --
> > > > > >
> > > > > > v1->v2:
> > > > > > - add resv iotlb to avoid overlap mapping.
> > > > > > v2->v3:
> > > > > > - there is no need to export the iommu symbol anymore.
> > > > > >
> > > > > > Signed-off-by: Rong Wang 
> > > > >
> > > > > There's in interest to keep extending vhost iotlb -
> > > > > we should just switch over to iommufd which supports
> > > > > this already.
> > > >
> > > > IOMMUFD is good but VFIO supports this before IOMMUFD.
> > >
> > > You mean VFIO migrated to IOMMUFD but of course they keep supporting
> > > their old UAPI?
> >
> > I meant VFIO support software managed MSI before IOMMUFD.
>
> And then they switched over and stopped adding new IOMMU
> related features. And so should vdpa?

For some cloud vendors, it means vDPA can't be used until

1) IOMMUFD support for vDPA is supported by upstream
2) IOMMUFD is backported

1) might be fine but 2) might be impossible.

Assuming IOMMUFD hasn't been done for vDPA. Adding small features like
this seems reasonable (especially considering it is supported by the
"legacy" VFIO container).

Thanks

>
>
> > > OK and point being?
> > >
> > > > This patch
> > > > makes vDPA run without a backporting of full IOMMUFD in the production
> > > > environment. I think it's worth.
> > >
> > > Where do we stop? saying no to features is the only tool maintainers
> > > have to make cleanups happen, otherwise people will just keep piling
> > > stuff up.
> >
> > I think we should not have more features than VFIO without IOMMUFD.
> >
> > Thanks
> >
> > >
> > > > If you worry about the extension, we can just use the vhost iotlb
> > > > existing facility to do this.
> > > >
> > > > Thanks
> > > >
> > > > >
> > > > > > ---
> > > > > >  drivers/vhost/vdpa.c | 59 
> > > > > > +---
> > > > > >  1 file changed, 56 insertions(+), 3 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> > > > > > index ba52d128aeb7..28b56b10372b 100644
> > > > > > --- a/drivers/vhost/vdpa.c
> > > > > > +++ b/drivers/vhost/vdpa.c
> > > > > > @@ -49,6 +49,7 @@ struct vhost_vdpa {
> > > > > >   struct completion completion;
> > > > > >   struct vdpa_device *vdpa;
> > > > > >   struct hlist_head as[VHOST_VDPA_IOTLB_BUCKETS];
> > > > > > + struct vhost_iotlb resv_iotlb;
> > > > > >   struct device dev;
> > > > > >   struct cdev cdev;
> > > > > >   atomic_t opened;
> > > > > > @@ -247,6 +248,7 @@ static int _compat_vdpa_reset(struct vhost_vdpa 
> > > > > > *v)
> > > > > >  static int vhost_vdpa_reset(struct vhost_vdpa *v)
> > > > > >  {
> > > > > >   v->in_batch = 0;
> > > > > > + vhost_iotlb_reset(&v->resv_iotlb);
> > > > > >   retu

Re: [PATCH] vhost-vdpa: change ioctl # for VDPA_GET_VRING_SIZE

2024-04-06 Thread Jason Wang
On Wed, Apr 3, 2024 at 5:21 AM Michael S. Tsirkin  wrote:
>
> VDPA_GET_VRING_SIZE by mistake uses the already occupied
> ioctl # 0x80 and we never noticed - it happens to work
> because the direction and size are different, but confuses
> tools such as perf which like to look at just the number,
> and breaks the extra robustness of the ioctl numbering macros.
>
> To fix, sort the entries and renumber the ioctl - not too late
> since it wasn't in any released kernels yet.
>
> Cc: Arnaldo Carvalho de Melo 
> Reported-by: Namhyung Kim 
> Fixes: x ("vhost-vdpa: uapi to support reporting per vq size")
> Cc: "Zhu Lingshan" 
> Signed-off-by: Michael S. Tsirkin 

Acked-by: Jason Wang 

Thanks

> ---
>
> Build tested only - userspace patches using this will have to adjust.
> I will merge this in a week or so unless I hear otherwise,
> and afterwards perf can update there header.
>
>  include/uapi/linux/vhost.h | 15 ---
>  1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
> index bea697390613..b95dd84eef2d 100644
> --- a/include/uapi/linux/vhost.h
> +++ b/include/uapi/linux/vhost.h
> @@ -179,12 +179,6 @@
>  /* Get the config size */
>  #define VHOST_VDPA_GET_CONFIG_SIZE _IOR(VHOST_VIRTIO, 0x79, __u32)
>
> -/* Get the count of all virtqueues */
> -#define VHOST_VDPA_GET_VQS_COUNT   _IOR(VHOST_VIRTIO, 0x80, __u32)
> -
> -/* Get the number of virtqueue groups. */
> -#define VHOST_VDPA_GET_GROUP_NUM   _IOR(VHOST_VIRTIO, 0x81, __u32)
> -
>  /* Get the number of address spaces. */
>  #define VHOST_VDPA_GET_AS_NUM  _IOR(VHOST_VIRTIO, 0x7A, unsigned int)
>
> @@ -228,10 +222,17 @@
>  #define VHOST_VDPA_GET_VRING_DESC_GROUP _IOWR(VHOST_VIRTIO, 0x7F, 
>   \
>   struct vhost_vring_state)
>
> +
> +/* Get the count of all virtqueues */
> +#define VHOST_VDPA_GET_VQS_COUNT   _IOR(VHOST_VIRTIO, 0x80, __u32)
> +
> +/* Get the number of virtqueue groups. */
> +#define VHOST_VDPA_GET_GROUP_NUM   _IOR(VHOST_VIRTIO, 0x81, __u32)
> +
>  /* Get the queue size of a specific virtqueue.
>   * userspace set the vring index in vhost_vring_state.index
>   * kernel set the queue size in vhost_vring_state.num
>   */
> -#define VHOST_VDPA_GET_VRING_SIZE  _IOWR(VHOST_VIRTIO, 0x80,   \
> +#define VHOST_VDPA_GET_VRING_SIZE  _IOWR(VHOST_VIRTIO, 0x82,   \
>   struct vhost_vring_state)
>  #endif
> --
> MST
>




Re: [PATCH v3] vhost/vdpa: Add MSI translation tables to iommu for software-managed MSI

2024-03-29 Thread Jason Wang
On Fri, Mar 29, 2024 at 5:13 PM Michael S. Tsirkin  wrote:
>
> On Fri, Mar 29, 2024 at 11:55:50AM +0800, Jason Wang wrote:
> > On Wed, Mar 27, 2024 at 5:08 PM Jason Wang  wrote:
> > >
> > > On Thu, Mar 21, 2024 at 3:00 PM Michael S. Tsirkin  
> > > wrote:
> > > >
> > > > On Wed, Mar 20, 2024 at 06:19:12PM +0800, Wang Rong wrote:
> > > > > From: Rong Wang 
> > > > >
> > > > > Once enable iommu domain for one device, the MSI
> > > > > translation tables have to be there for software-managed MSI.
> > > > > Otherwise, platform with software-managed MSI without an
> > > > > irq bypass function, can not get a correct memory write event
> > > > > from pcie, will not get irqs.
> > > > > The solution is to obtain the MSI phy base address from
> > > > > iommu reserved region, and set it to iommu MSI cookie,
> > > > > then translation tables will be created while request irq.
> > > > >
> > > > > Change log
> > > > > --
> > > > >
> > > > > v1->v2:
> > > > > - add resv iotlb to avoid overlap mapping.
> > > > > v2->v3:
> > > > > - there is no need to export the iommu symbol anymore.
> > > > >
> > > > > Signed-off-by: Rong Wang 
> > > >
> > > > There's in interest to keep extending vhost iotlb -
> > > > we should just switch over to iommufd which supports
> > > > this already.
> > >
> > > IOMMUFD is good but VFIO supports this before IOMMUFD. This patch
> > > makes vDPA run without a backporting of full IOMMUFD in the production
> > > environment. I think it's worth.
> > >
> > > If you worry about the extension, we can just use the vhost iotlb
> > > existing facility to do this.
> > >
> > > Thanks
> >
> > Btw, Wang Rong,
> >
> > It looks that Cindy does have the bandwidth in working for IOMMUFD support.
>
> I think you mean she does not.

Yes, you are right.

Thanks

>
> > Do you have the will to do that?
> >
> > Thanks
>




Re: [PATCH v3] vhost/vdpa: Add MSI translation tables to iommu for software-managed MSI

2024-03-29 Thread Jason Wang
On Fri, Mar 29, 2024 at 5:13 PM Michael S. Tsirkin  wrote:
>
> On Wed, Mar 27, 2024 at 05:08:57PM +0800, Jason Wang wrote:
> > On Thu, Mar 21, 2024 at 3:00 PM Michael S. Tsirkin  wrote:
> > >
> > > On Wed, Mar 20, 2024 at 06:19:12PM +0800, Wang Rong wrote:
> > > > From: Rong Wang 
> > > >
> > > > Once enable iommu domain for one device, the MSI
> > > > translation tables have to be there for software-managed MSI.
> > > > Otherwise, platform with software-managed MSI without an
> > > > irq bypass function, can not get a correct memory write event
> > > > from pcie, will not get irqs.
> > > > The solution is to obtain the MSI phy base address from
> > > > iommu reserved region, and set it to iommu MSI cookie,
> > > > then translation tables will be created while request irq.
> > > >
> > > > Change log
> > > > --
> > > >
> > > > v1->v2:
> > > > - add resv iotlb to avoid overlap mapping.
> > > > v2->v3:
> > > > - there is no need to export the iommu symbol anymore.
> > > >
> > > > Signed-off-by: Rong Wang 
> > >
> > > There's in interest to keep extending vhost iotlb -
> > > we should just switch over to iommufd which supports
> > > this already.
> >
> > IOMMUFD is good but VFIO supports this before IOMMUFD.
>
> You mean VFIO migrated to IOMMUFD but of course they keep supporting
> their old UAPI?

I meant VFIO support software managed MSI before IOMMUFD.

> OK and point being?
>
> > This patch
> > makes vDPA run without a backporting of full IOMMUFD in the production
> > environment. I think it's worth.
>
> Where do we stop? saying no to features is the only tool maintainers
> have to make cleanups happen, otherwise people will just keep piling
> stuff up.

I think we should not have more features than VFIO without IOMMUFD.

Thanks

>
> > If you worry about the extension, we can just use the vhost iotlb
> > existing facility to do this.
> >
> > Thanks
> >
> > >
> > > > ---
> > > >  drivers/vhost/vdpa.c | 59 +---
> > > >  1 file changed, 56 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> > > > index ba52d128aeb7..28b56b10372b 100644
> > > > --- a/drivers/vhost/vdpa.c
> > > > +++ b/drivers/vhost/vdpa.c
> > > > @@ -49,6 +49,7 @@ struct vhost_vdpa {
> > > >   struct completion completion;
> > > >   struct vdpa_device *vdpa;
> > > >   struct hlist_head as[VHOST_VDPA_IOTLB_BUCKETS];
> > > > + struct vhost_iotlb resv_iotlb;
> > > >   struct device dev;
> > > >   struct cdev cdev;
> > > >   atomic_t opened;
> > > > @@ -247,6 +248,7 @@ static int _compat_vdpa_reset(struct vhost_vdpa *v)
> > > >  static int vhost_vdpa_reset(struct vhost_vdpa *v)
> > > >  {
> > > >   v->in_batch = 0;
> > > > > + vhost_iotlb_reset(&v->resv_iotlb);
> > > >   return _compat_vdpa_reset(v);
> > > >  }
> > > >
> > > > @@ -1219,10 +1221,15 @@ static int 
> > > > vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
> > > >   msg->iova + msg->size - 1 > v->range.last)
> > > >   return -EINVAL;
> > > >
> > > > > + if (vhost_iotlb_itree_first(&v->resv_iotlb, msg->iova,
> > > > + msg->iova + msg->size - 1))
> > > > + return -EINVAL;
> > > > +
> > > >   if (vhost_iotlb_itree_first(iotlb, msg->iova,
> > > >   msg->iova + msg->size - 1))
> > > >   return -EEXIST;
> > > >
> > > > +
> > > >   if (vdpa->use_va)
> > > >   return vhost_vdpa_va_map(v, iotlb, msg->iova, msg->size,
> > > >msg->uaddr, msg->perm);
> > > > @@ -1307,6 +1314,45 @@ static ssize_t vhost_vdpa_chr_write_iter(struct 
> > > > kiocb *iocb,
> > > >   return vhost_chr_write_iter(dev, from);
> > > >  }
> > > >
> > > > +static int vhost_vdpa_resv_iommu_region(struct iommu_domain *domain, 
> > > > struct device *dma_dev,
> 

Re: [PATCH v3] vhost/vdpa: Add MSI translation tables to iommu for software-managed MSI

2024-03-28 Thread Jason Wang
On Wed, Mar 27, 2024 at 5:08 PM Jason Wang  wrote:
>
> On Thu, Mar 21, 2024 at 3:00 PM Michael S. Tsirkin  wrote:
> >
> > On Wed, Mar 20, 2024 at 06:19:12PM +0800, Wang Rong wrote:
> > > From: Rong Wang 
> > >
> > > Once enable iommu domain for one device, the MSI
> > > translation tables have to be there for software-managed MSI.
> > > Otherwise, platform with software-managed MSI without an
> > > irq bypass function, can not get a correct memory write event
> > > from pcie, will not get irqs.
> > > The solution is to obtain the MSI phy base address from
> > > iommu reserved region, and set it to iommu MSI cookie,
> > > then translation tables will be created while request irq.
> > >
> > > Change log
> > > --
> > >
> > > v1->v2:
> > > - add resv iotlb to avoid overlap mapping.
> > > v2->v3:
> > > - there is no need to export the iommu symbol anymore.
> > >
> > > Signed-off-by: Rong Wang 
> >
> > There's in interest to keep extending vhost iotlb -
> > we should just switch over to iommufd which supports
> > this already.
>
> IOMMUFD is good but VFIO supports this before IOMMUFD. This patch
> makes vDPA run without a backporting of full IOMMUFD in the production
> environment. I think it's worth.
>
> If you worry about the extension, we can just use the vhost iotlb
> existing facility to do this.
>
> Thanks

Btw, Wang Rong,

It looks that Cindy does have the bandwidth in working for IOMMUFD support.

Do you have the will to do that?

Thanks




Re: [PATCH v2 1/1] vhost: Added pad cleanup if vnet_hdr is not present.

2024-03-27 Thread Jason Wang
On Thu, Mar 28, 2024 at 7:44 AM Andrew Melnychenko  wrote:
>
> When the Qemu launched with vhost but without tap vnet_hdr,
> vhost tries to copy vnet_hdr from socket iter with size 0
> to the page that may contain some trash.
> That trash can be interpreted as unpredictable values for
> vnet_hdr.
> That leads to dropping some packets and in some cases to
> stalling vhost routine when the vhost_net tries to process
> packets and fails in a loop.
>
> Qemu options:
>   -netdev tap,vhost=on,vnet_hdr=off,...
>
> From security point of view, wrong values on field used later
> tap's tap_get_user_xdp() and will affect skb gso and options.
> Later the header(and data in headroom) should not be used by the stack.
> Using custom socket as a backend to vhost_net can reveal some data
> in the vnet_hdr, although it would require kernel access to implement.
>
> The issue happens because the value of sock_len in virtqueue is 0.
> That value is set at vhost_net_set_features() with
> VHOST_NET_F_VIRTIO_NET_HDR, also it's set to zero at device open()
> and reset() routine.
> So, currently, to trigger the issue, we need to set up qemu with
> vhost=on,vnet_hdr=off, or do not configure vhost in the custom program.
>
> Signed-off-by: Andrew Melnychenko 

Acked-by: Jason Wang 

It seems it has been merged by Michael.

Thanks

> ---
>  drivers/vhost/net.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index f2ed7167c848..57411ac2d08b 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -735,6 +735,9 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue 
> *nvq,
> hdr = buf;
> gso = >gso;
>
> +   if (!sock_hlen)
> +   memset(buf, 0, pad);
> +
> if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
> vhost16_to_cpu(vq, gso->csum_start) +
> vhost16_to_cpu(vq, gso->csum_offset) + 2 >
> --
> 2.43.0
>




Re: [PATCH v3 3/3] vhost: Improve vhost_get_avail_idx() with smp_rmb()

2024-03-27 Thread Jason Wang
On Thu, Mar 28, 2024 at 8:22 AM Gavin Shan  wrote:
>
> All the callers of vhost_get_avail_idx() are concerned to the memory
> barrier, imposed by smp_rmb() to ensure the order of the available
> ring entry read and avail_idx read.
>
> Improve vhost_get_avail_idx() so that smp_rmb() is executed when
> the avail_idx is advanced. With it, the callers needn't to worry
> about the memory barrier.
>
> Suggested-by: Michael S. Tsirkin 
> Signed-off-by: Gavin Shan 

Acked-by: Jason Wang 

Thanks




Re: [PATCH v3] vhost/vdpa: Add MSI translation tables to iommu for software-managed MSI

2024-03-27 Thread Jason Wang
On Thu, Mar 21, 2024 at 3:00 PM Michael S. Tsirkin  wrote:
>
> On Wed, Mar 20, 2024 at 06:19:12PM +0800, Wang Rong wrote:
> > From: Rong Wang 
> >
> > Once enable iommu domain for one device, the MSI
> > translation tables have to be there for software-managed MSI.
> > Otherwise, platform with software-managed MSI without an
> > irq bypass function, can not get a correct memory write event
> > from pcie, will not get irqs.
> > The solution is to obtain the MSI phy base address from
> > iommu reserved region, and set it to iommu MSI cookie,
> > then translation tables will be created while request irq.
> >
> > Change log
> > --
> >
> > v1->v2:
> > - add resv iotlb to avoid overlap mapping.
> > v2->v3:
> > - there is no need to export the iommu symbol anymore.
> >
> > Signed-off-by: Rong Wang 
>
> There's in interest to keep extending vhost iotlb -
> we should just switch over to iommufd which supports
> this already.

IOMMUFD is good, but VFIO supported this before IOMMUFD. This patch
makes vDPA work without backporting the full IOMMUFD stack in a production
environment. I think it's worthwhile.

If you worry about the extension, we can just use the vhost iotlb
existing facility to do this.

Thanks

>
> > ---
> >  drivers/vhost/vdpa.c | 59 +---
> >  1 file changed, 56 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> > index ba52d128aeb7..28b56b10372b 100644
> > --- a/drivers/vhost/vdpa.c
> > +++ b/drivers/vhost/vdpa.c
> > @@ -49,6 +49,7 @@ struct vhost_vdpa {
> >   struct completion completion;
> >   struct vdpa_device *vdpa;
> >   struct hlist_head as[VHOST_VDPA_IOTLB_BUCKETS];
> > + struct vhost_iotlb resv_iotlb;
> >   struct device dev;
> >   struct cdev cdev;
> >   atomic_t opened;
> > @@ -247,6 +248,7 @@ static int _compat_vdpa_reset(struct vhost_vdpa *v)
> >  static int vhost_vdpa_reset(struct vhost_vdpa *v)
> >  {
> >   v->in_batch = 0;
> > + vhost_iotlb_reset(>resv_iotlb);
> >   return _compat_vdpa_reset(v);
> >  }
> >
> > @@ -1219,10 +1221,15 @@ static int vhost_vdpa_process_iotlb_update(struct 
> > vhost_vdpa *v,
> >   msg->iova + msg->size - 1 > v->range.last)
> >   return -EINVAL;
> >
> > + if (vhost_iotlb_itree_first(>resv_iotlb, msg->iova,
> > + msg->iova + msg->size - 1))
> > + return -EINVAL;
> > +
> >   if (vhost_iotlb_itree_first(iotlb, msg->iova,
> >   msg->iova + msg->size - 1))
> >   return -EEXIST;
> >
> > +
> >   if (vdpa->use_va)
> >   return vhost_vdpa_va_map(v, iotlb, msg->iova, msg->size,
> >msg->uaddr, msg->perm);
> > @@ -1307,6 +1314,45 @@ static ssize_t vhost_vdpa_chr_write_iter(struct 
> > kiocb *iocb,
> >   return vhost_chr_write_iter(dev, from);
> >  }
> >
> > +static int vhost_vdpa_resv_iommu_region(struct iommu_domain *domain, 
> > struct device *dma_dev,
> > + struct vhost_iotlb *resv_iotlb)
> > +{
> > + struct list_head dev_resv_regions;
> > + phys_addr_t resv_msi_base = 0;
> > + struct iommu_resv_region *region;
> > + int ret = 0;
> > + bool with_sw_msi = false;
> > + bool with_hw_msi = false;
> > +
> > + INIT_LIST_HEAD(_resv_regions);
> > + iommu_get_resv_regions(dma_dev, _resv_regions);
> > +
> > + list_for_each_entry(region, _resv_regions, list) {
> > + ret = vhost_iotlb_add_range_ctx(resv_iotlb, region->start,
> > + region->start + region->length - 1,
> > + 0, 0, NULL);
> > + if (ret) {
> > + vhost_iotlb_reset(resv_iotlb);
> > + break;
> > + }
> > +
> > + if (region->type == IOMMU_RESV_MSI)
> > + with_hw_msi = true;
> > +
> > + if (region->type == IOMMU_RESV_SW_MSI) {
> > + resv_msi_base = region->start;
> > + with_sw_msi = true;
> > + }
> > + }
> > +
> > + if (!ret && !with_hw_msi && with_sw_msi)
> > + ret = iommu_get_msi_cookie(domain, resv_msi_base);
> > +
> > + iommu_put_resv_regions(dma_dev, _resv_regions);
> > +
> > + return ret;
> > +}
> > +
> >  static int vhost_vdpa_alloc_domain(struct vhost_vdpa *v)
> >  {
> >   struct vdpa_device *vdpa = v->vdpa;
> > @@ -1335,11 +1381,16 @@ static int vhost_vdpa_alloc_domain(struct 
> > vhost_vdpa *v)
> >
> >   ret = iommu_attach_device(v->domain, dma_dev);
> >   if (ret)
> > - goto err_attach;
> > + goto err_alloc_domain;
> >
> > - return 0;
> > + ret = vhost_vdpa_resv_iommu_region(v->domain, dma_dev, 
> > >resv_iotlb);
> > + if (ret)
> > + goto err_attach_device;
> >
> > -err_attach:
> > + return 0;
> > +err_attach_device:
> > + iommu_detach_device(v->domain, 

Re: [PATCH v2 1/2] vhost: Add smp_rmb() in vhost_vq_avail_empty()

2024-03-27 Thread Jason Wang
On Wed, Mar 27, 2024 at 3:35 PM Gavin Shan  wrote:
>
> On 3/27/24 14:08, Gavin Shan wrote:
> > On 3/27/24 12:44, Jason Wang wrote:
> >> On Wed, Mar 27, 2024 at 10:34 AM Jason Wang  wrote:
> >>> On Wed, Mar 27, 2024 at 7:39 AM Gavin Shan  wrote:
> >>>>
> >>>> A smp_rmb() has been missed in vhost_vq_avail_empty(), spotted by
> >>>> Will Deacon . Otherwise, it's not ensured the
> >>>> available ring entries pushed by guest can be observed by vhost
> >>>> in time, leading to stale available ring entries fetched by vhost
> >>>> in vhost_get_vq_desc(), as reported by Yihuang Yu on NVidia's
> >>>> grace-hopper (ARM64) platform.
> >>>>
> >>>>/home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
> >>>>-accel kvm -machine virt,gic-version=host -cpu host  \
> >>>>-smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
> >>>>-m 4096M,slots=16,maxmem=64G \
> >>>>-object memory-backend-ram,id=mem0,size=4096M\
> >>>> :   \
> >>>>-netdev tap,id=vnet0,vhost=true  \
> >>>>-device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
> >>>> :
> >>>>guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
> >>>>virtio_net virtio0: output.0:id 100 is not a head!
> >>>>
> >>>> Add the missed smp_rmb() in vhost_vq_avail_empty(). Note that it
> >>>> should be safe until vq->avail_idx is changed by commit 275bf960ac697
> >>>> ("vhost: better detection of available buffers").
> >>>>
> >>>> Fixes: 275bf960ac697 ("vhost: better detection of available buffers")
> >>>> Cc:  # v4.11+
> >>>> Reported-by: Yihuang Yu 
> >>>> Signed-off-by: Gavin Shan 
> >>>> ---
> >>>>   drivers/vhost/vhost.c | 11 ++-
> >>>>   1 file changed, 10 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> >>>> index 045f666b4f12..00445ab172b3 100644
> >>>> --- a/drivers/vhost/vhost.c
> >>>> +++ b/drivers/vhost/vhost.c
> >>>> @@ -2799,9 +2799,18 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, 
> >>>> struct vhost_virtqueue *vq)
> >>>>  r = vhost_get_avail_idx(vq, _idx);
> >>>>  if (unlikely(r))
> >>>>  return false;
> >>>> +
> >>>>  vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> >>>> +   if (vq->avail_idx != vq->last_avail_idx) {
> >>>> +   /* Similar to what's done in vhost_get_vq_desc(), we need
> >>>> +* to ensure the available ring entries have been exposed
> >>>> +* by guest.
> >>>> +*/
> >>>
> >>> We need to be more verbose here. For example, which load needs to be
> >>> ordered with which load.
> >>>
> >>> The rmb in vhost_get_vq_desc() is used to order the load of avail idx
> >>> and the load of head. It is paired with e.g virtio_wmb() in
> >>> virtqueue_add_split().
> >>>
> >>> vhost_vq_avail_empty() are mostly used as a hint in
> >>> vhost_net_busy_poll() which is under the protection of the vq mutex.
> >>>
> >>> An exception is the tx_can_batch(), but in that case it doesn't even
> >>> want to read the head.
> >>
> >> Ok, if it is needed only in that path, maybe we can move the barriers 
> >> there.
> >>
> >
> > [cc Will Deacon]
> >
> > Jason, appreciate for your review and comments. I think PATCH[1/2] is
> > the fix for the hypothesis, meaning PATCH[2/2] is the real fix. However,
> > it would be nice to fix all of them in one shoot. I will try with PATCH[2/2]
> > only to see if our issue will disappear or not. However, the issue still
> > exists if PATCH[2/2] is missed.
> >
>
> Jason, PATCH[2/2] is sufficient to fix our current issue. I tried with 
> PATCH[2/2]
> only and unable to hit the issue. However, PATCH[1/2] may be needed by other 
> scenarios.
> So it would be nice to fix them in one shoot.

Yes, see below.

>
>
> > Firstly, We were fa

Re: [PATCH v2 1/2] vhost: Add smp_rmb() in vhost_vq_avail_empty()

2024-03-26 Thread Jason Wang
On Wed, Mar 27, 2024 at 10:34 AM Jason Wang  wrote:
>
> On Wed, Mar 27, 2024 at 7:39 AM Gavin Shan  wrote:
> >
> > A smp_rmb() has been missed in vhost_vq_avail_empty(), spotted by
> > Will Deacon . Otherwise, it's not ensured the
> > available ring entries pushed by guest can be observed by vhost
> > in time, leading to stale available ring entries fetched by vhost
> > in vhost_get_vq_desc(), as reported by Yihuang Yu on NVidia's
> > grace-hopper (ARM64) platform.
> >
> >   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
> >   -accel kvm -machine virt,gic-version=host -cpu host  \
> >   -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
> >   -m 4096M,slots=16,maxmem=64G \
> >   -object memory-backend-ram,id=mem0,size=4096M\
> >:   \
> >   -netdev tap,id=vnet0,vhost=true  \
> >   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
> >:
> >   guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
> >   virtio_net virtio0: output.0:id 100 is not a head!
> >
> > Add the missed smp_rmb() in vhost_vq_avail_empty(). Note that it
> > should be safe until vq->avail_idx is changed by commit 275bf960ac697
> > ("vhost: better detection of available buffers").
> >
> > Fixes: 275bf960ac697 ("vhost: better detection of available buffers")
> > Cc:  # v4.11+
> > Reported-by: Yihuang Yu 
> > Signed-off-by: Gavin Shan 
> > ---
> >  drivers/vhost/vhost.c | 11 ++-
> >  1 file changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 045f666b4f12..00445ab172b3 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -2799,9 +2799,18 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, 
> > struct vhost_virtqueue *vq)
> > r = vhost_get_avail_idx(vq, _idx);
> > if (unlikely(r))
> > return false;
> > +
> > vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> > +   if (vq->avail_idx != vq->last_avail_idx) {
> > +   /* Similar to what's done in vhost_get_vq_desc(), we need
> > +* to ensure the available ring entries have been exposed
> > +* by guest.
> > +*/
>
> We need to be more verbose here. For example, which load needs to be
> ordered with which load.
>
> The rmb in vhost_get_vq_desc() is used to order the load of avail idx
> and the load of head. It is paired with e.g virtio_wmb() in
> virtqueue_add_split().
>
> vhost_vq_avail_empty() are mostly used as a hint in
> vhost_net_busy_poll() which is under the protection of the vq mutex.
>
> An exception is the tx_can_batch(), but in that case it doesn't even
> want to read the head.

Ok, if it is needed only in that path, maybe we can move the barriers there.

Thanks

>
> Thanks
>
>
> > +   smp_rmb();
> > +   return false;
> > +   }
> >
> > -   return vq->avail_idx == vq->last_avail_idx;
> > +   return true;
> >  }
> >  EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);
> >
> > --
> > 2.44.0
> >




Re: [PATCH v2 2/2] vhost: Add smp_rmb() in vhost_enable_notify()

2024-03-26 Thread Jason Wang
On Wed, Mar 27, 2024 at 7:39 AM Gavin Shan  wrote:
>
> A smp_rmb() has been missed in vhost_enable_notify(), inspired by
> Will Deacon . Otherwise, it's not ensured the
> available ring entries pushed by guest can be observed by vhost
> in time, leading to stale available ring entries fetched by vhost
> in vhost_get_vq_desc(), as reported by Yihuang Yu on NVidia's
> grace-hopper (ARM64) platform.
>
>   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
>   -accel kvm -machine virt,gic-version=host -cpu host  \
>   -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
>   -m 4096M,slots=16,maxmem=64G \
>   -object memory-backend-ram,id=mem0,size=4096M\
>:   \
>   -netdev tap,id=vnet0,vhost=true  \
>   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
>:
>   guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
>   virtio_net virtio0: output.0:id 100 is not a head!
>
> Add the missed smp_rmb() in vhost_enable_notify(). Note that it
> should be safe until vq->avail_idx is changed by commit d3bb267bbdcb
> ("vhost: cache avail index in vhost_enable_notify()").
>
> Fixes: d3bb267bbdcb ("vhost: cache avail index in vhost_enable_notify()")
> Cc:  # v5.18+
> Reported-by: Yihuang Yu 
> Signed-off-by: Gavin Shan 
> ---
>  drivers/vhost/vhost.c | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 00445ab172b3..58f9d6a435f0 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2847,9 +2847,18 @@ bool vhost_enable_notify(struct vhost_dev *dev, struct 
> vhost_virtqueue *vq)
>>avail->idx, r);
> return false;
> }
> +
> vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> +   if (vq->avail_idx != vq->last_avail_idx) {
> +   /* Similar to what's done in vhost_get_vq_desc(), we need
> +* to ensure the available ring entries have been exposed
> +* by guest.
> +*/
> +   smp_rmb();
> +   return true;
> +   }
>
> -   return vq->avail_idx != vq->last_avail_idx;
> +   return false;

So we only care about the case when vhost_enable_notify() returns true.

In that case, I think you want to order with vhost_get_vq_desc():

last_avail_idx = vq->last_avail_idx;

if (vq->avail_idx == vq->last_avail_idx) { /* false */
}

vhost_get_avail_head(vq, _head, last_avail_idx)

Assuming I understand the patch correctly.

Acked-by: Jason Wang 

Thanks

>  }
>  EXPORT_SYMBOL_GPL(vhost_enable_notify);
>
> --
> 2.44.0
>




Re: [PATCH v2 1/2] vhost: Add smp_rmb() in vhost_vq_avail_empty()

2024-03-26 Thread Jason Wang
On Wed, Mar 27, 2024 at 7:39 AM Gavin Shan  wrote:
>
> A smp_rmb() has been missed in vhost_vq_avail_empty(), spotted by
> Will Deacon . Otherwise, it's not ensured the
> available ring entries pushed by guest can be observed by vhost
> in time, leading to stale available ring entries fetched by vhost
> in vhost_get_vq_desc(), as reported by Yihuang Yu on NVidia's
> grace-hopper (ARM64) platform.
>
>   /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64  \
>   -accel kvm -machine virt,gic-version=host -cpu host  \
>   -smp maxcpus=1,cpus=1,sockets=1,clusters=1,cores=1,threads=1 \
>   -m 4096M,slots=16,maxmem=64G \
>   -object memory-backend-ram,id=mem0,size=4096M\
>:   \
>   -netdev tap,id=vnet0,vhost=true  \
>   -device virtio-net-pci,bus=pcie.8,netdev=vnet0,mac=52:54:00:f1:26:b0
>:
>   guest# netperf -H 10.26.1.81 -l 60 -C -c -t UDP_STREAM
>   virtio_net virtio0: output.0:id 100 is not a head!
>
> Add the missed smp_rmb() in vhost_vq_avail_empty(). Note that it
> should be safe until vq->avail_idx is changed by commit 275bf960ac697
> ("vhost: better detection of available buffers").
>
> Fixes: 275bf960ac697 ("vhost: better detection of available buffers")
> Cc:  # v4.11+
> Reported-by: Yihuang Yu 
> Signed-off-by: Gavin Shan 
> ---
>  drivers/vhost/vhost.c | 11 ++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 045f666b4f12..00445ab172b3 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2799,9 +2799,18 @@ bool vhost_vq_avail_empty(struct vhost_dev *dev, 
> struct vhost_virtqueue *vq)
> r = vhost_get_avail_idx(vq, _idx);
> if (unlikely(r))
> return false;
> +
> vq->avail_idx = vhost16_to_cpu(vq, avail_idx);
> +   if (vq->avail_idx != vq->last_avail_idx) {
> +   /* Similar to what's done in vhost_get_vq_desc(), we need
> +* to ensure the available ring entries have been exposed
> +* by guest.
> +*/

We need to be more verbose here. For example, which load needs to be
ordered with which load.

The rmb in vhost_get_vq_desc() is used to order the load of avail idx
and the load of head. It is paired with e.g virtio_wmb() in
virtqueue_add_split().

vhost_vq_avail_empty() are mostly used as a hint in
vhost_net_busy_poll() which is under the protection of the vq mutex.

An exception is the tx_can_batch(), but in that case it doesn't even
want to read the head.

Thanks


> +   smp_rmb();
> +   return false;
> +   }
>
> -   return vq->avail_idx == vq->last_avail_idx;
> +   return true;
>  }
>  EXPORT_SYMBOL_GPL(vhost_vq_avail_empty);
>
> --
> 2.44.0
>




Re: [PATCH v3] vhost/vdpa: Add MSI translation tables to iommu for software-managed MSI

2024-03-21 Thread Jason Wang
On Wed, Mar 20, 2024 at 6:20 PM Wang Rong  wrote:
>
> From: Rong Wang 
>
> Once enable iommu domain for one device, the MSI
> translation tables have to be there for software-managed MSI.
> Otherwise, platform with software-managed MSI without an
> irq bypass function, can not get a correct memory write event
> from pcie, will not get irqs.
> The solution is to obtain the MSI phy base address from
> iommu reserved region, and set it to iommu MSI cookie,
> then translation tables will be created while request irq.
>
> Change log
> --
>
> v1->v2:
> - add resv iotlb to avoid overlap mapping.
> v2->v3:
> - there is no need to export the iommu symbol anymore.
>
> Signed-off-by: Rong Wang 
> ---
>  drivers/vhost/vdpa.c | 59 +---
>  1 file changed, 56 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index ba52d128aeb7..28b56b10372b 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -49,6 +49,7 @@ struct vhost_vdpa {
> struct completion completion;
> struct vdpa_device *vdpa;
> struct hlist_head as[VHOST_VDPA_IOTLB_BUCKETS];
> +   struct vhost_iotlb resv_iotlb;

Is it better to introduce a reserved flag like VHOST_MAP_RESERVED,
meaning the mapping can't be modified by userspace, only by the kernel?

So we don't need to have two IOTLB. But I guess the reason you have
this is because we may have multiple address spaces where the MSI
routing should work for all of them?

Another note: vhost-vDPA supports virtual address mapping, so this
should only work for physical address mapping. E.g. in the case of
SVA, the MSI IOVA is a valid IOVA for the driver/userspace.

> struct device dev;
> struct cdev cdev;
> atomic_t opened;
> @@ -247,6 +248,7 @@ static int _compat_vdpa_reset(struct vhost_vdpa *v)
>  static int vhost_vdpa_reset(struct vhost_vdpa *v)
>  {
> v->in_batch = 0;
> +   vhost_iotlb_reset(>resv_iotlb);

We try hard to avoid this for performance, see this commit:

commit 4398776f7a6d532c466f9e41f601c9a291fac5ef
Author: Si-Wei Liu 
Date:   Sat Oct 21 02:25:15 2023 -0700

vhost-vdpa: introduce IOTLB_PERSIST backend feature bit

Any reason you need to do this?

> return _compat_vdpa_reset(v);
>  }
>
> @@ -1219,10 +1221,15 @@ static int vhost_vdpa_process_iotlb_update(struct 
> vhost_vdpa *v,
> msg->iova + msg->size - 1 > v->range.last)
> return -EINVAL;
>
> +   if (vhost_iotlb_itree_first(>resv_iotlb, msg->iova,
> +   msg->iova + msg->size - 1))
> +   return -EINVAL;
> +
> if (vhost_iotlb_itree_first(iotlb, msg->iova,
> msg->iova + msg->size - 1))
> return -EEXIST;
>
> +
> if (vdpa->use_va)
> return vhost_vdpa_va_map(v, iotlb, msg->iova, msg->size,
>  msg->uaddr, msg->perm);
> @@ -1307,6 +1314,45 @@ static ssize_t vhost_vdpa_chr_write_iter(struct kiocb 
> *iocb,
> return vhost_chr_write_iter(dev, from);
>  }
>
> +static int vhost_vdpa_resv_iommu_region(struct iommu_domain *domain, struct 
> device *dma_dev,
> +   struct vhost_iotlb *resv_iotlb)
> +{
> +   struct list_head dev_resv_regions;
> +   phys_addr_t resv_msi_base = 0;
> +   struct iommu_resv_region *region;
> +   int ret = 0;
> +   bool with_sw_msi = false;
> +   bool with_hw_msi = false;
> +
> +   INIT_LIST_HEAD(_resv_regions);
> +   iommu_get_resv_regions(dma_dev, _resv_regions);
> +
> +   list_for_each_entry(region, _resv_regions, list) {
> +   ret = vhost_iotlb_add_range_ctx(resv_iotlb, region->start,
> +   region->start + region->length - 1,
> +   0, 0, NULL);

I think MSI should be write-only?

> +   if (ret) {
> +   vhost_iotlb_reset(resv_iotlb);

Need to report an error here.

Thanks




Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support

2024-03-12 Thread Jason Wang
On Mon, Mar 11, 2024 at 9:28 PM wangyunjian  wrote:
>
>
>
> > -Original Message-
> > From: Jason Wang [mailto:jasow...@redhat.com]
> > Sent: Monday, March 11, 2024 12:01 PM
> > To: wangyunjian 
> > Cc: Michael S. Tsirkin ; Paolo Abeni ;
> > willemdebruijn.ker...@gmail.com; k...@kernel.org; bj...@kernel.org;
> > magnus.karls...@intel.com; maciej.fijalkow...@intel.com;
> > jonathan.le...@gmail.com; da...@davemloft.net; b...@vger.kernel.org;
> > net...@vger.kernel.org; linux-kernel@vger.kernel.org; k...@vger.kernel.org;
> > virtualizat...@lists.linux.dev; xudingke ; liwei (DT)
> > 
> > Subject: Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support
> >
> > On Mon, Mar 4, 2024 at 9:45 PM wangyunjian 
> > wrote:
> > >
> > >
> > >
> > > > -Original Message-
> > > > From: Michael S. Tsirkin [mailto:m...@redhat.com]
> > > > Sent: Friday, March 1, 2024 7:53 PM
> > > > To: wangyunjian 
> > > > Cc: Paolo Abeni ;
> > > > willemdebruijn.ker...@gmail.com; jasow...@redhat.com;
> > > > k...@kernel.org; bj...@kernel.org; magnus.karls...@intel.com;
> > > > maciej.fijalkow...@intel.com; jonathan.le...@gmail.com;
> > > > da...@davemloft.net; b...@vger.kernel.org; net...@vger.kernel.org;
> > > > linux-kernel@vger.kernel.org; k...@vger.kernel.org;
> > > > virtualizat...@lists.linux.dev; xudingke ;
> > > > liwei (DT) 
> > > > Subject: Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy
> > > > support
> > > >
> > > > On Fri, Mar 01, 2024 at 11:45:52AM +, wangyunjian wrote:
> > > > > > -Original Message-
> > > > > > From: Paolo Abeni [mailto:pab...@redhat.com]
> > > > > > Sent: Thursday, February 29, 2024 7:13 PM
> > > > > > To: wangyunjian ; m...@redhat.com;
> > > > > > willemdebruijn.ker...@gmail.com; jasow...@redhat.com;
> > > > > > k...@kernel.org; bj...@kernel.org; magnus.karls...@intel.com;
> > > > > > maciej.fijalkow...@intel.com; jonathan.le...@gmail.com;
> > > > > > da...@davemloft.net
> > > > > > Cc: b...@vger.kernel.org; net...@vger.kernel.org;
> > > > > > linux-kernel@vger.kernel.org; k...@vger.kernel.org;
> > > > > > virtualizat...@lists.linux.dev; xudingke ;
> > > > > > liwei (DT) 
> > > > > > Subject: Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy
> > > > > > support
> > > > > >
> > > > > > On Wed, 2024-02-28 at 19:05 +0800, Yunjian Wang wrote:
> > > > > > > @@ -2661,6 +2776,54 @@ static int tun_ptr_peek_len(void *ptr)
> > > > > > > }
> > > > > > >  }
> > > > > > >
> > > > > > > +static void tun_peek_xsk(struct tun_file *tfile) {
> > > > > > > +   struct xsk_buff_pool *pool;
> > > > > > > +   u32 i, batch, budget;
> > > > > > > +   void *frame;
> > > > > > > +
> > > > > > > +   if (!ptr_ring_empty(>tx_ring))
> > > > > > > +   return;
> > > > > > > +
> > > > > > > +   spin_lock(>pool_lock);
> > > > > > > +   pool = tfile->xsk_pool;
> > > > > > > +   if (!pool) {
> > > > > > > +   spin_unlock(>pool_lock);
> > > > > > > +   return;
> > > > > > > +   }
> > > > > > > +
> > > > > > > +   if (tfile->nb_descs) {
> > > > > > > +   xsk_tx_completed(pool, tfile->nb_descs);
> > > > > > > +   if (xsk_uses_need_wakeup(pool))
> > > > > > > +   xsk_set_tx_need_wakeup(pool);
> > > > > > > +   }
> > > > > > > +
> > > > > > > +   spin_lock(>tx_ring.producer_lock);
> > > > > > > +   budget = min_t(u32, tfile->tx_ring.size,
> > > > > > > + TUN_XDP_BATCH);
> > > > > > > +
> > > > > > > +   batch = xsk_tx_peek_release_desc_batch(pool, budget);
> > > > > > > +   if (!batch) {
> > > > > >
> > > > > > This branch looks like an unneeded "optimization". 

Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support

2024-03-10 Thread Jason Wang
On Mon, Mar 4, 2024 at 9:45 PM wangyunjian  wrote:
>
>
>
> > -Original Message-
> > From: Michael S. Tsirkin [mailto:m...@redhat.com]
> > Sent: Friday, March 1, 2024 7:53 PM
> > To: wangyunjian 
> > Cc: Paolo Abeni ; willemdebruijn.ker...@gmail.com;
> > jasow...@redhat.com; k...@kernel.org; bj...@kernel.org;
> > magnus.karls...@intel.com; maciej.fijalkow...@intel.com;
> > jonathan.le...@gmail.com; da...@davemloft.net; b...@vger.kernel.org;
> > net...@vger.kernel.org; linux-kernel@vger.kernel.org; k...@vger.kernel.org;
> > virtualizat...@lists.linux.dev; xudingke ; liwei (DT)
> > 
> > Subject: Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support
> >
> > On Fri, Mar 01, 2024 at 11:45:52AM +, wangyunjian wrote:
> > > > -Original Message-
> > > > From: Paolo Abeni [mailto:pab...@redhat.com]
> > > > Sent: Thursday, February 29, 2024 7:13 PM
> > > > To: wangyunjian ; m...@redhat.com;
> > > > willemdebruijn.ker...@gmail.com; jasow...@redhat.com;
> > > > k...@kernel.org; bj...@kernel.org; magnus.karls...@intel.com;
> > > > maciej.fijalkow...@intel.com; jonathan.le...@gmail.com;
> > > > da...@davemloft.net
> > > > Cc: b...@vger.kernel.org; net...@vger.kernel.org;
> > > > linux-kernel@vger.kernel.org; k...@vger.kernel.org;
> > > > virtualizat...@lists.linux.dev; xudingke ;
> > > > liwei (DT) 
> > > > Subject: Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy
> > > > support
> > > >
> > > > On Wed, 2024-02-28 at 19:05 +0800, Yunjian Wang wrote:
> > > > > @@ -2661,6 +2776,54 @@ static int tun_ptr_peek_len(void *ptr)
> > > > > }
> > > > >  }
> > > > >
> > > > > +static void tun_peek_xsk(struct tun_file *tfile) {
> > > > > +   struct xsk_buff_pool *pool;
> > > > > +   u32 i, batch, budget;
> > > > > +   void *frame;
> > > > > +
> > > > > +   if (!ptr_ring_empty(>tx_ring))
> > > > > +   return;
> > > > > +
> > > > > +   spin_lock(>pool_lock);
> > > > > +   pool = tfile->xsk_pool;
> > > > > +   if (!pool) {
> > > > > +   spin_unlock(>pool_lock);
> > > > > +   return;
> > > > > +   }
> > > > > +
> > > > > +   if (tfile->nb_descs) {
> > > > > +   xsk_tx_completed(pool, tfile->nb_descs);
> > > > > +   if (xsk_uses_need_wakeup(pool))
> > > > > +   xsk_set_tx_need_wakeup(pool);
> > > > > +   }
> > > > > +
> > > > > +   spin_lock(>tx_ring.producer_lock);
> > > > > +   budget = min_t(u32, tfile->tx_ring.size, TUN_XDP_BATCH);
> > > > > +
> > > > > +   batch = xsk_tx_peek_release_desc_batch(pool, budget);
> > > > > +   if (!batch) {
> > > >
> > > > This branch looks like an unneeded "optimization". The generic loop
> > > > below should have the same effect with no measurable perf delta - and
> > smaller code.
> > > > Just remove this.
> > > >
> > > > > +   tfile->nb_descs = 0;
> > > > > +   spin_unlock(>tx_ring.producer_lock);
> > > > > +   spin_unlock(>pool_lock);
> > > > > +   return;
> > > > > +   }
> > > > > +
> > > > > +   tfile->nb_descs = batch;
> > > > > +   for (i = 0; i < batch; i++) {
> > > > > +   /* Encode the XDP DESC flag into lowest bit for 
> > > > > consumer to
> > differ
> > > > > +* XDP desc from XDP buffer and sk_buff.
> > > > > +*/
> > > > > +   frame = tun_xdp_desc_to_ptr(>tx_descs[i]);
> > > > > +   /* The budget must be less than or equal to 
> > > > > tx_ring.size,
> > > > > +* so enqueuing will not fail.
> > > > > +*/
> > > > > +   __ptr_ring_produce(>tx_ring, frame);
> > > > > +   }
> > > > > +   spin_unlock(>tx_ring.producer_lock);
> > > > > +   spin_unlock(>pool_lock);
> > > >
> > > > More related to the general design: it looks wrong. What if
> > > > get_rx_bufs() will fail (ENOBUF) after successful peeking? With no
> > > > more incoming packets, later peek will return 0 and it looks like
> > > > that the half-processed packets will stay in the ring forever???
> > > >
> > > > I think the 'ring produce' part should be moved into tun_do_read().
> > >
> > > Currently, the vhost-net obtains a batch descriptors/sk_buffs from the
> > > ptr_ring and enqueue the batch descriptors/sk_buffs to the
> > > virtqueue'queue, and then consumes the descriptors/sk_buffs from the
> > > virtqueue'queue in sequence. As a result, TUN does not know whether
> > > the batch descriptors have been used up, and thus does not know when to
> > return the batch descriptors.
> > >
> > > So, I think it's reasonable that when vhost-net checks ptr_ring is
> > > empty, it calls peek_len to get new xsk's descs and return the 
> > > descriptors.
> > >
> > > Thanks
> >
> > What you need to think about is that if you peek, another call in parallel 
> > can get
> > the same value at the same time.
>
> Thank you. I have identified a problem. The tx_descs array was created within 
> xsk's 

Re: [PATCH net-next v6 5/5] tools: virtio: introduce vhost_net_test

2024-03-05 Thread Jason Wang
On Tue, Mar 5, 2024 at 5:47 PM Paolo Abeni  wrote:
>
> On Wed, 2024-02-28 at 17:30 +0800, Yunsheng Lin wrote:
> > introduce vhost_net_test for both vhost_net tx and rx basing
> > on virtio_test to test vhost_net changing in the kernel.
> >
> > Steps for vhost_net tx testing:
> > 1. Prepare a out buf.
> > 2. Kick the vhost_net to do tx processing.
> > 3. Do the receiving in the tun side.
> > 4. verify the data received by tun is correct.
> >
> > Steps for vhost_net rx testing:
> > 1. Prepare a in buf.
> > 2. Do the sending in the tun side.
> > 3. Kick the vhost_net to do rx processing.
> > 4. verify the data received by vhost_net is correct.
> >
> > Signed-off-by: Yunsheng Lin 
>
> @Jason: AFAICS this addresses the points you raised on v5, could you
> please have a look?
>
> Thanks!

Looks good to me.

Acked-by: Jason Wang 

Thanks

>
> Paolo
>
>




Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support

2024-03-05 Thread Jason Wang
On Sat, Mar 2, 2024 at 2:40 AM Willem de Bruijn
 wrote:
>
> Maciej Fijalkowski wrote:
> > On Wed, Feb 28, 2024 at 07:05:56PM +0800, Yunjian Wang wrote:
> > > This patch set allows TUN to support the AF_XDP Tx zero-copy feature,
> > > which can significantly reduce CPU utilization for XDP programs.
> >
> > Why no Rx ZC support though? What will happen if I try rxdrop xdpsock
> > against tun with this patch? You clearly allow for that.
>
> This is AF_XDP receive zerocopy, right?
>
> The naming is always confusing with tun, but even though from a tun
> PoV this happens on ndo_start_xmit, it is the AF_XDP equivalent to
> tun_put_user.
>
> So the implementation is more like other device's Rx ZC.
>
> I would have preferred that name, but I think Jason asked for this
> and given tun's weird status, there is something bo said for either.
>

>From the the view of the AF_XDP userspace program, it's the TX path,
and as you said it happens on the TUN xmit path as well. When using
with a VM, it's the RX path.

So TX seems better.

Thanks




Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support

2024-03-05 Thread Jason Wang
On Mon, Mar 4, 2024 at 7:24 PM wangyunjian  wrote:
>
>
>
> > -Original Message-
> > From: Jason Wang [mailto:jasow...@redhat.com]
> > Sent: Monday, March 4, 2024 2:56 PM
> > To: wangyunjian 
> > Cc: m...@redhat.com; willemdebruijn.ker...@gmail.com; k...@kernel.org;
> > bj...@kernel.org; magnus.karls...@intel.com; maciej.fijalkow...@intel.com;
> > jonathan.le...@gmail.com; da...@davemloft.net; b...@vger.kernel.org;
> > net...@vger.kernel.org; linux-kernel@vger.kernel.org; k...@vger.kernel.org;
> > virtualizat...@lists.linux.dev; xudingke ; liwei (DT)
> > 
> > Subject: Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support
> >
> > On Wed, Feb 28, 2024 at 7:06 PM Yunjian Wang 
> > wrote:
> > >
> > > This patch set allows TUN to support the AF_XDP Tx zero-copy feature,
> > > which can significantly reduce CPU utilization for XDP programs.
> > >
> > > Since commit fc72d1d54dd9 ("tuntap: XDP transmission"), the pointer
> > > ring has been utilized to queue different types of pointers by
> > > encoding the type into the lower bits. Therefore, we introduce a new
> > > flag, TUN_XDP_DESC_FLAG(0x2UL), which allows us to enqueue XDP
> > > descriptors and differentiate them from XDP buffers and sk_buffs.
> > > Additionally, a spin lock is added for enabling and disabling operations 
> > > on the
> > xsk pool.
> > >
> > > The performance testing was performed on a Intel E5-2620 2.40GHz
> > machine.
> > > Traffic were generated/send through TUN(testpmd txonly with AF_XDP) to
> > > VM (testpmd rxonly in guest).
> > >
> > > +--+-+-+-+
> > > |  |   copy  |zero-copy| speedup |
> > > +--+-+-+-+
> > > | UDP  |   Mpps  |   Mpps  |%|
> > > | 64   |   2.5   |   4.0   |   60%   |
> > > | 512  |   2.1   |   3.6   |   71%   |
> > > | 1024 |   1.9   |   3.3   |   73%   |
> > > +--+-+-+-+
> > >
> > > Signed-off-by: Yunjian Wang 
> > > ---
> > >  drivers/net/tun.c  | 177
> > +++--
> > >  drivers/vhost/net.c|   4 +
> > >  include/linux/if_tun.h |  32 
> > >  3 files changed, 208 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/drivers/net/tun.c b/drivers/net/tun.c index
> > > bc80fc1d576e..7f4ff50b532c 100644
> > > --- a/drivers/net/tun.c
> > > +++ b/drivers/net/tun.c
> > > @@ -63,6 +63,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > >  #include 
> > >  #include 
> > >  #include 
> > > @@ -86,6 +87,7 @@ static void tun_default_link_ksettings(struct
> > net_device *dev,
> > >struct
> > ethtool_link_ksettings
> > > *cmd);
> > >
> > >  #define TUN_RX_PAD (NET_IP_ALIGN + NET_SKB_PAD)
> > > +#define TUN_XDP_BATCH 64
> > >
> > >  /* TUN device flags */
> > >
> > > @@ -146,6 +148,9 @@ struct tun_file {
> > > struct tun_struct *detached;
> > > struct ptr_ring tx_ring;
> > > struct xdp_rxq_info xdp_rxq;
> > > +   struct xsk_buff_pool *xsk_pool;
> > > +   spinlock_t pool_lock;   /* Protects xsk pool enable/disable */
> > > +   u32 nb_descs;
> > >  };
> > >
> > >  struct tun_page {
> > > @@ -614,6 +619,8 @@ void tun_ptr_free(void *ptr)
> > > struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
> > >
> > > xdp_return_frame(xdpf);
> > > +   } else if (tun_is_xdp_desc_frame(ptr)) {
> > > +   return;
> > > } else {
> > > __skb_array_destroy_skb(ptr);
> > > }
> > > @@ -631,6 +638,37 @@ static void tun_queue_purge(struct tun_file *tfile)
> > > skb_queue_purge(&tfile->sk.sk_error_queue);
> > >  }
> > >
> > > +static void tun_set_xsk_pool(struct tun_file *tfile, struct
> > > +xsk_buff_pool *pool) {
> > > +   if (!pool)
> > > +   return;
> > > +
> > > +   spin_lock(&tfile->pool_lock);
> > > +   xsk_pool_set_rxq_info(pool, &tfile->xdp_rxq);
> > > +   tfile->xsk_pool = pool;
> > > +   spin_unlock(&tfile->pool_lock); }
> > > +
> > > +static void tun_clean_xsk_pool(struct tun_file *tfile) 

Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support

2024-03-03 Thread Jason Wang
On Wed, Feb 28, 2024 at 7:06 PM Yunjian Wang  wrote:
>
> This patch set allows TUN to support the AF_XDP Tx zero-copy feature,
> which can significantly reduce CPU utilization for XDP programs.
>
> Since commit fc72d1d54dd9 ("tuntap: XDP transmission"), the pointer
> ring has been utilized to queue different types of pointers by encoding
> the type into the lower bits. Therefore, we introduce a new flag,
> TUN_XDP_DESC_FLAG(0x2UL), which allows us to enqueue XDP descriptors
> and differentiate them from XDP buffers and sk_buffs. Additionally, a
> spin lock is added for enabling and disabling operations on the xsk pool.
>
> The performance testing was performed on a Intel E5-2620 2.40GHz machine.
> Traffic were generated/send through TUN(testpmd txonly with AF_XDP)
> to VM (testpmd rxonly in guest).
>
> +--+-+-+-+
> |  |   copy  |zero-copy| speedup |
> +--+-+-+-+
> | UDP  |   Mpps  |   Mpps  |%|
> | 64   |   2.5   |   4.0   |   60%   |
> | 512  |   2.1   |   3.6   |   71%   |
> | 1024 |   1.9   |   3.3   |   73%   |
> +--+-+-+-+
>
> Signed-off-by: Yunjian Wang 
> ---
>  drivers/net/tun.c  | 177 +++--
>  drivers/vhost/net.c|   4 +
>  include/linux/if_tun.h |  32 
>  3 files changed, 208 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index bc80fc1d576e..7f4ff50b532c 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -63,6 +63,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -86,6 +87,7 @@ static void tun_default_link_ksettings(struct net_device 
> *dev,
>struct ethtool_link_ksettings *cmd);
>
>  #define TUN_RX_PAD (NET_IP_ALIGN + NET_SKB_PAD)
> +#define TUN_XDP_BATCH 64
>
>  /* TUN device flags */
>
> @@ -146,6 +148,9 @@ struct tun_file {
> struct tun_struct *detached;
> struct ptr_ring tx_ring;
> struct xdp_rxq_info xdp_rxq;
> +   struct xsk_buff_pool *xsk_pool;
> +   spinlock_t pool_lock;   /* Protects xsk pool enable/disable */
> +   u32 nb_descs;
>  };
>
>  struct tun_page {
> @@ -614,6 +619,8 @@ void tun_ptr_free(void *ptr)
> struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
>
> xdp_return_frame(xdpf);
> +   } else if (tun_is_xdp_desc_frame(ptr)) {
> +   return;
> } else {
> __skb_array_destroy_skb(ptr);
> }
> @@ -631,6 +638,37 @@ static void tun_queue_purge(struct tun_file *tfile)
> skb_queue_purge(&tfile->sk.sk_error_queue);
>  }
>
> +static void tun_set_xsk_pool(struct tun_file *tfile, struct xsk_buff_pool 
> *pool)
> +{
> +   if (!pool)
> +   return;
> +
> +   spin_lock(&tfile->pool_lock);
> +   xsk_pool_set_rxq_info(pool, &tfile->xdp_rxq);
> +   tfile->xsk_pool = pool;
> +   spin_unlock(&tfile->pool_lock);
> +}
> +
> +static void tun_clean_xsk_pool(struct tun_file *tfile)
> +{
> +   spin_lock(&tfile->pool_lock);
> +   if (tfile->xsk_pool) {
> +   void *ptr;
> +
> +   while ((ptr = ptr_ring_consume(&tfile->tx_ring)) != NULL)
> +   tun_ptr_free(ptr);
> +
> +   if (tfile->nb_descs) {
> +   xsk_tx_completed(tfile->xsk_pool, tfile->nb_descs);
> +   if (xsk_uses_need_wakeup(tfile->xsk_pool))
> +   xsk_set_tx_need_wakeup(tfile->xsk_pool);
> +   tfile->nb_descs = 0;
> +   }
> +   tfile->xsk_pool = NULL;
> +   }
> +   spin_unlock(&tfile->pool_lock);
> +}
> +
>  static void __tun_detach(struct tun_file *tfile, bool clean)
>  {
> struct tun_file *ntfile;
> @@ -648,6 +686,11 @@ static void __tun_detach(struct tun_file *tfile, bool 
> clean)
> u16 index = tfile->queue_index;
> BUG_ON(index >= tun->numqueues);
>
> +   ntfile = rtnl_dereference(tun->tfiles[tun->numqueues - 1]);
> +   /* Stop xsk zc xmit */
> +   tun_clean_xsk_pool(tfile);
> +   tun_clean_xsk_pool(ntfile);
> +
> rcu_assign_pointer(tun->tfiles[index],
>tun->tfiles[tun->numqueues - 1]);
> ntfile = rtnl_dereference(tun->tfiles[index]);
> @@ -668,6 +711,7 @@ static void __tun_detach(struct tun_file *tfile, bool 
> clean)
> tun_flow_delete_by_queue(tun, tun->numqueues + 1);
> /* Drop read queue */
> tun_queue_purge(tfile);
> +   tun_set_xsk_pool(ntfile, xsk_get_pool_from_qid(tun->dev, 
> index));
> tun_set_real_num_queues(tun);
> } else if (tfile->detached && clean) {
> tun = tun_enable_queue(tfile);
> @@ -801,6 +845,7 @@ static int tun_attach(struct tun_struct *tun, struct file 
> *file,
>
> if 

Re: [PATCH net-next v4 2/2] virtio-net: add cond_resched() to the command waiting loop

2024-02-25 Thread Jason Wang
On Fri, Feb 23, 2024 at 3:22 AM Michael S. Tsirkin  wrote:
>
> On Tue, Jul 25, 2023 at 11:03:11AM +0800, Jason Wang wrote:
> > On Mon, Jul 24, 2023 at 3:18 PM Michael S. Tsirkin  wrote:
> > >
> > > On Mon, Jul 24, 2023 at 02:52:49PM +0800, Jason Wang wrote:
> > > > On Mon, Jul 24, 2023 at 2:46 PM Michael S. Tsirkin  
> > > > wrote:
> > > > >
> > > > > On Fri, Jul 21, 2023 at 10:18:03PM +0200, Maxime Coquelin wrote:
> > > > > >
> > > > > >
> > > > > > On 7/21/23 17:10, Michael S. Tsirkin wrote:
> > > > > > > On Fri, Jul 21, 2023 at 04:58:04PM +0200, Maxime Coquelin wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > On 7/21/23 16:45, Michael S. Tsirkin wrote:
> > > > > > > > > On Fri, Jul 21, 2023 at 04:37:00PM +0200, Maxime Coquelin 
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On 7/20/23 23:02, Michael S. Tsirkin wrote:
> > > > > > > > > > > On Thu, Jul 20, 2023 at 01:26:20PM -0700, Shannon Nelson 
> > > > > > > > > > > wrote:
> > > > > > > > > > > > On 7/20/23 1:38 AM, Jason Wang wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Adding cond_resched() to the command waiting loop for 
> > > > > > > > > > > > > a better
> > > > > > > > > > > > > co-operation with the scheduler. This allows to give 
> > > > > > > > > > > > > CPU a breath to
> > > > > > > > > > > > > run other task(workqueue) instead of busy looping 
> > > > > > > > > > > > > when preemption is
> > > > > > > > > > > > > not allowed on a device whose CVQ might be slow.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Signed-off-by: Jason Wang 
> > > > > > > > > > > >
> > > > > > > > > > > > This still leaves hung processes, but at least it 
> > > > > > > > > > > > doesn't pin the CPU any
> > > > > > > > > > > > more.  Thanks.
> > > > > > > > > > > > Reviewed-by: Shannon Nelson 
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I'd like to see a full solution
> > > > > > > > > > > 1- block until interrupt
> > > > > > > > > >
> > > > > > > > > > Would it make sense to also have a timeout?
> > > > > > > > > > And when timeout expires, set FAILED bit in device status?
> > > > > > > > >
> > > > > > > > > virtio spec does not set any limits on the timing of vq
> > > > > > > > > processing.
> > > > > > > >
> > > > > > > > Indeed, but I thought the driver could decide it is too long 
> > > > > > > > for it.
> > > > > > > >
> > > > > > > > The issue is we keep waiting with rtnl locked, it can quickly 
> > > > > > > > make the
> > > > > > > > system unusable.
> > > > > > >
> > > > > > > if this is a problem we should find a way not to keep rtnl
> > > > > > > locked indefinitely.
> > > > > >
> > > > > > From the tests I have done, I think it is. With OVS, a 
> > > > > > reconfiguration is
> > > > > > performed when the VDUSE device is added, and when a MLX5 device is
> > > > > > in the same bridge, it ends up doing an ioctl() that tries to take 
> > > > > > the
> > > > > > rtnl lock. In this configuration, it is not possible to kill OVS 
> > > > > > because
> > > > > > it is stuck trying to acquire rtnl lock for mlx5 that is held by 
> > > > > > virtio-
> > > > > > net.
> > > > >
> > > > > So fo

Re: [PATCH net-next v5] virtio_net: Support RX hash XDP hint

2024-02-25 Thread Jason Wang
On Fri, Feb 23, 2024 at 9:42 AM Xuan Zhuo  wrote:
>
> On Fri, 09 Feb 2024 13:57:25 +0100, Paolo Abeni  wrote:
> > On Fri, 2024-02-09 at 18:39 +0800, Liang Chen wrote:
> > > On Wed, Feb 7, 2024 at 10:27 PM Paolo Abeni  wrote:
> > > >
> > > > On Wed, 2024-02-07 at 10:54 +0800, Liang Chen wrote:
> > > > > On Tue, Feb 6, 2024 at 6:44 PM Paolo Abeni  wrote:
> > > > > >
> > > > > > On Sat, 2024-02-03 at 10:56 +0800, Liang Chen wrote:
> > > > > > > On Sat, Feb 3, 2024 at 12:20 AM Jesper Dangaard Brouer 
> > > > > > >  wrote:
> > > > > > > > On 02/02/2024 13.11, Liang Chen wrote:
> > > > > > [...]
> > > > > > > > > @@ -1033,6 +1039,16 @@ static void put_xdp_frags(struct 
> > > > > > > > > xdp_buff *xdp)
> > > > > > > > >   }
> > > > > > > > >   }
> > > > > > > > >
> > > > > > > > > +static void virtnet_xdp_save_rx_hash(struct virtnet_xdp_buff 
> > > > > > > > > *virtnet_xdp,
> > > > > > > > > +  struct net_device *dev,
> > > > > > > > > +  struct 
> > > > > > > > > virtio_net_hdr_v1_hash *hdr_hash)
> > > > > > > > > +{
> > > > > > > > > + if (dev->features & NETIF_F_RXHASH) {
> > > > > > > > > + virtnet_xdp->hash_value = hdr_hash->hash_value;
> > > > > > > > > + virtnet_xdp->hash_report = 
> > > > > > > > > hdr_hash->hash_report;
> > > > > > > > > + }
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > >
> > > > > > > > Would it be possible to store a pointer to hdr_hash in 
> > > > > > > > virtnet_xdp_buff,
> > > > > > > > with the purpose of delaying extracting this, until and only if 
> > > > > > > > XDP
> > > > > > > > bpf_prog calls the kfunc?
> > > > > > > >
> > > > > > >
> > > > > > > That seems to be the way v1 works,
> > > > > > > https://lore.kernel.org/all/20240122102256.261374-1-liangchen.li...@gmail.com/
> > > > > > > . But it was pointed out that the inline header may be 
> > > > > > > overwritten by
> > > > > > > the xdp prog, so the hash is copied out to maintain its integrity.
> > > > > >
> > > > > > Why? isn't XDP supposed to get write access only to the pkt
> > > > > > contents/buffer?
> > > > > >
> > > > >
> > > > > Normally, an XDP program accesses only the packet data. However,
> > > > > there's also an XDP RX Metadata area, referenced by the data_meta
> > > > > pointer. This pointer can be adjusted with bpf_xdp_adjust_meta to
> > > > > point somewhere ahead of the data buffer, thereby granting the XDP
> > > > > program access to the virtio header located immediately before the
> > > >
> > > > AFAICS bpf_xdp_adjust_meta() does not allow moving the meta_data before
> > > > xdp->data_hard_start:
> > > >
> > > > https://elixir.bootlin.com/linux/latest/source/net/core/filter.c#L4210
> > > >
> > > > and virtio net set such field after the virtio_net_hdr:
> > > >
> > > > https://elixir.bootlin.com/linux/latest/source/drivers/net/virtio_net.c#L1218
> > > > https://elixir.bootlin.com/linux/latest/source/drivers/net/virtio_net.c#L1420
> > > >
> > > > I don't see how the virtio hdr could be touched? Possibly even more
> > > > important: if such thing is possible, I think is should be somewhat
> > > > denied (for the same reason an H/W nic should prevent XDP from
> > > > modifying its own buffer descriptor).
> > >
> > > Thank you for highlighting this concern. The header layout differs
> > > slightly between small and mergeable mode. Taking 'mergeable mode' as
> > > an example, after calling xdp_prepare_buff the layout of xdp_buff
> > > would be as depicted in the diagram below,
> > >
> > >   buf
> > >|
> > >v
> > > +--+--+-+
> > > | xdp headroom | virtio header| packet  |
> > > | (256 bytes)  | (20 bytes)   | content |
> > > +--+--+-+
> > > ^ ^
> > > | |
> > >  data_hard_startdata
> > >   data_meta
> > >
> > > If 'bpf_xdp_adjust_meta' repositions the 'data_meta' pointer a little
> > > towards 'data_hard_start', it would point to the inline header, thus
> > > potentially allowing the XDP program to access the inline header.
> >
> > I see. That layout was completely unexpected to me.
> >
> > AFAICS the virtio_net driver tries to avoid accessing/using the
> > virtio_net_hdr after the XDP program execution, so nothing tragic
> > should happen.
> >
> > @Michael, @Jason, I guess the above is like that by design? Isn't it a
> > bit fragile?

Yes.

>
> YES. We process it carefully. That brings some troubles, we hope to put the
> virtio-net header to the vring desc like other NICs. But that is a big 
> project.

Yes, and we still need to support the "legacy" layout.

>
> I think this patch is ok, this can be merged to net-next firstly.

+1

Thanks

>
> Thanks.
>
>
> >
> > Thanks!
> >
> > Paolo
> >
>




Re: [PATCH] vhost-vdpa: fail enabling virtqueue in certain conditions

2024-02-06 Thread Jason Wang
On Tue, Feb 6, 2024 at 10:52 PM Stefano Garzarella  wrote:
>
> If VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK is not negotiated, we expect
> the driver to enable virtqueue before setting DRIVER_OK. If the driver
> tries anyway, better to fail right away as soon as we get the ioctl.
> Let's also update the documentation to make it clearer.
>
> We had a problem in QEMU for not meeting this requirement, see
> https://lore.kernel.org/qemu-devel/20240202132521.32714-1-kw...@redhat.com/

Maybe it's better to only enable cvq when the backend supports
VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK. Eugenio, any comment on this?

>
> Fixes: 9f09fd6171fe ("vdpa: accept VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK 
> backend feature")
> Cc: epere...@redhat.com
> Signed-off-by: Stefano Garzarella 
> ---
>  include/uapi/linux/vhost_types.h | 3 ++-
>  drivers/vhost/vdpa.c | 4 
>  2 files changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/include/uapi/linux/vhost_types.h 
> b/include/uapi/linux/vhost_types.h
> index d7656908f730..5df49b6021a7 100644
> --- a/include/uapi/linux/vhost_types.h
> +++ b/include/uapi/linux/vhost_types.h
> @@ -182,7 +182,8 @@ struct vhost_vdpa_iova_range {
>  /* Device can be resumed */
>  #define VHOST_BACKEND_F_RESUME  0x5
>  /* Device supports the driver enabling virtqueues both before and after
> - * DRIVER_OK
> + * DRIVER_OK. If this feature is not negotiated, the virtqueues must be
> + * enabled before setting DRIVER_OK.
>   */
>  #define VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK  0x6
>  /* Device may expose the virtqueue's descriptor area, driver area and
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index bc4a51e4638b..1fba305ba8c1 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -651,6 +651,10 @@ static long vhost_vdpa_vring_ioctl(struct vhost_vdpa *v, 
> unsigned int cmd,
> case VHOST_VDPA_SET_VRING_ENABLE:
> if (copy_from_user(&s, argp, sizeof(s)))
> return -EFAULT;
> +   if (!vhost_backend_has_feature(vq,
> +   VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK) &&
> +   (ops->get_status(vdpa) & VIRTIO_CONFIG_S_DRIVER_OK))
> +   return -EINVAL;

As discussed, without VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK, we don't
know if parents can do vq_ready after driver_ok.

So maybe we need to keep this behaviour to unbreak some "legacy" userspace?

For example ifcvf did:

static void ifcvf_vdpa_set_vq_ready(struct vdpa_device *vdpa_dev,
u16 qid, bool ready)
{
  struct ifcvf_hw *vf = vdpa_to_vf(vdpa_dev);

ifcvf_set_vq_ready(vf, qid, ready);
}

And it did:

void ifcvf_set_vq_ready(struct ifcvf_hw *hw, u16 qid, bool ready)
{
struct virtio_pci_common_cfg __iomem *cfg = hw->common_cfg;

vp_iowrite16(qid, &cfg->queue_select);
vp_iowrite16(ready, &cfg->queue_enable);
}

Though it didn't advertise VHOST_BACKEND_F_ENABLE_AFTER_DRIVER_OK?

Adding LingShan for more thought.

Thanks

> ops->set_vq_ready(vdpa, idx, s.num);
> return 0;
> case VHOST_VDPA_GET_VRING_GROUP:
> --
> 2.43.0
>




Re: [PATCH net-next v5 5/5] tools: virtio: introduce vhost_net_test

2024-02-05 Thread Jason Wang
On Mon, Feb 5, 2024 at 8:46 PM Yunsheng Lin  wrote:
>
> introduce vhost_net_test for both vhost_net tx and rx basing
> on virtio_test to test vhost_net changing in the kernel.
>
> Steps for vhost_net tx testing:
> 1. Prepare a out buf.
> 2. Kick the vhost_net to do tx processing.
> 3. Do the receiving in the tun side.
> 4. verify the data received by tun is correct.
>
> Steps for vhost_net rx testing:
> 1. Prepare a in buf.
> 2. Do the sending in the tun side.
> 3. Kick the vhost_net to do rx processing.
> 4. verify the data received by vhost_net is correct.
>
> Signed-off-by: Yunsheng Lin 
> ---
>  tools/virtio/.gitignore|   1 +
>  tools/virtio/Makefile  |   8 +-
>  tools/virtio/linux/virtio_config.h |   4 +
>  tools/virtio/vhost_net_test.c  | 536 +
>  4 files changed, 546 insertions(+), 3 deletions(-)
>  create mode 100644 tools/virtio/vhost_net_test.c
>
> diff --git a/tools/virtio/.gitignore b/tools/virtio/.gitignore
> index 9934d48d9a55..7e47b281c442 100644
> --- a/tools/virtio/.gitignore
> +++ b/tools/virtio/.gitignore
> @@ -1,5 +1,6 @@
>  # SPDX-License-Identifier: GPL-2.0-only
>  *.d
>  virtio_test
> +vhost_net_test
>  vringh_test
>  virtio-trace/trace-agent
> diff --git a/tools/virtio/Makefile b/tools/virtio/Makefile
> index d128925980e0..e25e99c1c3b7 100644
> --- a/tools/virtio/Makefile
> +++ b/tools/virtio/Makefile
> @@ -1,8 +1,9 @@
>  # SPDX-License-Identifier: GPL-2.0
>  all: test mod
> -test: virtio_test vringh_test
> +test: virtio_test vringh_test vhost_net_test
>  virtio_test: virtio_ring.o virtio_test.o
>  vringh_test: vringh_test.o vringh.o virtio_ring.o
> +vhost_net_test: virtio_ring.o vhost_net_test.o
>
>  try-run = $(shell set -e;  \
> if ($(1)) >/dev/null 2>&1;  \
> @@ -49,6 +50,7 @@ oot-clean: OOT_BUILD+=clean
>
>  .PHONY: all test mod clean vhost oot oot-clean oot-build
>  clean:
> -   ${RM} *.o vringh_test virtio_test vhost_test/*.o vhost_test/.*.cmd \
> -  vhost_test/Module.symvers vhost_test/modules.order *.d
> +   ${RM} *.o vringh_test virtio_test vhost_net_test vhost_test/*.o \
> +  vhost_test/.*.cmd vhost_test/Module.symvers \
> +  vhost_test/modules.order *.d
>  -include *.d
> diff --git a/tools/virtio/linux/virtio_config.h 
> b/tools/virtio/linux/virtio_config.h
> index 2a8a70e2a950..42a564f22f2d 100644
> --- a/tools/virtio/linux/virtio_config.h
> +++ b/tools/virtio/linux/virtio_config.h
> @@ -1,4 +1,6 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef LINUX_VIRTIO_CONFIG_H
> +#define LINUX_VIRTIO_CONFIG_H
>  #include 
>  #include 
>  #include 
> @@ -95,3 +97,5 @@ static inline __virtio64 cpu_to_virtio64(struct 
> virtio_device *vdev, u64 val)
>  {
> return __cpu_to_virtio64(virtio_is_little_endian(vdev), val);
>  }
> +
> +#endif
> diff --git a/tools/virtio/vhost_net_test.c b/tools/virtio/vhost_net_test.c
> new file mode 100644
> index ..6c41204e6707
> --- /dev/null
> +++ b/tools/virtio/vhost_net_test.c
> @@ -0,0 +1,536 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#define _GNU_SOURCE
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define HDR_LENsizeof(struct virtio_net_hdr_mrg_rxbuf)
> +#define TEST_BUF_LEN   256
> +#define TEST_PTYPE ETH_P_LOOPBACK
> +#define DESC_NUM   256
> +
> +/* Used by implementation of kmalloc() in tools/virtio/linux/kernel.h */
> +void *__kmalloc_fake, *__kfree_ignore_start, *__kfree_ignore_end;
> +
> +struct vq_info {
> +   int kick;
> +   int call;
> +   int idx;
> +   long started;
> +   long completed;
> +   struct pollfd fds;
> +   void *ring;
> +   /* copy used for control */
> +   struct vring vring;
> +   struct virtqueue *vq;
> +};
> +
> +struct vdev_info {
> +   struct virtio_device vdev;
> +   int control;
> +   struct vq_info vqs[2];
> +   int nvqs;
> +   void *buf;
> +   size_t buf_size;
> +   char *test_buf;
> +   char *res_buf;
> +   struct vhost_memory *mem;
> +   int sock;
> +   int ifindex;
> +   unsigned char mac[ETHER_ADDR_LEN];
> +};
> +
> +static int tun_alloc(struct vdev_info *dev, char *tun_name)
> +{
> +   struct ifreq ifr;
> +   int len = HDR_LEN;
> +   int fd, e;
> +
> +   fd = open("/dev/net/tun", O_RDWR);
> +   if (fd < 0) {
> +   perror("Cannot open /dev/net/tun");
> +   return fd;
> +   }
> +
> +   memset(&ifr, 0, sizeof(ifr));
> +
> +   ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR;
> +   strncpy(ifr.ifr_name, tun_name, IFNAMSIZ);
> +
> +   e = ioctl(fd, TUNSETIFF, &ifr);
> +   if (e < 0) {
> +   perror("ioctl[TUNSETIFF]");
> +   close(fd);
> +   

Re: [PATCH] virtio: make virtio_bus const

2024-02-05 Thread Jason Wang
On Mon, Feb 5, 2024 at 4:52 AM Ricardo B. Marliere  wrote:
>
> Now that the driver core can properly handle constant struct bus_type,
> move the virtio_bus variable to be a constant structure as well,
> placing it into read-only memory which can not be modified at runtime.
>
> Cc: Greg Kroah-Hartman 
> Suggested-by: Greg Kroah-Hartman 
> Signed-off-by: Ricardo B. Marliere 
> ---

Acked-by: Jason Wang 

Thanks




Re: [PATCH net-next v4 5/5] tools: virtio: introduce vhost_net_test

2024-02-04 Thread Jason Wang
On Sun, Feb 4, 2024 at 11:50 AM Yunsheng Lin  wrote:
>
> On 2024/2/4 9:30, Jason Wang wrote:
> > On Fri, Feb 2, 2024 at 8:24 PM Yunsheng Lin  wrote:
> >>
> >> On 2024/2/2 12:05, Jason Wang wrote:
> >>> On Tue, Jan 30, 2024 at 7:38 PM Yunsheng Lin  
> >>> wrote:
> >>>>
> >>>> introduce vhost_net_test basing on virtio_test to test
> >>>> vhost_net changing in the kernel.
> >>>
> >>> Let's describe what kind of test is being done and how it is done here.
> >>
> >> How about something like below:
> >>
> >> This patch introduces testing for both vhost_net tx and rx.
> >> Steps for vhost_net tx testing:
> >> 1. Prepare a out buf
> >> 2. Kick the vhost_net to do tx processing
> >> 3. Do the receiving in the tun side
> >> 4. verify the data received by tun is correct
> >>
> >> Steps for vhost_net rx testing::
> >> 1. Prepare a in buf
> >> 2. Do the sending in the tun side
> >> 3. Kick the vhost_net to do rx processing
> >> 4. verify the data received by vhost_net is correct
> >
> > It looks like some important details were lost, e.g the logic for batching 
> > etc.
>
> I am supposeing you are referring to the virtio desc batch handling,
> right?

Yes.

>
> It was a copy & paste code of virtio_test.c, I was thinking about removing
> the virtio desc batch handling for now, as this patchset does not require
> that to do the testing, it mainly depend on the "sock->sk->sk_sndbuf" to
> be INT_MAX to call vhost_net_build_xdp(), which seems to be the default
> case for vhost_net.

Ok.

>
> >
> >>
>
> ...
>
> >>>> +static void vdev_create_socket(struct vdev_info *dev)
> >>>> +{
> >>>> +   struct ifreq ifr;
> >>>> +
> >>>> +   dev->sock = socket(AF_PACKET, SOCK_RAW, htons(TEST_PTYPE));
> >>>> +   assert(dev->sock != -1);
> >>>> +
> >>>> +   snprintf(ifr.ifr_name, IFNAMSIZ, "tun_%d", getpid());
> >>>
> >>> Nit: it might be better to accept the device name instead of repeating
> >>> the snprintf trick here, this would facilitate the future changes.
> >>
> >> I am not sure I understand what did you mean by "accept the device name"
> >> here.
> >>
> >> The above is used to get ifindex of the tun netdevice created in
> >> tun_alloc(), so that we can use it in vdev_send_packet() to send
> >> a packet using the tun netdevice created in tun_alloc(). Is there
> >> anything obvious I missed here?
> >
> > I meant a const char *ifname for this function and let the caller to
> > pass the name.
>
> Sure.
>
> >
> >>
>
> >>>> +
> >>>> +static void run_rx_test(struct vdev_info *dev, struct vq_info *vq,
> >>>> +   bool delayed, int batch, int bufs)
> >>>> +{
> >>>> +   const bool random_batch = batch == RANDOM_BATCH;
> >>>> +   long long spurious = 0;
> >>>> +   struct scatterlist sl;
> >>>> +   unsigned int len;
> >>>> +   int r;
> >>>> +
> >>>> +   for (;;) {
> >>>> +   long started_before = vq->started;
> >>>> +   long completed_before = vq->completed;
> >>>> +
> >>>> +   do {
> >>>> +   if (random_batch)
> >>>> +   batch = (random() % vq->vring.num) + 1;
> >>>> +
> >>>> +   while (vq->started < bufs &&
> >>>> +  (vq->started - vq->completed) < batch) {
> >>>> +   sg_init_one(, dev->res_buf, HDR_LEN + 
> >>>> TEST_BUF_LEN);
> >>>> +
> >>>> +   r = virtqueue_add_inbuf(vq->vq, , 1,
> >>>> +   dev->res_buf + 
> >>>> vq->started,
> >>>> +   GFP_ATOMIC);
> >>>> +   if (unlikely(r != 0)) {
> >>>> +   if (r == -ENOSPC &&
> >>>
> >>> Drivers usually maintain a #free_slots, this can help 

Re: [PATCH net-next v4 5/5] tools: virtio: introduce vhost_net_test

2024-02-03 Thread Jason Wang
On Fri, Feb 2, 2024 at 8:24 PM Yunsheng Lin  wrote:
>
> On 2024/2/2 12:05, Jason Wang wrote:
> > On Tue, Jan 30, 2024 at 7:38 PM Yunsheng Lin  wrote:
> >>
> >> introduce vhost_net_test basing on virtio_test to test
> >> vhost_net changing in the kernel.
> >
> > Let's describe what kind of test is being done and how it is done here.
>
> How about something like below:
>
> This patch introduces testing for both vhost_net tx and rx.
> Steps for vhost_net tx testing:
> 1. Prepare a out buf
> 2. Kick the vhost_net to do tx processing
> 3. Do the receiving in the tun side
> 4. verify the data received by tun is correct
>
> Steps for vhost_net rx testing::
> 1. Prepare a in buf
> 2. Do the sending in the tun side
> 3. Kick the vhost_net to do rx processing
> 4. verify the data received by vhost_net is correct

It looks like some important details were lost, e.g the logic for batching etc.

>
>
> >> +
> >> +static int tun_alloc(struct vdev_info *dev)
> >> +{
> >> +   struct ifreq ifr;
> >> +   int len = HDR_LEN;
> >
> > Any reason you can't just use the virtio_net uapi?
>
> I didn't find a macro for that in include/uapi/linux/virtio_net.h.
>
> Did you mean using something like below?
> sizeof(struct virtio_net_hdr_mrg_rxbuf)

Yes.

>
> >
> >> +   int fd, e;
> >> +
> >> +   fd = open("/dev/net/tun", O_RDWR);
> >> +   if (fd < 0) {
> >> +   perror("Cannot open /dev/net/tun");
> >> +   return fd;
> >> +   }
> >> +
> >> +   memset(&ifr, 0, sizeof(ifr));
> >> +
> >> +   ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR;
> >> +   snprintf(ifr.ifr_name, IFNAMSIZ, "tun_%d", getpid());
> >> +
> >> +   e = ioctl(fd, TUNSETIFF, &ifr);
> >> +   if (e < 0) {
> >> +   perror("ioctl[TUNSETIFF]");
> >> +   close(fd);
> >> +   return e;
> >> +   }
> >> +
> >> +   e = ioctl(fd, TUNSETVNETHDRSZ, &len);
> >> +   if (e < 0) {
> >> +   perror("ioctl[TUNSETVNETHDRSZ]");
> >> +   close(fd);
> >> +   return e;
> >> +   }
> >> +
> >> +   e = ioctl(fd, SIOCGIFHWADDR, &ifr);
> >> +   if (e < 0) {
> >> +   perror("ioctl[SIOCGIFHWADDR]");
> >> +   close(fd);
> >> +   return e;
> >> +   }
> >> +
> >> +   memcpy(dev->mac, &ifr.ifr_hwaddr.sa_data, ETHER_ADDR_LEN);
> >> +   return fd;
> >> +}
> >> +
> >> +static void vdev_create_socket(struct vdev_info *dev)
> >> +{
> >> +   struct ifreq ifr;
> >> +
> >> +   dev->sock = socket(AF_PACKET, SOCK_RAW, htons(TEST_PTYPE));
> >> +   assert(dev->sock != -1);
> >> +
> >> +   snprintf(ifr.ifr_name, IFNAMSIZ, "tun_%d", getpid());
> >
> > Nit: it might be better to accept the device name instead of repeating
> > the snprintf trick here, this would facilitate the future changes.
>
> I am not sure I understand what did you mean by "accept the device name"
> here.
>
> The above is used to get ifindex of the tun netdevice created in
> tun_alloc(), so that we can use it in vdev_send_packet() to send
> a packet using the tun netdevice created in tun_alloc(). Is there
> anything obvious I missed here?

I meant a const char *ifname for this function and let the caller to
pass the name.

>
> >
> >> +   assert(ioctl(dev->sock, SIOCGIFINDEX, &ifr) >= 0);
> >> +
> >> +   dev->ifindex = ifr.ifr_ifindex;
> >> +
> >> +   /* Set the flags that bring the device up */
> >> +   assert(ioctl(dev->sock, SIOCGIFFLAGS, &ifr) >= 0);
> >> +   ifr.ifr_flags |= (IFF_UP | IFF_RUNNING);
> >> +   assert(ioctl(dev->sock, SIOCSIFFLAGS, &ifr) >= 0);
> >> +}
> >> +
> >> +static void vdev_send_packet(struct vdev_info *dev)
> >> +{
> >> +   char *sendbuf = dev->test_buf + HDR_LEN;
> >> +   struct sockaddr_ll saddrll = {0};
> >> +   int sockfd = dev->sock;
> >> +   int ret;
> >> +
> >> +   saddrll.sll_family = PF_PACKET;
> >> +   saddrll.sll_ifindex = dev->ifindex;
> >> +   saddrll.sll_halen = ETH_ALEN;
> >> +   saddrll.sll_protocol = htons(TEST_P

Re: [PATCH net-next v4 5/5] tools: virtio: introduce vhost_net_test

2024-02-01 Thread Jason Wang
On Tue, Jan 30, 2024 at 7:38 PM Yunsheng Lin  wrote:
>
> introduce vhost_net_test basing on virtio_test to test
> vhost_net changing in the kernel.

Let's describe what kind of test is being done and how it is done here.

>
> Signed-off-by: Yunsheng Lin 
> ---
>  tools/virtio/.gitignore   |   1 +
>  tools/virtio/Makefile |   8 +-
>  tools/virtio/vhost_net_test.c | 576 ++
>  3 files changed, 582 insertions(+), 3 deletions(-)
>  create mode 100644 tools/virtio/vhost_net_test.c
>
> diff --git a/tools/virtio/.gitignore b/tools/virtio/.gitignore
> index 9934d48d9a55..7e47b281c442 100644
> --- a/tools/virtio/.gitignore
> +++ b/tools/virtio/.gitignore
> @@ -1,5 +1,6 @@
>  # SPDX-License-Identifier: GPL-2.0-only
>  *.d
>  virtio_test
> +vhost_net_test
>  vringh_test
>  virtio-trace/trace-agent
> diff --git a/tools/virtio/Makefile b/tools/virtio/Makefile
> index d128925980e0..e25e99c1c3b7 100644
> --- a/tools/virtio/Makefile
> +++ b/tools/virtio/Makefile
> @@ -1,8 +1,9 @@
>  # SPDX-License-Identifier: GPL-2.0
>  all: test mod
> -test: virtio_test vringh_test
> +test: virtio_test vringh_test vhost_net_test
>  virtio_test: virtio_ring.o virtio_test.o
>  vringh_test: vringh_test.o vringh.o virtio_ring.o
> +vhost_net_test: virtio_ring.o vhost_net_test.o
>
>  try-run = $(shell set -e;  \
> if ($(1)) >/dev/null 2>&1;  \
> @@ -49,6 +50,7 @@ oot-clean: OOT_BUILD+=clean
>
>  .PHONY: all test mod clean vhost oot oot-clean oot-build
>  clean:
> -   ${RM} *.o vringh_test virtio_test vhost_test/*.o vhost_test/.*.cmd \
> -  vhost_test/Module.symvers vhost_test/modules.order *.d
> +   ${RM} *.o vringh_test virtio_test vhost_net_test vhost_test/*.o \
> +  vhost_test/.*.cmd vhost_test/Module.symvers \
> +  vhost_test/modules.order *.d
>  -include *.d
> diff --git a/tools/virtio/vhost_net_test.c b/tools/virtio/vhost_net_test.c
> new file mode 100644
> index ..e336792a0d77
> --- /dev/null
> +++ b/tools/virtio/vhost_net_test.c
> @@ -0,0 +1,576 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#define _GNU_SOURCE
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define RANDOM_BATCH   -1
> +#define HDR_LEN12
> +#define TEST_BUF_LEN   256
> +#define TEST_PTYPE ETH_P_LOOPBACK
> +
> +/* Used by implementation of kmalloc() in tools/virtio/linux/kernel.h */
> +void *__kmalloc_fake, *__kfree_ignore_start, *__kfree_ignore_end;
> +
> +struct vq_info {
> +   int kick;
> +   int call;
> +   int idx;
> +   long started;
> +   long completed;
> +   struct pollfd fds;
> +   void *ring;
> +   /* copy used for control */
> +   struct vring vring;
> +   struct virtqueue *vq;
> +};
> +
> +struct vdev_info {
> +   struct virtio_device vdev;
> +   int control;
> +   struct vq_info vqs[2];
> +   int nvqs;
> +   void *buf;
> +   size_t buf_size;
> +   char *test_buf;
> +   char *res_buf;
> +   struct vhost_memory *mem;
> +   int sock;
> +   int ifindex;
> +   unsigned char mac[ETHER_ADDR_LEN];
> +};
> +
> +static int tun_alloc(struct vdev_info *dev)
> +{
> +   struct ifreq ifr;
> +   int len = HDR_LEN;

Any reason you can't just use the virtio_net uapi?

> +   int fd, e;
> +
> +   fd = open("/dev/net/tun", O_RDWR);
> +   if (fd < 0) {
> +   perror("Cannot open /dev/net/tun");
> +   return fd;
> +   }
> +
> +   memset(&ifr, 0, sizeof(ifr));
> +
> +   ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_VNET_HDR;
> +   snprintf(ifr.ifr_name, IFNAMSIZ, "tun_%d", getpid());
> +
> +   e = ioctl(fd, TUNSETIFF, &ifr);
> +   if (e < 0) {
> +   perror("ioctl[TUNSETIFF]");
> +   close(fd);
> +   return e;
> +   }
> +
> +   e = ioctl(fd, TUNSETVNETHDRSZ, &len);
> +   if (e < 0) {
> +   perror("ioctl[TUNSETVNETHDRSZ]");
> +   close(fd);
> +   return e;
> +   }
> +
> +   e = ioctl(fd, SIOCGIFHWADDR, &ifr);
> +   if (e < 0) {
> +   perror("ioctl[SIOCGIFHWADDR]");
> +   close(fd);
> +   return e;
> +   }
> +
> +   memcpy(dev->mac, &ifr.ifr_hwaddr.sa_data, ETHER_ADDR_LEN);
> +   return fd;
> +}
> +
> +static void vdev_create_socket(struct vdev_info *dev)
> +{
> +   struct ifreq ifr;
> +
> +   dev->sock = socket(AF_PACKET, SOCK_RAW, htons(TEST_PTYPE));
> +   assert(dev->sock != -1);
> +
> +   snprintf(ifr.ifr_name, IFNAMSIZ, "tun_%d", getpid());

Nit: it might be better to accept the device name instead of repeating
the snprintf trick here, this would facilitate the future changes.

> +   assert(ioctl(dev->sock, 

Re: [PATCH v4] virtio_net: Support RX hash XDP hint

2024-01-31 Thread Jason Wang
On Wed, Jan 31, 2024 at 11:55 AM Liang Chen  wrote:
>
> The RSS hash report is a feature that's part of the virtio specification.
> Currently, virtio backends like qemu, vdpa (mlx5), and potentially vhost
> (still a work in progress as per [1]) support this feature. While the
> capability to obtain the RSS hash has been enabled in the normal path,
> it's currently missing in the XDP path. Therefore, we are introducing
> XDP hints through kfuncs to allow XDP programs to access the RSS hash.
>
> 1.
> https://lore.kernel.org/all/20231015141644.260646-1-akihiko.od...@daynix.com/#r
>
> Signed-off-by: Liang Chen 
> Reviewed-by: Xuan Zhuo 

Acked-by: Jason Wang 

Thanks




Re: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support

2024-01-29 Thread Jason Wang
On Mon, Jan 29, 2024 at 7:40 PM wangyunjian  wrote:
>
> > -Original Message-
> > From: Jason Wang [mailto:jasow...@redhat.com]
> > Sent: Monday, January 29, 2024 11:03 AM
> > To: wangyunjian 
> > Cc: m...@redhat.com; willemdebruijn.ker...@gmail.com; k...@kernel.org;
> > da...@davemloft.net; magnus.karls...@intel.com; net...@vger.kernel.org;
> > linux-kernel@vger.kernel.org; k...@vger.kernel.org;
> > virtualizat...@lists.linux.dev; xudingke 
> > Subject: Re: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support
> >
> > On Thu, Jan 25, 2024 at 8:54 PM wangyunjian 
> > wrote:
> > >
> > >
> > >
> > > > -Original Message-
> > > > From: Jason Wang [mailto:jasow...@redhat.com]
> > > > Sent: Thursday, January 25, 2024 12:49 PM
> > > > To: wangyunjian 
> > > > Cc: m...@redhat.com; willemdebruijn.ker...@gmail.com;
> > > > k...@kernel.org; da...@davemloft.net; magnus.karls...@intel.com;
> > > > net...@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > > k...@vger.kernel.org; virtualizat...@lists.linux.dev; xudingke
> > > > 
> > > > Subject: Re: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support
> > > >
> > > > On Wed, Jan 24, 2024 at 5:38 PM Yunjian Wang
> > > > 
> > > > wrote:
> > > > >
> > > > > Now the zero-copy feature of AF_XDP socket is supported by some
> > > > > drivers, which can reduce CPU utilization on the xdp program.
> > > > > This patch set allows tun to support AF_XDP Rx zero-copy feature.
> > > > >
> > > > > This patch tries to address this by:
> > > > > - Use peek_len to consume a xsk->desc and get xsk->desc length.
> > > > > - When the tun support AF_XDP Rx zero-copy, the vq's array maybe 
> > > > > empty.
> > > > > So add a check for empty vq's array in vhost_net_buf_produce().
> > > > > - add XDP_SETUP_XSK_POOL and ndo_xsk_wakeup callback support
> > > > > - add tun_put_user_desc function to copy the Rx data to VM
> > > >
> > > > Code explains themselves, let's explain why you need to do this.
> > > >
> > > > 1) why you want to use peek_len
> > > > 2) for "vq's array", what does it mean?
> > > > 3) from the view of TUN/TAP tun_put_user_desc() is the TX path, so I
> > > > guess you meant TX zerocopy instead of RX (as I don't see codes for
> > > > RX?)
> > >
> > > OK, I agree and use TX zerocopy instead of RX zerocopy. I meant RX
> > > zerocopy from the view of vhost-net.
> >
> > Ok.
> >
> > >
> > > >
> > > > A big question is how could you handle GSO packets from
> > userspace/guests?
> > >
> > > Now by disabling VM's TSO and csum feature.
> >
> > Btw, how could you do that?
>
> By set network backend-specific options:
> 
>  mrg_rxbuf='off'/>
> 
> 

This is the mgmt work, but the problem is what happens if GSO is not
disabled in the guest, or is there a way to:

1) forcing the guest GSO to be off
2) a graceful fallback

Thanks

>
> Thanks
>
> >
> > Thanks
> >
>




Re: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support

2024-01-29 Thread Jason Wang
On Mon, Jan 29, 2024 at 7:10 PM wangyunjian  wrote:
>
> > -Original Message-
> > From: Jason Wang [mailto:jasow...@redhat.com]
> > Sent: Monday, January 29, 2024 11:05 AM
> > To: wangyunjian 
> > Cc: m...@redhat.com; willemdebruijn.ker...@gmail.com; k...@kernel.org;
> > da...@davemloft.net; magnus.karls...@intel.com; net...@vger.kernel.org;
> > linux-kernel@vger.kernel.org; k...@vger.kernel.org;
> > virtualizat...@lists.linux.dev; xudingke 
> > Subject: Re: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support
> >
> > On Sat, Jan 27, 2024 at 5:34 PM wangyunjian 
> > wrote:
> > >
> > > > > -Original Message-
> > > > > From: Jason Wang [mailto:jasow...@redhat.com]
> > > > > Sent: Thursday, January 25, 2024 12:49 PM
> > > > > To: wangyunjian 
> > > > > Cc: m...@redhat.com; willemdebruijn.ker...@gmail.com;
> > > > > k...@kernel.org; da...@davemloft.net; magnus.karls...@intel.com;
> > > > > net...@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > > > k...@vger.kernel.org; virtualizat...@lists.linux.dev; xudingke
> > > > > 
> > > > > Subject: Re: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support
> > > > >
> > > > > On Wed, Jan 24, 2024 at 5:38 PM Yunjian Wang
> > > > 
> > > > > wrote:
> > > > > >
> > > > > > Now the zero-copy feature of AF_XDP socket is supported by some
> > > > > > drivers, which can reduce CPU utilization on the xdp program.
> > > > > > This patch set allows tun to support AF_XDP Rx zero-copy feature.
> > > > > >
> > > > > > This patch tries to address this by:
> > > > > > - Use peek_len to consume a xsk->desc and get xsk->desc length.
> > > > > > - When the tun support AF_XDP Rx zero-copy, the vq's array maybe
> > empty.
> > > > > > So add a check for empty vq's array in vhost_net_buf_produce().
> > > > > > - add XDP_SETUP_XSK_POOL and ndo_xsk_wakeup callback support
> > > > > > - add tun_put_user_desc function to copy the Rx data to VM
> > > > >
> > > > > Code explains themselves, let's explain why you need to do this.
> > > > >
> > > > > 1) why you want to use peek_len
> > > > > 2) for "vq's array", what does it mean?
> > > > > 3) from the view of TUN/TAP tun_put_user_desc() is the TX path, so
> > > > > I guess you meant TX zerocopy instead of RX (as I don't see codes
> > > > > for
> > > > > RX?)
> > > >
> > > > OK, I agree and use TX zerocopy instead of RX zerocopy. I meant RX
> > > > zerocopy from the view of vhost-net.
> > > >
> > > > >
> > > > > A big question is how could you handle GSO packets from
> > userspace/guests?
> > > >
> > > > Now by disabling VM's TSO and csum feature. XDP does not support GSO
> > > > packets.
> > > > However, this feature can be added once XDP supports it in the future.
> > > >
> > > > >
> > > > > >
> > > > > > Signed-off-by: Yunjian Wang 
> > > > > > ---
> > > > > >  drivers/net/tun.c   | 165
> > > > > +++-
> > > > > >  drivers/vhost/net.c |  18 +++--
> > > > > >  2 files changed, 176 insertions(+), 7 deletions(-)
> > >
> > > [...]
> > >
> > > > > >
> > > > > >  static int peek_head_len(struct vhost_net_virtqueue *rvq,
> > > > > > struct sock
> > > > > > *sk)  {
> > > > > > +   struct socket *sock = sk->sk_socket;
> > > > > > struct sk_buff *head;
> > > > > > int len = 0;
> > > > > > unsigned long flags;
> > > > > >
> > > > > > -   if (rvq->rx_ring)
> > > > > > -   return vhost_net_buf_peek(rvq);
> > > > > > +   if (rvq->rx_ring) {
> > > > > > +   len = vhost_net_buf_peek(rvq);
> > > > > > +   if (likely(len))
> > > > > > +   return len;
> > > > > > +   }
> > > > > > +
> > > > > > +   if (sock->ops->peek_len)
> > > > > > +   return sock->ops->peek_len(sock);
> > > > >
> > > > > What prevents you from reusing the ptr_ring here? Then you don't
> > > > > need the above tricks.
> > > >
> > > > Thank you for your suggestion. I will consider how to reuse the 
> > > > ptr_ring.
> > >
> > > If ptr_ring is used to transfer xdp_descs, there is a problem: After
> > > some xdp_descs are obtained through xsk_tx_peek_desc(), the descs may
> > > fail to be added to ptr_ring. However, no API is available to
> > > implement the rollback function.
> >
> > I don't understand, this issue seems to exist in the physical NIC as well?
> >
> > We get more descriptors than the free slots in the NIC ring.
> >
> > How did other NIC solve this issue?
>
> Currently, physical NICs such as i40e, ice, ixgbe, igc, and mlx5 obtains
> available NIC descriptors and then retrieve the same number of xsk
> descriptors for processing.

Any reason we can't do the same? ptr_ring should be much simpler than
NIC ring anyhow.

Thanks

>
> Thanks
>
> >
> > Thanks
> >
> > >
> > > Thanks
> > >
> > > >
> > > > >
> > > > > Thanks
> > > > >
> > > > >
> > > > > >
> > > > > > spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
> > > > > > head = skb_peek(&sk->sk_receive_queue);
> > > > > > --
> > > > > > 2.33.0
> > > > > >
> > >
>




Re: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support

2024-01-28 Thread Jason Wang
On Sat, Jan 27, 2024 at 5:34 PM wangyunjian  wrote:
>
> > > -Original Message-
> > > From: Jason Wang [mailto:jasow...@redhat.com]
> > > Sent: Thursday, January 25, 2024 12:49 PM
> > > To: wangyunjian 
> > > Cc: m...@redhat.com; willemdebruijn.ker...@gmail.com; k...@kernel.org;
> > > da...@davemloft.net; magnus.karls...@intel.com;
> > > net...@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > k...@vger.kernel.org; virtualizat...@lists.linux.dev; xudingke
> > > 
> > > Subject: Re: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support
> > >
> > > On Wed, Jan 24, 2024 at 5:38 PM Yunjian Wang
> > 
> > > wrote:
> > > >
> > > > Now the zero-copy feature of AF_XDP socket is supported by some
> > > > drivers, which can reduce CPU utilization on the xdp program.
> > > > This patch set allows tun to support AF_XDP Rx zero-copy feature.
> > > >
> > > > This patch tries to address this by:
> > > > - Use peek_len to consume a xsk->desc and get xsk->desc length.
> > > > - When the tun support AF_XDP Rx zero-copy, the vq's array maybe empty.
> > > > So add a check for empty vq's array in vhost_net_buf_produce().
> > > > - add XDP_SETUP_XSK_POOL and ndo_xsk_wakeup callback support
> > > > - add tun_put_user_desc function to copy the Rx data to VM
> > >
> > > Code explains themselves, let's explain why you need to do this.
> > >
> > > 1) why you want to use peek_len
> > > 2) for "vq's array", what does it mean?
> > > 3) from the view of TUN/TAP tun_put_user_desc() is the TX path, so I
> > > guess you meant TX zerocopy instead of RX (as I don't see codes for
> > > RX?)
> >
> > OK, I agree and use TX zerocopy instead of RX zerocopy. I meant RX zerocopy
> > from the view of vhost-net.
> >
> > >
> > > A big question is how could you handle GSO packets from userspace/guests?
> >
> > Now by disabling VM's TSO and csum feature. XDP does not support GSO
> > packets.
> > However, this feature can be added once XDP supports it in the future.
> >
> > >
> > > >
> > > > Signed-off-by: Yunjian Wang 
> > > > ---
> > > >  drivers/net/tun.c   | 165
> > > +++-
> > > >  drivers/vhost/net.c |  18 +++--
> > > >  2 files changed, 176 insertions(+), 7 deletions(-)
>
> [...]
>
> > > >
> > > >  static int peek_head_len(struct vhost_net_virtqueue *rvq, struct
> > > > sock
> > > > *sk)  {
> > > > +   struct socket *sock = sk->sk_socket;
> > > > struct sk_buff *head;
> > > > int len = 0;
> > > > unsigned long flags;
> > > >
> > > > -   if (rvq->rx_ring)
> > > > -   return vhost_net_buf_peek(rvq);
> > > > +   if (rvq->rx_ring) {
> > > > +   len = vhost_net_buf_peek(rvq);
> > > > +   if (likely(len))
> > > > +   return len;
> > > > +   }
> > > > +
> > > > +   if (sock->ops->peek_len)
> > > > +   return sock->ops->peek_len(sock);
> > >
> > > What prevents you from reusing the ptr_ring here? Then you don't need
> > > the above tricks.
> >
> > Thank you for your suggestion. I will consider how to reuse the ptr_ring.
>
> If ptr_ring is used to transfer xdp_descs, there is a problem: After some
> xdp_descs are obtained through xsk_tx_peek_desc(), the descs may fail
> to be added to ptr_ring. However, no API is available to implement the
> rollback function.

I don't understand, this issue seems to exist in the physical NIC as well?

We get more descriptors than the free slots in the NIC ring.

How did other NIC solve this issue?

Thanks

>
> Thanks
>
> >
> > >
> > > Thanks
> > >
> > >
> > > >
> > > > spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
> > > > head = skb_peek(&sk->sk_receive_queue);
> > > > --
> > > > 2.33.0
> > > >
>




Re: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support

2024-01-28 Thread Jason Wang
On Thu, Jan 25, 2024 at 8:54 PM wangyunjian  wrote:
>
>
>
> > -Original Message-
> > From: Jason Wang [mailto:jasow...@redhat.com]
> > Sent: Thursday, January 25, 2024 12:49 PM
> > To: wangyunjian 
> > Cc: m...@redhat.com; willemdebruijn.ker...@gmail.com; k...@kernel.org;
> > da...@davemloft.net; magnus.karls...@intel.com; net...@vger.kernel.org;
> > linux-kernel@vger.kernel.org; k...@vger.kernel.org;
> > virtualizat...@lists.linux.dev; xudingke 
> > Subject: Re: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support
> >
> > On Wed, Jan 24, 2024 at 5:38 PM Yunjian Wang 
> > wrote:
> > >
> > > Now the zero-copy feature of AF_XDP socket is supported by some
> > > drivers, which can reduce CPU utilization on the xdp program.
> > > This patch set allows tun to support AF_XDP Rx zero-copy feature.
> > >
> > > This patch tries to address this by:
> > > - Use peek_len to consume a xsk->desc and get xsk->desc length.
> > > - When the tun support AF_XDP Rx zero-copy, the vq's array maybe empty.
> > > So add a check for empty vq's array in vhost_net_buf_produce().
> > > - add XDP_SETUP_XSK_POOL and ndo_xsk_wakeup callback support
> > > - add tun_put_user_desc function to copy the Rx data to VM
> >
> > Code explains themselves, let's explain why you need to do this.
> >
> > 1) why you want to use peek_len
> > 2) for "vq's array", what does it mean?
> > 3) from the view of TUN/TAP tun_put_user_desc() is the TX path, so I guess 
> > you
> > meant TX zerocopy instead of RX (as I don't see codes for
> > RX?)
>
> OK, I agree and use TX zerocopy instead of RX zerocopy. I meant RX zerocopy
> from the view of vhost-net.

Ok.

>
> >
> > A big question is how could you handle GSO packets from userspace/guests?
>
> Now by disabling VM's TSO and csum feature.

Btw, how could you do that?

Thanks




Re: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support

2024-01-24 Thread Jason Wang
On Wed, Jan 24, 2024 at 5:38 PM Yunjian Wang  wrote:
>
> Now the zero-copy feature of AF_XDP socket is supported by some
> drivers, which can reduce CPU utilization on the xdp program.
> This patch set allows tun to support AF_XDP Rx zero-copy feature.
>
> This patch tries to address this by:
> - Use peek_len to consume a xsk->desc and get xsk->desc length.
> - When the tun support AF_XDP Rx zero-copy, the vq's array maybe empty.
> So add a check for empty vq's array in vhost_net_buf_produce().
> - add XDP_SETUP_XSK_POOL and ndo_xsk_wakeup callback support
> - add tun_put_user_desc function to copy the Rx data to VM

Code explains themselves, let's explain why you need to do this.

1) why you want to use peek_len
2) for "vq's array", what does it mean?
3) from the view of TUN/TAP tun_put_user_desc() is the TX path, so I
guess you meant TX zerocopy instead of RX (as I don't see codes for
RX?)

A big question is how could you handle GSO packets from userspace/guests?

>
> Signed-off-by: Yunjian Wang 
> ---
>  drivers/net/tun.c   | 165 +++-
>  drivers/vhost/net.c |  18 +++--
>  2 files changed, 176 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index afa5497f7c35..248b0f8e07d1 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -77,6 +77,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include 
>  #include 
> @@ -145,6 +146,10 @@ struct tun_file {
> struct tun_struct *detached;
> struct ptr_ring tx_ring;
> struct xdp_rxq_info xdp_rxq;
> +   struct xdp_desc desc;
> +   /* protects xsk pool */
> +   spinlock_t pool_lock;
> +   struct xsk_buff_pool *pool;
>  };
>
>  struct tun_page {
> @@ -208,6 +213,8 @@ struct tun_struct {
> struct bpf_prog __rcu *xdp_prog;
> struct tun_prog __rcu *steering_prog;
> struct tun_prog __rcu *filter_prog;
> +   /* tracks AF_XDP ZC enabled queues */
> +   unsigned long *af_xdp_zc_qps;
> struct ethtool_link_ksettings link_ksettings;
> /* init args */
> struct file *file;
> @@ -795,6 +802,8 @@ static int tun_attach(struct tun_struct *tun, struct file 
> *file,
>
> tfile->queue_index = tun->numqueues;
> tfile->socket.sk->sk_shutdown &= ~RCV_SHUTDOWN;
> +   tfile->desc.len = 0;
> +   tfile->pool = NULL;
>
> if (tfile->detached) {
> /* Re-attach detached tfile, updating XDP queue_index */
> @@ -989,6 +998,13 @@ static int tun_net_init(struct net_device *dev)
> return err;
> }
>
> +   tun->af_xdp_zc_qps = bitmap_zalloc(MAX_TAP_QUEUES, GFP_KERNEL);
> +   if (!tun->af_xdp_zc_qps) {
> +   security_tun_dev_free_security(tun->security);
> +   free_percpu(dev->tstats);
> +   return -ENOMEM;
> +   }
> +
> tun_flow_init(tun);
>
> dev->hw_features = NETIF_F_SG | NETIF_F_FRAGLIST |
> @@ -1009,6 +1025,7 @@ static int tun_net_init(struct net_device *dev)
> tun_flow_uninit(tun);
> security_tun_dev_free_security(tun->security);
> free_percpu(dev->tstats);
> +   bitmap_free(tun->af_xdp_zc_qps);
> return err;
> }
> return 0;
> @@ -1222,11 +1239,77 @@ static int tun_xdp_set(struct net_device *dev, struct 
> bpf_prog *prog,
> return 0;
>  }
>
> +static int tun_xsk_pool_enable(struct net_device *netdev,
> +  struct xsk_buff_pool *pool,
> +  u16 qid)
> +{
> +   struct tun_struct *tun = netdev_priv(netdev);
> +   struct tun_file *tfile;
> +   unsigned long flags;
> +
> +   rcu_read_lock();
> +   tfile = rtnl_dereference(tun->tfiles[qid]);
> +   if (!tfile) {
> +   rcu_read_unlock();
> +   return -ENODEV;
> +   }
> +
> > +   spin_lock_irqsave(&tfile->pool_lock, flags);
> > +   xsk_pool_set_rxq_info(pool, &tfile->xdp_rxq);
> > +   tfile->pool = pool;
> > +   spin_unlock_irqrestore(&tfile->pool_lock, flags);
> +
> +   rcu_read_unlock();
> +   set_bit(qid, tun->af_xdp_zc_qps);
> +
> +   return 0;
> +}
> +
> +static int tun_xsk_pool_disable(struct net_device *netdev, u16 qid)
> +{
> +   struct tun_struct *tun = netdev_priv(netdev);
> +   struct tun_file *tfile;
> +   unsigned long flags;
> +
> +   if (!test_bit(qid, tun->af_xdp_zc_qps))
> +   return 0;
> +
> +   clear_bit(qid, tun->af_xdp_zc_qps);
> +
> +   rcu_read_lock();
> +   tfile = rtnl_dereference(tun->tfiles[qid]);
> +   if (!tfile) {
> +   rcu_read_unlock();
> +   return 0;
> +   }
> +
> > +   spin_lock_irqsave(&tfile->pool_lock, flags);
> > +   if (tfile->desc.len) {
> > +   xsk_tx_completed(tfile->pool, 1);
> > +   tfile->desc.len = 0;
> > +   }
> > +   tfile->pool = NULL;
> > +   spin_unlock_irqrestore(&tfile->pool_lock, flags);
> 

Re: [PATCH v2 2/3] virtio_net: Add missing virtio header in skb for XDP_PASS

2024-01-24 Thread Jason Wang
On Wed, Jan 24, 2024 at 5:16 PM Xuan Zhuo  wrote:
>
> On Wed, 24 Jan 2024 16:57:20 +0800, Liang Chen  
> wrote:
> > For the XDP_PASS scenario of the XDP path, the skb constructed with
> > xdp_buff does not include the virtio header. Adding the virtio header
> > information back when creating the skb.
> >
> > Signed-off-by: Liang Chen 
> > ---
> >  drivers/net/virtio_net.c | 6 ++
> >  1 file changed, 6 insertions(+)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index b56828804e5f..2de46eb4c661 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -1270,6 +1270,9 @@ static struct sk_buff *receive_small_xdp(struct 
> > net_device *dev,
> >   if (unlikely(!skb))
> >   goto err;
> >
> > + /* Store the original virtio header for subsequent use by the driver. 
> > */
> > + memcpy(skb_vnet_common_hdr(skb), _xdp.hdr, vi->hdr_len);
>
> About this, a spec is waiting for voting.
>

A pointer?

> This may change the logic of the csum offset and so on.

 Btw, doesn't it need a new feature bit?

Thanks

>
> Please not do this.
>
> Thanks.
>
>
> > +
> >   if (metasize)
> >   skb_metadata_set(skb, metasize);
> >
> > @@ -1635,6 +1638,9 @@ static struct sk_buff *receive_mergeable_xdp(struct 
> > net_device *dev,
> >   head_skb = build_skb_from_xdp_buff(dev, vi, xdp, 
> > xdp_frags_truesz);
> >   if (unlikely(!head_skb))
> >   break;
> > + /* Store the original virtio header for subsequent use by the 
> > driver. */
> > + memcpy(skb_vnet_common_hdr(head_skb), _xdp.hdr, 
> > vi->hdr_len);
> > +
> >   return head_skb;
> >
> >   case XDP_TX:
> > --
> > 2.40.1
> >
>




Re: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support

2024-01-24 Thread Jason Wang
On Thu, Jan 25, 2024 at 3:05 AM Willem de Bruijn
 wrote:
>
> Yunjian Wang wrote:
> > Now the zero-copy feature of AF_XDP socket is supported by some
> > drivers, which can reduce CPU utilization on the xdp program.
> > This patch set allows tun to support AF_XDP Rx zero-copy feature.
> >
> > This patch tries to address this by:
> > - Use peek_len to consume a xsk->desc and get xsk->desc length.
> > - When the tun support AF_XDP Rx zero-copy, the vq's array maybe empty.
> > So add a check for empty vq's array in vhost_net_buf_produce().
> > - add XDP_SETUP_XSK_POOL and ndo_xsk_wakeup callback support
> > - add tun_put_user_desc function to copy the Rx data to VM
> >
> > Signed-off-by: Yunjian Wang 
>
> I don't fully understand the higher level design of this feature yet.
>
> But some initial comments at the code level.
>
> > ---
> >  drivers/net/tun.c   | 165 +++-
> >  drivers/vhost/net.c |  18 +++--
> >  2 files changed, 176 insertions(+), 7 deletions(-)
> >

[...]

> >  struct tun_page {
> > @@ -208,6 +21
> >
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index f2ed7167c848..a1f143ad2341 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
>
> For virtio maintainer: is it okay to have tun and vhost/net changes in
> the same patch, or is it better to split them?

It's better to split, but as you comment below, if it must be done in
one patch we need to explain why.

>
> > @@ -169,9 +169,10 @@ static int vhost_net_buf_is_empty(struct vhost_net_buf 
> > *rxq)
> >
> >  static void *vhost_net_buf_consume(struct vhost_net_buf *rxq)
> >  {
> > - void *ret = vhost_net_buf_get_ptr(rxq);
> > - ++rxq->head;
> > - return ret;
> > + if (rxq->tail == rxq->head)
> > + return NULL;
> > +
> > + return rxq->queue[rxq->head++];
>
> Why this change?

Thanks




Re: [PATCH] virtio_net: Support RX hash XDP hint

2024-01-23 Thread Jason Wang
On Mon, Jan 22, 2024 at 6:23 PM Liang Chen  wrote:
>
> The RSS hash report is a feature that's part of the virtio specification.
> Currently, virtio backends like qemu, vdpa (mlx5), and potentially vhost
> (still a work in progress as per [1]) support this feature. While the
> capability to obtain the RSS hash has been enabled in the normal path,
> it's currently missing in the XDP path. Therefore, we are introducing XDP
> hints through kfuncs to allow XDP programs to access the RSS hash.
>
> Signed-off-by: Liang Chen 
> ---
>  drivers/net/virtio_net.c | 56 
>  1 file changed, 56 insertions(+)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index d7ce4a1011ea..1463a4709e3c 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -4579,6 +4579,60 @@ static void virtnet_set_big_packets(struct 
> virtnet_info *vi, const int mtu)
> }
>  }
>
> +static int virtnet_xdp_rx_hash(const struct xdp_md *_ctx, u32 *hash,
> +  enum xdp_rss_hash_type *rss_type)
> +{
> +   const struct xdp_buff *xdp = (void *)_ctx;
> +   struct virtio_net_hdr_v1_hash *hdr_hash;
> +   struct virtnet_info *vi;
> +
> +   if (!(xdp->rxq->dev->features & NETIF_F_RXHASH))
> +   return -ENODATA;
> +
> +   vi = netdev_priv(xdp->rxq->dev);
> +   hdr_hash = (struct virtio_net_hdr_v1_hash *)(xdp->data - vi->hdr_len);

Is there a guarantee that the hdr is not modified?

Thanks




Re: [PATCH V1] vdpa_sim: reset must not run

2024-01-21 Thread Jason Wang
On Thu, Jan 18, 2024 at 3:23 AM Steve Sistare  wrote:
>
> vdpasim_do_reset sets running to true, which is wrong, as it allows
> vdpasim_kick_vq to post work requests before the device has been
> configured.  To fix, do not set running until VIRTIO_CONFIG_S_FEATURES_OK
> is set.
>
> Fixes: 0c89e2a3a9d0 ("vdpa_sim: Implement suspend vdpa op")
> Signed-off-by: Steve Sistare 
> Reviewed-by: Eugenio Pérez 

Acked-by: Jason Wang 

Thanks




Re: [RFC V1 00/13] vdpa live update

2024-01-21 Thread Jason Wang
On Thu, Jan 18, 2024 at 4:32 AM Steven Sistare
 wrote:
>
> On 1/10/2024 9:55 PM, Jason Wang wrote:
> > On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare  
> > wrote:
> >>
> >> Live update is a technique wherein an application saves its state, exec's
> >> to an updated version of itself, and restores its state.  Clients of the
> >> application experience a brief suspension of service, on the order of
> >> 100's of milliseconds, but are otherwise unaffected.
> >>
> >> Define and implement interfaces that allow vdpa devices to be preserved
> >> across fork or exec, to support live update for applications such as qemu.
> >> The device must be suspended during the update, but its dma mappings are
> >> preserved, so the suspension is brief.
> >>
> >> The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory
> >> accounting from one process to another.
> >>
> >> The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that
> >> VHOST_NEW_OWNER is supported.
> >>
> >> The VHOST_IOTLB_REMAP message type updates a dma mapping with its userland
> >> address in the new process.
> >>
> >> The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that
> >> VHOST_IOTLB_REMAP is supported and required.  Some devices do not
> >> require it, because the userland address of each dma mapping is discarded
> >> after being translated to a physical address.
> >>
> >> Here is a pseudo-code sequence for performing live update, based on
> >> suspend + reset because resume is not yet available.  The vdpa device
> >> descriptor, fd, remains open across the exec.
> >>
> >>   ioctl(fd, VHOST_VDPA_SUSPEND)
> >>   ioctl(fd, VHOST_VDPA_SET_STATUS, 0)
> >>   exec
> >
> > Is there a userspace implementation as a reference?
>
> I have working patches for qemu that use these ioctl's, but they depend on 
> other
> qemu cpr patches that are a work in progress, and not posted yet.  I'm 
> working on
> that.

Ok.

>
> >>   ioctl(fd, VHOST_NEW_OWNER)
> >>
> >>   issue ioctls to re-create vrings
> >>
> >>   if VHOST_BACKEND_F_IOTLB_REMAP
> >>   foreach dma mapping
> >>   write(fd, {VHOST_IOTLB_REMAP, new_addr})
> >
> > I think I need to understand the advantages of this approach. For
> > example, why it is better than
> >
> > ioctl(VHOST_RESET_OWNER)
> > exec
> >
> > ioctl(VHOST_SET_OWNER)
> >
> > for each dma mapping
> >  ioctl(VHOST_IOTLB_UPDATE)
>
> That is slower.  VHOST_RESET_OWNER unbinds physical pages, and 
> VHOST_IOTLB_UPDATE
> rebinds them.  It costs multiple seconds for large memories, and is incurred 
> during the
> virtual machine's pause time during live update.  For comparison, the total 
> pause time
> for live update with vfio interfaces is ~100 millis.
>
> However, the interaction with userland is so similar that the same code paths 
> can be used.
> In my qemu prototype, after cpr exec's new qemu:
>   - vhost_vdpa_set_owner() calls VHOST_NEW_OWNER instead of VHOST_SET_OWNER
>   - vhost_vdpa_dma_map() sets type VHOST_IOTLB_REMAP instead of 
> VHOST_IOTLB_UPDATE
>
> - Steve
>

Ok, let's document this in the changelog.

Thanks




Re: [RFC V1 05/13] vhost-vdpa: VHOST_IOTLB_REMAP

2024-01-21 Thread Jason Wang
On Thu, Jan 18, 2024 at 4:32 AM Steven Sistare
 wrote:
>
> On 1/10/2024 10:08 PM, Jason Wang wrote:
> > On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare  
> > wrote:
> >>
> >> When device ownership is passed to a new process via VHOST_NEW_OWNER,
> >> some devices need to know the new userland addresses of the dma mappings.
> >> Define the new iotlb message type VHOST_IOTLB_REMAP to update the uaddr
> >> of a mapping.  The new uaddr must address the same memory object as
> >> originally mapped.
> >>
> >> The user must suspend the device before the old address is invalidated,
> >> and cannot resume it until after VHOST_IOTLB_REMAP is called, but this
> >> requirement is not enforced by the API.
> >>
> >> Signed-off-by: Steve Sistare 
> >> ---
> >>  drivers/vhost/vdpa.c | 34 
> >>  include/uapi/linux/vhost_types.h | 11 ++-
> >>  2 files changed, 44 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> >> index faed6471934a..ec5ca20bd47d 100644
> >> --- a/drivers/vhost/vdpa.c
> >> +++ b/drivers/vhost/vdpa.c
> >> @@ -1219,6 +1219,37 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
> >>
> >>  }
> >>
> >> +static int vhost_vdpa_process_iotlb_remap(struct vhost_vdpa *v,
> >> + struct vhost_iotlb *iotlb,
> >> + struct vhost_iotlb_msg *msg)
> >> +{
> >> +   struct vdpa_device *vdpa = v->vdpa;
> >> +   const struct vdpa_config_ops *ops = vdpa->config;
> >> +   u32 asid = iotlb_to_asid(iotlb);
> >> +   u64 start = msg->iova;
> >> +   u64 last = start + msg->size - 1;
> >> +   struct vhost_iotlb_map *map;
> >> +   int r = 0;
> >> +
> >> +   if (msg->perm || !msg->size)
> >> +   return -EINVAL;
> >> +
> >> +   map = vhost_iotlb_itree_first(iotlb, start, last);
> >> +   if (!map)
> >> +   return -ENOENT;
> >> +
> >> +   if (map->start != start || map->last != last)
> >> +   return -EINVAL;
> >> +
> >> +   /* batch will finish with remap.  non-batch must do it now. */
> >> +   if (!v->in_batch)
> >> +   r = ops->set_map(vdpa, asid, iotlb);
> >> +   if (!r)
> >> +   map->addr = msg->uaddr;
> >
> > I may miss something, for example for PA mapping,
> >
> > 1) need to convert uaddr into phys addr
> > 2) need to check whether the uaddr is backed by the same page or not?
>
> This code does not verify that the new size@uaddr points to the same physical
> pages as the old size@uaddr.  If the app screws up and they differ, then the 
> app
> may corrupt its own memory, but no-one else's.
>
> It would be expensive for large memories to verify page by page, O(npages), 
> and such
> verification lies on the critical path for virtual machine downtime during 
> live update.
> I could compare the properties of the vma(s) for the old size@uaddr vs the 
> vma for the
> new, but that is more complicated and would be a maintenance headache.  When 
> I submitted
> such code to Alex W when writing the equivalent patches for vfio, he said 
> don't check,
> correctness is the user's responsibility.

Ok, let's document this somewhere.

Thanks

>
> - Steve
>
> >> +
> >> +   return r;
> >> +}
> >> +
> >>  static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
> >>struct vhost_iotlb *iotlb,
> >>struct vhost_iotlb_msg *msg)
> >> @@ -1298,6 +1329,9 @@ static int vhost_vdpa_process_iotlb_msg(struct 
> >> vhost_dev *dev, u32 asid,
> >> ops->set_map(vdpa, asid, iotlb);
> >> v->in_batch = false;
> >> break;
> >> +   case VHOST_IOTLB_REMAP:
> >> +   r = vhost_vdpa_process_iotlb_remap(v, iotlb, msg);
> >> +   break;
> >> default:
> >> r = -EINVAL;
> >> break;
> >> diff --git a/include/uapi/linux/vhost_types.h 
> >> b/include/uapi/linux/vhost_types.h
> >> index 9177843951e9..35908315ff55 100644
> >> --- a/include/uapi/linux/vhost_t

Re: [RFC V1 07/13] vhost-vdpa: flush workers on suspend

2024-01-11 Thread Jason Wang
On Fri, Jan 12, 2024 at 12:18 AM Mike Christie
 wrote:
>
> On 1/10/24 9:09 PM, Jason Wang wrote:
> > On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare  
> > wrote:
> >>
> >> To pass ownership of a live vdpa device to a new process, the user
> >> suspends the device, calls VHOST_NEW_OWNER to change the mm, and calls
> >> VHOST_IOTLB_REMAP to change the user virtual addresses to match the new
> >> mm.  Flush workers in suspend to guarantee that no worker sees the new
> >> mm and old VA in between.
> >>
> >> Signed-off-by: Steve Sistare 
> >> ---
> >>  drivers/vhost/vdpa.c | 4 
> >>  1 file changed, 4 insertions(+)
> >>
> >> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> >> index 8fe1562d24af..9673e8e20d11 100644
> >> --- a/drivers/vhost/vdpa.c
> >> +++ b/drivers/vhost/vdpa.c
> >> @@ -591,10 +591,14 @@ static long vhost_vdpa_suspend(struct vhost_vdpa *v)
> >>  {
> >> struct vdpa_device *vdpa = v->vdpa;
> >> const struct vdpa_config_ops *ops = vdpa->config;
> >> +   struct vhost_dev *vdev = &v->vdev;
> >>
> >> if (!ops->suspend)
> >> return -EOPNOTSUPP;
> >>
> >> +   if (vdev->use_worker)
> >> +   vhost_dev_flush(vdev);
> >
> > It looks to me like it's better to check use_worker in vhost_dev_flush.
> >
>
> You can now just call vhost_dev_flush and it will do the right thing.
> The xa_for_each loop will only flush workers if they have been setup,
> so for vdpa it will not find/flush anything.

Right.

Thanks

>
>
>




Re: [PATCH] driver/virtio: Add Memory Balloon Support for SEV/SEV-ES

2024-01-10 Thread Jason Wang
On Wed, Jan 10, 2024 at 2:23 PM Zheyun Shen  wrote:
>
> For now, SEV pins guest's memory to avoid swapping or
> moving ciphertext, but leading to the inhibition of
> Memory Ballooning.
>
> In Memory Ballooning, only guest's free pages will be relocated
> in balloon inflation and deflation, so the difference of plaintext
> doesn't matter to guest.

This seems only true if the page is zeroed, is this true here?

Thanks




Re: [PATCH v7 2/3] vduse: Temporarily fail if control queue feature requested

2024-01-10 Thread Jason Wang
On Tue, Jan 9, 2024 at 7:10 PM Maxime Coquelin
 wrote:
>
> Virtio-net driver control queue implementation is not safe
> when used with VDUSE. If the VDUSE application does not
> reply to control queue messages, it currently ends up
> hanging the kernel thread sending this command.
>
> Some work is on-going to make the control queue
> implementation robust with VDUSE. Until it is completed,
> let's fail features check if control-queue feature is
> requested.
>
> Signed-off-by: Maxime Coquelin 

Acked-by: Jason Wang 

Thanks

> ---
>  drivers/vdpa/vdpa_user/vduse_dev.c | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> b/drivers/vdpa/vdpa_user/vduse_dev.c
> index a5af6d4077b8..00f3f562ab5d 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -8,6 +8,7 @@
>   *
>   */
>
> +#include "linux/virtio_net.h"
>  #include 
>  #include 
>  #include 
> @@ -28,6 +29,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>
>  #include "iova_domain.h"
> @@ -1680,6 +1682,9 @@ static bool features_is_valid(struct vduse_dev_config 
> *config)
> if ((config->device_id == VIRTIO_ID_BLOCK) &&
> (config->features & BIT_ULL(VIRTIO_BLK_F_CONFIG_WCE)))
> return false;
> +   else if ((config->device_id == VIRTIO_ID_NET) &&
> +   (config->features & BIT_ULL(VIRTIO_NET_F_CTRL_VQ)))
> +   return false;
>
> return true;
>  }
> --
> 2.43.0
>




Re: [RFC V1 08/13] vduse: flush workers on suspend

2024-01-10 Thread Jason Wang
On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare  wrote:
>
> To pass ownership of a live vdpa device to a new process, the user
> suspends the device, calls VHOST_NEW_OWNER to change the mm, and calls
> VHOST_IOTLB_REMAP to change the user virtual addresses to match the new
> mm.  Flush workers in suspend to guarantee that no worker sees the new
> mm and old VA in between.
>
> Signed-off-by: Steve Sistare 

It seems we need a better title, probably "suspend support for vduse"?
And it looks better to be a separate patch.

Thanks




Re: [RFC V1 07/13] vhost-vdpa: flush workers on suspend

2024-01-10 Thread Jason Wang
On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare  wrote:
>
> To pass ownership of a live vdpa device to a new process, the user
> suspends the device, calls VHOST_NEW_OWNER to change the mm, and calls
> VHOST_IOTLB_REMAP to change the user virtual addresses to match the new
> mm.  Flush workers in suspend to guarantee that no worker sees the new
> mm and old VA in between.
>
> Signed-off-by: Steve Sistare 
> ---
>  drivers/vhost/vdpa.c | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index 8fe1562d24af..9673e8e20d11 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -591,10 +591,14 @@ static long vhost_vdpa_suspend(struct vhost_vdpa *v)
>  {
> struct vdpa_device *vdpa = v->vdpa;
> const struct vdpa_config_ops *ops = vdpa->config;
> +   struct vhost_dev *vdev = &v->vdev;
>
> if (!ops->suspend)
> return -EOPNOTSUPP;
>
> +   if (vdev->use_worker)
> +   vhost_dev_flush(vdev);

It looks to me like it's better to check use_worker in vhost_dev_flush.

Thanks


> +
> return ops->suspend(vdpa);
>  }
>
> --
> 2.39.3
>




Re: [RFC V1 05/13] vhost-vdpa: VHOST_IOTLB_REMAP

2024-01-10 Thread Jason Wang
On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare  wrote:
>
> When device ownership is passed to a new process via VHOST_NEW_OWNER,
> some devices need to know the new userland addresses of the dma mappings.
> Define the new iotlb message type VHOST_IOTLB_REMAP to update the uaddr
> of a mapping.  The new uaddr must address the same memory object as
> originally mapped.
>
> The user must suspend the device before the old address is invalidated,
> and cannot resume it until after VHOST_IOTLB_REMAP is called, but this
> requirement is not enforced by the API.
>
> Signed-off-by: Steve Sistare 
> ---
>  drivers/vhost/vdpa.c | 34 
>  include/uapi/linux/vhost_types.h | 11 ++-
>  2 files changed, 44 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index faed6471934a..ec5ca20bd47d 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -1219,6 +1219,37 @@ static int vhost_vdpa_pa_map(struct vhost_vdpa *v,
>
>  }
>
> +static int vhost_vdpa_process_iotlb_remap(struct vhost_vdpa *v,
> + struct vhost_iotlb *iotlb,
> + struct vhost_iotlb_msg *msg)
> +{
> +   struct vdpa_device *vdpa = v->vdpa;
> +   const struct vdpa_config_ops *ops = vdpa->config;
> +   u32 asid = iotlb_to_asid(iotlb);
> +   u64 start = msg->iova;
> +   u64 last = start + msg->size - 1;
> +   struct vhost_iotlb_map *map;
> +   int r = 0;
> +
> +   if (msg->perm || !msg->size)
> +   return -EINVAL;
> +
> +   map = vhost_iotlb_itree_first(iotlb, start, last);
> +   if (!map)
> +   return -ENOENT;
> +
> +   if (map->start != start || map->last != last)
> +   return -EINVAL;
> +
> +   /* batch will finish with remap.  non-batch must do it now. */
> +   if (!v->in_batch)
> +   r = ops->set_map(vdpa, asid, iotlb);
> +   if (!r)
> +   map->addr = msg->uaddr;

I may miss something, for example for PA mapping,

1) need to convert uaddr into phys addr
2) need to check whether the uaddr is backed by the same page or not?

Thanks

> +
> +   return r;
> +}
> +
>  static int vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,
>struct vhost_iotlb *iotlb,
>struct vhost_iotlb_msg *msg)
> @@ -1298,6 +1329,9 @@ static int vhost_vdpa_process_iotlb_msg(struct 
> vhost_dev *dev, u32 asid,
> ops->set_map(vdpa, asid, iotlb);
> v->in_batch = false;
> break;
> +   case VHOST_IOTLB_REMAP:
> +   r = vhost_vdpa_process_iotlb_remap(v, iotlb, msg);
> +   break;
> default:
> r = -EINVAL;
> break;
> diff --git a/include/uapi/linux/vhost_types.h 
> b/include/uapi/linux/vhost_types.h
> index 9177843951e9..35908315ff55 100644
> --- a/include/uapi/linux/vhost_types.h
> +++ b/include/uapi/linux/vhost_types.h
> @@ -79,7 +79,7 @@ struct vhost_iotlb_msg {
>  /*
>   * VHOST_IOTLB_BATCH_BEGIN and VHOST_IOTLB_BATCH_END allow modifying
>   * multiple mappings in one go: beginning with
> - * VHOST_IOTLB_BATCH_BEGIN, followed by any number of
> + * VHOST_IOTLB_BATCH_BEGIN, followed by any number of VHOST_IOTLB_REMAP or
>   * VHOST_IOTLB_UPDATE messages, and ending with VHOST_IOTLB_BATCH_END.
>   * When one of these two values is used as the message type, the rest
>   * of the fields in the message are ignored. There's no guarantee that
> @@ -87,6 +87,15 @@ struct vhost_iotlb_msg {
>   */
>  #define VHOST_IOTLB_BATCH_BEGIN5
>  #define VHOST_IOTLB_BATCH_END  6
> +
> +/*
> + * VHOST_IOTLB_REMAP registers a new uaddr for the existing mapping at iova.
> + * The new uaddr must address the same memory object as originally mapped.
> + * Failure to do so will result in user memory corruption and/or device
> + * misbehavior.  iova and size must match the arguments used to create the
> + * existing mapping.  Protection is not changed, and perm must be 0.
> + */
> +#define VHOST_IOTLB_REMAP  7
> __u8 type;
>  };
>
> --
> 2.39.3
>




Re: [RFC V1 00/13] vdpa live update

2024-01-10 Thread Jason Wang
On Thu, Jan 11, 2024 at 4:40 AM Steve Sistare  wrote:
>
> Live update is a technique wherein an application saves its state, exec's
> to an updated version of itself, and restores its state.  Clients of the
> application experience a brief suspension of service, on the order of
> 100's of milliseconds, but are otherwise unaffected.
>
> Define and implement interfaces that allow vdpa devices to be preserved
> across fork or exec, to support live update for applications such as qemu.
> The device must be suspended during the update, but its dma mappings are
> preserved, so the suspension is brief.
>
> The VHOST_NEW_OWNER ioctl transfers device ownership and pinned memory
> accounting from one process to another.
>
> The VHOST_BACKEND_F_NEW_OWNER backend capability indicates that
> VHOST_NEW_OWNER is supported.
>
> The VHOST_IOTLB_REMAP message type updates a dma mapping with its userland
> address in the new process.
>
> The VHOST_BACKEND_F_IOTLB_REMAP backend capability indicates that
> VHOST_IOTLB_REMAP is supported and required.  Some devices do not
> require it, because the userland address of each dma mapping is discarded
> after being translated to a physical address.
>
> Here is a pseudo-code sequence for performing live update, based on
> suspend + reset because resume is not yet available.  The vdpa device
> descriptor, fd, remains open across the exec.
>
>   ioctl(fd, VHOST_VDPA_SUSPEND)
>   ioctl(fd, VHOST_VDPA_SET_STATUS, 0)
>   exec

Is there a userspace implementation as a reference?

>
>   ioctl(fd, VHOST_NEW_OWNER)
>
>   issue ioctls to re-create vrings
>
>   if VHOST_BACKEND_F_IOTLB_REMAP
>   foreach dma mapping
>   write(fd, {VHOST_IOTLB_REMAP, new_addr})

I think I need to understand the advantages of this approach. For
example, why it is better than

ioctl(VHOST_RESET_OWNER)
exec

ioctl(VHOST_SET_OWNER)

for each dma mapping
 ioctl(VHOST_IOTLB_UPDATE)

Thanks

>
>   ioctl(fd, VHOST_VDPA_SET_STATUS,
> ACKNOWLEDGE | DRIVER | FEATURES_OK | DRIVER_OK)
>
>
> Steve Sistare (13):
>   vhost-vdpa: count pinned memory
>   vhost-vdpa: pass mm to bind
>   vhost-vdpa: VHOST_NEW_OWNER
>   vhost-vdpa: VHOST_BACKEND_F_NEW_OWNER
>   vhost-vdpa: VHOST_IOTLB_REMAP
>   vhost-vdpa: VHOST_BACKEND_F_IOTLB_REMAP
>   vhost-vdpa: flush workers on suspend
>   vduse: flush workers on suspend
>   vdpa_sim: reset must not run
>   vdpa_sim: flush workers on suspend
>   vdpa/mlx5: new owner capability
>   vdpa_sim: new owner capability
>   vduse: new owner capability
>
>  drivers/vdpa/mlx5/net/mlx5_vnet.c  |   3 +-
>  drivers/vdpa/vdpa_sim/vdpa_sim.c   |  24 ++-
>  drivers/vdpa/vdpa_user/vduse_dev.c |  32 +
>  drivers/vhost/vdpa.c   | 101 +++--
>  drivers/vhost/vhost.c  |  15 +
>  drivers/vhost/vhost.h  |   1 +
>  include/uapi/linux/vhost.h |  10 +++
>  include/uapi/linux/vhost_types.h   |  15 -
>  8 files changed, 191 insertions(+), 10 deletions(-)
>
> --
> 2.39.3
>




Re: [PATCH v6 2/3] vduse: Temporarily fail if control queue features requested

2024-01-07 Thread Jason Wang
On Fri, Jan 5, 2024 at 6:14 PM Maxime Coquelin
 wrote:
>
>
>
> On 1/5/24 10:59, Eugenio Perez Martin wrote:
> > On Fri, Jan 5, 2024 at 9:12 AM Maxime Coquelin
> >  wrote:
> >>
> >>
> >>
> >> On 1/5/24 03:45, Jason Wang wrote:
> >>> On Thu, Jan 4, 2024 at 11:38 PM Maxime Coquelin
> >>>  wrote:
> >>>>
> >>>> Virtio-net driver control queue implementation is not safe
> >>>> when used with VDUSE. If the VDUSE application does not
> >>>> reply to control queue messages, it currently ends up
> >>>> hanging the kernel thread sending this command.
> >>>>
> >>>> Some work is on-going to make the control queue
> >>>> implementation robust with VDUSE. Until it is completed,
> >>>> let's fail features check if any control-queue related
> >>>> feature is requested.
> >>>>
> >>>> Signed-off-by: Maxime Coquelin 
> >>>> ---
> >>>>drivers/vdpa/vdpa_user/vduse_dev.c | 13 +
> >>>>1 file changed, 13 insertions(+)
> >>>>
> >>>> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> >>>> b/drivers/vdpa/vdpa_user/vduse_dev.c
> >>>> index 0486ff672408..94f54ea2eb06 100644
> >>>> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> >>>> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> >>>> @@ -28,6 +28,7 @@
> >>>>#include 
> >>>>#include 
> >>>>#include 
> >>>> +#include 
> >>>>#include 
> >>>>
> >>>>#include "iova_domain.h"
> >>>> @@ -46,6 +47,15 @@
> >>>>
> >>>>#define IRQ_UNBOUND -1
> >>>>
> >>>> +#define VDUSE_NET_INVALID_FEATURES_MASK \
> >>>> +   (BIT_ULL(VIRTIO_NET_F_CTRL_VQ) |\
> >>>> +BIT_ULL(VIRTIO_NET_F_CTRL_RX)   |  \
> >>>> +BIT_ULL(VIRTIO_NET_F_CTRL_VLAN) |  \
> >>>> +BIT_ULL(VIRTIO_NET_F_GUEST_ANNOUNCE) | \
> >>>> +BIT_ULL(VIRTIO_NET_F_MQ) | \
> >>>> +BIT_ULL(VIRTIO_NET_F_CTRL_MAC_ADDR) |  \
> >>>> +BIT_ULL(VIRTIO_NET_F_RSS))
> >>>
> >>> We need to make this as well:
> >>>
> >>> VIRTIO_NET_F_CTRL_GUEST_OFFLOADS
> >>
> >> I missed it, and see others have been added in the Virtio spec
> >> repository (BTW, I see this specific one is missing in the dependency
> >> list [0], I will submit a patch).
> >>
> >> I wonder if it is not just simpler to just check for
> >> VIRTIO_NET_F_CTRL_VQ is requested. As we fail instead of masking out,
> >> the VDUSE driver won't be the one violating the spec so it should be
> >> good?
> >>
> >> It will avoid having to update the mask if new features depending on it
> >> are added (or forgetting to update it).
> >>
> >> WDYT?
> >>
> >
> > I think it is safer to work with a whitelist, instead of a blacklist.
> > As any new feature might require code changes in QEMU. Is that
> > possible?
>
> Well, that's how it was done in previous revision. :)
>
> I changed to a blacklist for consistency with block device's WCE feature
> check after Jason's comment.
>
> I'm not sure moving back to a whitelist brings much advantages when
> compared to the effort of keeping it up to date. Just blacklisting
> VIRTIO_NET_F_CTRL_VQ is enough in my opinion.

I think this makes sense.

Thanks

>
> Thanks,
> Maxime
>
> >> Thanks,
> >> Maxime
> >>
> >> [0]:
> >> https://github.com/oasis-tcs/virtio-spec/blob/5fc35a7efb903fc352da81a6d2be5c01810b68d3/device-types/net/description.tex#L129
> >>> Other than this,
> >>>
> >>> Acked-by: Jason Wang 
> >>>
> >>> Thanks
> >>>
> >>>> +
> >>>>struct vduse_virtqueue {
> >>>>   u16 index;
> >>>>   u16 num_max;
> >>>> @@ -1680,6 +1690,9 @@ static bool features_is_valid(struct 
> >>>> vduse_dev_config *config)
> >>>>   if ((config->device_id == VIRTIO_ID_BLOCK) &&
> >>>>   (config->features & (1ULL << 
> >>>> VIRTIO_BLK_F_CONFIG_WCE)))
> >>>>   return false;
> >>>> +   else if ((config->device_id == VIRTIO_ID_NET) &&
> >>>> +   (config->features & 
> >>>> VDUSE_NET_INVALID_FEATURES_MASK))
> >>>> +   return false;
> >>>>
> >>>>   return true;
> >>>>}
> >>>> --
> >>>> 2.43.0
> >>>>
> >>>
> >>
> >>
> >
>




Re: [PATCH v6 3/3] vduse: enable Virtio-net device type

2024-01-04 Thread Jason Wang
On Thu, Jan 4, 2024 at 11:38 PM Maxime Coquelin
 wrote:
>
> This patch adds Virtio-net device type to the supported
> devices types.
>
> Initialization fails if the device does not support
> VIRTIO_F_VERSION_1 feature, in order to guarantee the
> configuration space is read-only. It also fails with
> -EPERM if the CAP_NET_ADMIN is missing.
>
> Signed-off-by: Maxime Coquelin 
> ---

Acked-by: Jason Wang 

Thanks




Re: [PATCH v6 2/3] vduse: Temporarily fail if control queue features requested

2024-01-04 Thread Jason Wang
On Thu, Jan 4, 2024 at 11:38 PM Maxime Coquelin
 wrote:
>
> Virtio-net driver control queue implementation is not safe
> when used with VDUSE. If the VDUSE application does not
> reply to control queue messages, it currently ends up
> hanging the kernel thread sending this command.
>
> Some work is on-going to make the control queue
> implementation robust with VDUSE. Until it is completed,
> let's fail features check if any control-queue related
> feature is requested.
>
> Signed-off-by: Maxime Coquelin 
> ---
>  drivers/vdpa/vdpa_user/vduse_dev.c | 13 +
>  1 file changed, 13 insertions(+)
>
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> b/drivers/vdpa/vdpa_user/vduse_dev.c
> index 0486ff672408..94f54ea2eb06 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -28,6 +28,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>
>  #include "iova_domain.h"
> @@ -46,6 +47,15 @@
>
>  #define IRQ_UNBOUND -1
>
> +#define VDUSE_NET_INVALID_FEATURES_MASK \
> +   (BIT_ULL(VIRTIO_NET_F_CTRL_VQ) |\
> +BIT_ULL(VIRTIO_NET_F_CTRL_RX)   |  \
> +BIT_ULL(VIRTIO_NET_F_CTRL_VLAN) |  \
> +BIT_ULL(VIRTIO_NET_F_GUEST_ANNOUNCE) | \
> +BIT_ULL(VIRTIO_NET_F_MQ) | \
> +BIT_ULL(VIRTIO_NET_F_CTRL_MAC_ADDR) |  \
> +    BIT_ULL(VIRTIO_NET_F_RSS))

We need to make this as well:

VIRTIO_NET_F_CTRL_GUEST_OFFLOADS

Other than this,

Acked-by: Jason Wang 

Thanks

> +
>  struct vduse_virtqueue {
> u16 index;
> u16 num_max;
> @@ -1680,6 +1690,9 @@ static bool features_is_valid(struct vduse_dev_config 
> *config)
> if ((config->device_id == VIRTIO_ID_BLOCK) &&
> (config->features & (1ULL << 
> VIRTIO_BLK_F_CONFIG_WCE)))
> return false;
> +   else if ((config->device_id == VIRTIO_ID_NET) &&
> +   (config->features & VDUSE_NET_INVALID_FEATURES_MASK))
> +   return false;
>
> return true;
>  }
> --
> 2.43.0
>




Re: [PATCH v4] virtio_pmem: support feature SHMEM_REGION

2023-12-26 Thread Jason Wang
On Thu, Dec 21, 2023 at 4:49 AM Changyuan Lyu  wrote:
>
> Thanks Michael for the feedback!
>
> On Tue, Dec 19, 2023 at 11:44 PM Michael S. Tsirkin  wrote:
> >
> > > On Tue, Dec 19, 2023 at 11:32:27PM -0800, Changyuan Lyu wrote:
> > >
> > > +   if (!have_shm) {
> > > +   dev_err(&vdev->dev, "failed to get shared memory 
> > > region %d\n",
> > > +   VIRTIO_PMEM_SHMEM_REGION_ID);
> > > +   err = -ENXIO;
> > > +   goto out_vq;
> > > +   }
> >
> > Maybe additionally, add a validate callback and clear
> > VIRTIO_PMEM_F_SHMEM_REGION if VIRTIO_PMEM_SHMEM_REGION_ID is not there.
>
> Done.
>
> > > +/* Feature bits */
> > > +#define VIRTIO_PMEM_F_SHMEM_REGION 0   /* guest physical address 
> > > range will be
> > > +* indicated as shared memory region 0
> > > +*/
> >
> > Either make this comment shorter to fit in one line, or put the
> > multi-line comment before the define.
>
> Done.
>
> ---8<---
>
> This patch adds the support for feature VIRTIO_PMEM_F_SHMEM_REGION
> (virtio spec v1.2 section 5.19.5.2 [1]).
>
> During feature negotiation, if VIRTIO_PMEM_F_SHMEM_REGION is offered
> by the device, the driver looks for a shared memory region of id 0.
> If it is found, this feature is understood. Otherwise, this feature
> bit is cleared.
>
> During probe, if VIRTIO_PMEM_F_SHMEM_REGION has been negotiated,
> virtio pmem ignores the `start` and `size` fields in device config
> and uses the physical address range of shared memory region 0.
>
> [1] 
> https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html#x1-6480002
>
> Signed-off-by: Changyuan Lyu 

Acked-by: Jason Wang 

Thanks

> ---
> v4:
>   * added virtio_pmem_validate callback.
> v3:
>   * updated the patch description.
> V2:
>   * renamed VIRTIO_PMEM_SHMCAP_ID to VIRTIO_PMEM_SHMEM_REGION_ID
>   * fixed the error handling when region 0 does not exist
> ---
>  drivers/nvdimm/virtio_pmem.c | 36 
>  include/uapi/linux/virtio_pmem.h |  7 +++
>  2 files changed, 39 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
> index a92eb172f0e7..4ceced5cefcf 100644
> --- a/drivers/nvdimm/virtio_pmem.c
> +++ b/drivers/nvdimm/virtio_pmem.c
> @@ -29,12 +29,27 @@ static int init_vq(struct virtio_pmem *vpmem)
> return 0;
>  };
>
> +static int virtio_pmem_validate(struct virtio_device *vdev)
> +{
> +   struct virtio_shm_region shm_reg;
> +
> +   if (virtio_has_feature(vdev, VIRTIO_PMEM_F_SHMEM_REGION) &&
> +   !virtio_get_shm_region(vdev, &shm_reg, 
> (u8)VIRTIO_PMEM_SHMEM_REGION_ID)
> +   ) {
> +   dev_notice(&vdev->dev, "failed to get shared memory region 
> %d\n",
> +   VIRTIO_PMEM_SHMEM_REGION_ID);
> +   __virtio_clear_bit(vdev, VIRTIO_PMEM_F_SHMEM_REGION);
> +   }
> +   return 0;
> +}
> +
>  static int virtio_pmem_probe(struct virtio_device *vdev)
>  {
> struct nd_region_desc ndr_desc = {};
> struct nd_region *nd_region;
> struct virtio_pmem *vpmem;
> struct resource res;
> +   struct virtio_shm_region shm_reg;
> int err = 0;
>
> if (!vdev->config->get) {
> @@ -57,10 +72,16 @@ static int virtio_pmem_probe(struct virtio_device *vdev)
> goto out_err;
> }
>
> -   virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
> -   start, &vpmem->start);
> -   virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
> -   size, &vpmem->size);
> +   if (virtio_has_feature(vdev, VIRTIO_PMEM_F_SHMEM_REGION)) {
> +   virtio_get_shm_region(vdev, &shm_reg, 
> (u8)VIRTIO_PMEM_SHMEM_REGION_ID);
> +   vpmem->start = shm_reg.addr;
> +   vpmem->size = shm_reg.len;
> +   } else {
> +   virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
> +   start, &vpmem->start);
> +   virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
> +   size, &vpmem->size);
> +   }
>
> res.start = vpmem->start;
> res.end   = vpmem->start + vpmem->size - 1;
> @@ -122,10 +143,17 @@ static void virtio_pmem_remove(struct virtio_device 
> *vdev)
> virti

Re: [PATCH vhost v4 06/15] vdpa: Track device suspended state

2023-12-24 Thread Jason Wang
On Fri, Dec 22, 2023 at 7:22 PM Dragos Tatulea  wrote:
>
> On Wed, 2023-12-20 at 13:55 +0100, Dragos Tatulea wrote:
> > On Wed, 2023-12-20 at 11:46 +0800, Jason Wang wrote:
> > > On Wed, Dec 20, 2023 at 2:09 AM Dragos Tatulea  
> > > wrote:
> > > >
> > > > Set vdpa device suspended state on successful suspend. Clear it on
> > > > successful resume and reset.
> > > >
> > > > The state will be locked by the vhost_vdpa mutex. The mutex is taken
> > > > during suspend, resume and reset in vhost_vdpa_unlocked_ioctl. The
> > > > exception is vhost_vdpa_open which does a device reset but that should
> > > > be safe because it can only happen before the other ops.
> > > >
> > > > Signed-off-by: Dragos Tatulea 
> > > > Suggested-by: Eugenio Pérez 
> > > > ---
> > > >  drivers/vhost/vdpa.c | 17 +++--
> > > >  1 file changed, 15 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> > > > index b4e8ddf86485..00b4fa8e89f2 100644
> > > > --- a/drivers/vhost/vdpa.c
> > > > +++ b/drivers/vhost/vdpa.c
> > > > @@ -59,6 +59,7 @@ struct vhost_vdpa {
> > > > int in_batch;
> > > > struct vdpa_iova_range range;
> > > > u32 batch_asid;
> > > > +   bool suspended;
> > >
> > > Any reason why we don't do it in the core vDPA device but here?
> > >
> > Not really. I wanted to be safe and not expose it in a header due to 
> > locking.
> >
> A few clearer answers for why the state is not added in struct vdpa_device:
> - All the suspend infrastructure is currently only for vhost.
> - If the state would be moved to struct vdpa_device then the cf_lock would 
> have
> to be used. This adds more complexity to the code.
>
> Thanks,
> Dragos

Ok, I'm fine with that.

Thanks




Re: [PATCH vhost v4 02/15] vdpa: Add VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND flag

2023-12-21 Thread Jason Wang
On Thu, Dec 21, 2023 at 3:47 PM Eugenio Perez Martin
 wrote:
>
> On Thu, Dec 21, 2023 at 3:03 AM Jason Wang  wrote:
> >
> > On Wed, Dec 20, 2023 at 9:32 PM Eugenio Perez Martin
> >  wrote:
> > >
> > > On Wed, Dec 20, 2023 at 5:06 AM Jason Wang  wrote:
> > > >
> > > > On Wed, Dec 20, 2023 at 11:46 AM Jason Wang  wrote:
> > > > >
> > > > > On Wed, Dec 20, 2023 at 2:09 AM Dragos Tatulea  
> > > > > wrote:
> > > > > >
> > > > > > The virtio spec doesn't allow changing virtqueue addresses after
> > > > > > DRIVER_OK. Some devices do support this operation when the device is
> > > > > > suspended. The VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND flag
> > > > > > advertises this support as a backend features.
> > > > >
> > > > > There's an ongoing effort in virtio spec to introduce the suspend 
> > > > > state.
> > > > >
> > > > > So I wonder if it's better to just allow such behaviour?
> > > >
> > > > Actually I mean, allow drivers to modify the parameters during suspend
> > > > without a new feature.
> > > >
> > >
> > > That would be ideal, but how do userland checks if it can suspend +
> > > change properties + resume?
> >
> > As discussed, it looks to me the only device that supports suspend is
> > simulator and it supports change properties.
> >
> > E.g:
> >
> > static int vdpasim_set_vq_address(struct vdpa_device *vdpa, u16 idx,
> >   u64 desc_area, u64 driver_area,
> >   u64 device_area)
> > {
> > struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
> > struct vdpasim_virtqueue *vq = &vdpasim->vqs[idx];
> >
> > vq->desc_addr = desc_area;
> > vq->driver_addr = driver_area;
> > vq->device_addr = device_area;
> >
> > return 0;
> > }
> >
>
> So in the current kernel master it is valid to set a different vq
> address while the device is suspended in vdpa_sim. But it is not valid
> in mlx5, as the FW will not be updated in resume (Dragos, please
> correct me if I'm wrong). Both of them return success.
>
> How can we know in the destination QEMU if it is valid to suspend &
> set address? Should we handle this as a bugfix and backport the
> change?

Good point.

We probably need to do backport, this seems to be the easiest way.
Theoretically, userspace may assume this behavior (though I don't
think there would be a user that depends on the simulator).

>
> > >
> > > The only way that comes to my mind is to make sure all parents return
> > > error if userland tries to do it, and then fallback in userland.
> >
> > Yes.
> >
> > > I'm
> > > ok with that, but I'm not sure if the current master & previous kernel
> > > has a coherent behavior. Do they return error? Or return success
> > > without changing address / vq state?
> >
> > We probably don't need to worry too much here, as e.g set_vq_address
> > could fail even without suspend (just at uAPI level).
> >
>
> I don't get this, sorry. I rephrased my point with an example earlier
> in the mail.

I mean currently, VHOST_SET_VRING_ADDR can fail. So userspace should
not assume it will always succeed.

Thanks

>




Re: [PATCH net-next 6/6] tools: virtio: introduce vhost_net_test

2023-12-20 Thread Jason Wang
On Thu, Dec 21, 2023 at 10:48 AM Yunsheng Lin  wrote:
>
> On 2023/12/21 10:33, Jason Wang wrote:
> > On Wed, Dec 20, 2023 at 8:45 PM Yunsheng Lin  wrote:
> >>
> >> On 2023/12/12 12:35, Jason Wang wrote:>>>> +done:
> >>>>>> +   backend.fd = tun_alloc();
> >>>>>> +   assert(backend.fd >= 0);
> >>>>>> +   vdev_info_init(, features);
> >>>>>> +   vq_info_add(, 256);
> >>>>>> +   run_test(, [0], delayed, batch, reset, nbufs);
> >>>>>
> >>>>> I'd expect we are testing some basic traffic here. E.g can we use a
> >>>>> packet socket then we can test both tx and rx?
> >>>>
> >>>> Yes, only rx for tun is tested.
> >>>> Do you have an idea how to test the tx too? As I am not familiar enough
> >>>> with vhost_net and tun yet.
> >>>
> >>> Maybe you can have a packet socket to bind to the tun/tap. Then you can 
> >>> test:
> >>>
> >>> 1) TAP RX: by write a packet via virtqueue through vhost_net and read
> >>> it from packet socket
> >>> 2) TAP TX:  by write via packet socket and read it from the virtqueue
> >>> through vhost_net
> >>
> >> When implementing the TAP TX by adding VHOST_NET_F_VIRTIO_NET_HDR,
> >> I found one possible use of uninitialized data in vhost_net_build_xdp().
> >>
> >> And vhost_hlen is set to sizeof(struct virtio_net_hdr_mrg_rxbuf) and
> >> sock_hlen is set to zero in vhost_net_set_features() for both tx and rx
> >> queue.
> >>
> >> For vhost_net_build_xdp() called by handle_tx_copy():
> >>
> >> The (gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) checking below may cause a
> >> read of uninitialized data if sock_hlen is zero.
> >
> > Which data is uninitialized here?
>
> The 'gso', as the sock_hlen is zero, there is no copying for:
>
>  copied = copy_page_from_iter(alloc_frag->page,
>   alloc_frag->offset +
>   offsetof(struct tun_xdp_hdr, gso),
>   sock_hlen, from);

I think you're right. This is something we need to fix.

Or we can drop VHOST_NET_F_VIRTIO_NET_HDR as we managed to survive for years:

https://patchwork.ozlabs.org/project/netdev/patch/1528429842-22835-1-git-send-email-jasow...@redhat.com/#1930760

>
> >
> >>
> >> And it seems vhost_hdr is skipped in get_tx_bufs():
> >> https://elixir.bootlin.com/linux/latest/source/drivers/vhost/net.c#L616
> >>
> >> static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
> >>struct iov_iter *from)
> >> {
> >> ...
> >> buflen += SKB_DATA_ALIGN(len + pad);
> >> alloc_frag->offset = ALIGN((u64)alloc_frag->offset, 
> >> SMP_CACHE_BYTES);
> >> if (unlikely(!vhost_net_page_frag_refill(net, buflen,
> >>  alloc_frag, GFP_KERNEL)))
> >> return -ENOMEM;
> >>
> >> buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
> >> copied = copy_page_from_iter(alloc_frag->page,
> >>  alloc_frag->offset +
> >>  offsetof(struct tun_xdp_hdr, gso),
> >>  sock_hlen, from);
> >> if (copied != sock_hlen)
> >> return -EFAULT;
> >>
> >> hdr = buf;
> >> gso = &hdr->gso;
> >>
> >> if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
> >> vhost16_to_cpu(vq, gso->csum_start) +
> >> vhost16_to_cpu(vq, gso->csum_offset) + 2 >
> >> vhost16_to_cpu(vq, gso->hdr_len)) {
> >> ...
> >> }
> >>
> >> It seems the handle_tx_copy() does not handle the VHOST_NET_F_VIRTIO_NET_HDR
> >> case correctly, Or do I miss something obvious here?
> >
> > In get_tx_bufs() we did:
> >
> > *len = init_iov_iter(vq, &msg->msg_iter, nvq->vhost_hlen, *out);
> >
> > Which covers this case?
>
> It does not seems to cover it, as the vhost_hdr is just skipped without any
> handling in get_tx_bufs():
> https://elixir.bootlin.com/linux/v6.7-rc6/source/drivers/vhost/net.c#L616

My understanding is that in this case vhost can't do more than this as
the socket doesn't know vnet_hdr.

Let's see if Michael is ok with this.

Thanks

>
> >
> > Thanks
>




Re: [PATCH net-next 6/6] tools: virtio: introduce vhost_net_test

2023-12-20 Thread Jason Wang
On Wed, Dec 20, 2023 at 8:45 PM Yunsheng Lin  wrote:
>
> On 2023/12/12 12:35, Jason Wang wrote:>>>> +done:
> >>>> +   backend.fd = tun_alloc();
> >>>> +   assert(backend.fd >= 0);
> >>>> +   vdev_info_init(&dev, features);
> >>>> +   vq_info_add(&dev, 256);
> >>>> +   run_test(&dev, &dev.vqs[0], delayed, batch, reset, nbufs);
> >>>
> >>> I'd expect we are testing some basic traffic here. E.g can we use a
> >>> packet socket then we can test both tx and rx?
> >>
> >> Yes, only rx for tun is tested.
> >> Do you have an idea how to test the tx too? As I am not familiar enough
> >> with vhost_net and tun yet.
> >
> > Maybe you can have a packet socket to bind to the tun/tap. Then you can 
> > test:
> >
> > 1) TAP RX: by write a packet via virtqueue through vhost_net and read
> > it from packet socket
> > 2) TAP TX:  by write via packet socket and read it from the virtqueue
> > through vhost_net
>
> When implementing the TAP TX by adding VHOST_NET_F_VIRTIO_NET_HDR,
> I found one possible use of uninitialized data in vhost_net_build_xdp().
>
> And vhost_hlen is set to sizeof(struct virtio_net_hdr_mrg_rxbuf) and
> sock_hlen is set to zero in vhost_net_set_features() for both tx and rx
> queue.
>
> For vhost_net_build_xdp() called by handle_tx_copy():
>
> The (gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) checking below may cause a
> read of uninitialized data if sock_hlen is zero.

Which data is uninitialized here?

>
> And it seems vhost_hdr is skipped in get_tx_bufs():
> https://elixir.bootlin.com/linux/latest/source/drivers/vhost/net.c#L616
>
> static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
>struct iov_iter *from)
> {
> ...
> buflen += SKB_DATA_ALIGN(len + pad);
> alloc_frag->offset = ALIGN((u64)alloc_frag->offset, SMP_CACHE_BYTES);
> if (unlikely(!vhost_net_page_frag_refill(net, buflen,
>  alloc_frag, GFP_KERNEL)))
> return -ENOMEM;
>
> buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
> copied = copy_page_from_iter(alloc_frag->page,
>  alloc_frag->offset +
>  offsetof(struct tun_xdp_hdr, gso),
>  sock_hlen, from);
> if (copied != sock_hlen)
> return -EFAULT;
>
> hdr = buf;
> gso = &hdr->gso;
>
> if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
> vhost16_to_cpu(vq, gso->csum_start) +
> vhost16_to_cpu(vq, gso->csum_offset) + 2 >
> vhost16_to_cpu(vq, gso->hdr_len)) {
> ...
> }
>
> It seems the handle_tx_copy() does not handle the VHOST_NET_F_VIRTIO_NET_HDR
> case correctly, Or do I miss something obvious here?

In get_tx_bufs() we did:

*len = init_iov_iter(vq, &msg->msg_iter, nvq->vhost_hlen, *out);

Which covers this case?

Thanks

>
> >
> > Thanks
> >
> >>
> >>>
> >>> Thanks
> >>
> >
> > .
> >
>




Re: [PATCH vhost v4 02/15] vdpa: Add VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND flag

2023-12-20 Thread Jason Wang
On Wed, Dec 20, 2023 at 9:32 PM Eugenio Perez Martin
 wrote:
>
> On Wed, Dec 20, 2023 at 5:06 AM Jason Wang  wrote:
> >
> > On Wed, Dec 20, 2023 at 11:46 AM Jason Wang  wrote:
> > >
> > > On Wed, Dec 20, 2023 at 2:09 AM Dragos Tatulea  
> > > wrote:
> > > >
> > > > The virtio spec doesn't allow changing virtqueue addresses after
> > > > DRIVER_OK. Some devices do support this operation when the device is
> > > > suspended. The VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND flag
> > > > advertises this support as a backend features.
> > >
> > > There's an ongoing effort in virtio spec to introduce the suspend state.
> > >
> > > So I wonder if it's better to just allow such behaviour?
> >
> > Actually I mean, allow drivers to modify the parameters during suspend
> > without a new feature.
> >
>
> That would be ideal, but how do userland checks if it can suspend +
> change properties + resume?

As discussed, it looks to me that the only device that supports suspend is
the simulator, and it supports changing properties.

E.g:

static int vdpasim_set_vq_address(struct vdpa_device *vdpa, u16 idx,
  u64 desc_area, u64 driver_area,
  u64 device_area)
{
struct vdpasim *vdpasim = vdpa_to_sim(vdpa);
struct vdpasim_virtqueue *vq = &vdpasim->vqs[idx];

vq->desc_addr = desc_area;
vq->driver_addr = driver_area;
vq->device_addr = device_area;

return 0;
}

>
> The only way that comes to my mind is to make sure all parents return
> error if userland tries to do it, and then fallback in userland.

Yes.

> I'm
> ok with that, but I'm not sure if the current master & previous kernel
> has a coherent behavior. Do they return error? Or return success
> without changing address / vq state?

We probably don't need to worry too much here, as e.g set_vq_address
could fail even without suspend (just at uAPI level).

Thanks

>




Re: [PATCH vhost v4 02/15] vdpa: Add VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND flag

2023-12-19 Thread Jason Wang
On Wed, Dec 20, 2023 at 11:46 AM Jason Wang  wrote:
>
> On Wed, Dec 20, 2023 at 2:09 AM Dragos Tatulea  wrote:
> >
> > The virtio spec doesn't allow changing virtqueue addresses after
> > DRIVER_OK. Some devices do support this operation when the device is
> > suspended. The VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND flag
> > advertises this support as a backend features.
>
> There's an ongoing effort in virtio spec to introduce the suspend state.
>
> So I wonder if it's better to just allow such behaviour?

Actually I mean, allow drivers to modify the parameters during suspend
without a new feature.

Thanks

>
> Thanks
>
>




Re: [PATCH] vdpa: Fix an error handling path in eni_vdpa_probe()

2023-12-19 Thread Jason Wang
On Fri, Dec 8, 2023 at 5:14 AM Christophe JAILLET
 wrote:
>
> Le 20/10/2022 à 21:21, Christophe JAILLET a écrit :
> > After a successful vp_legacy_probe() call, vp_legacy_remove() should be
> > called in the error handling path, as already done in the remove function.
> >
> > Add the missing call.
> >
> > Fixes: e85087beedca ("eni_vdpa: add vDPA driver for Alibaba ENI")
> > Signed-off-by: Christophe JAILLET 
> > ---
> >   drivers/vdpa/alibaba/eni_vdpa.c | 6 --
> >   1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/vdpa/alibaba/eni_vdpa.c 
> > b/drivers/vdpa/alibaba/eni_vdpa.c
> > index 5a09a09cca70..cce3d1837104 100644
> > --- a/drivers/vdpa/alibaba/eni_vdpa.c
> > +++ b/drivers/vdpa/alibaba/eni_vdpa.c
> > @@ -497,7 +497,7 @@ static int eni_vdpa_probe(struct pci_dev *pdev, const 
> > struct pci_device_id *id)
> >   if (!eni_vdpa->vring) {
> >   ret = -ENOMEM;
> >   ENI_ERR(pdev, "failed to allocate virtqueues\n");
> > - goto err;
> > + goto err_remove_vp_legacy;
> >   }
> >
> >   for (i = 0; i < eni_vdpa->queues; i++) {
> > @@ -509,11 +509,13 @@ static int eni_vdpa_probe(struct pci_dev *pdev, const 
> > struct pci_device_id *id)
> >   ret = vdpa_register_device(&eni_vdpa->vdpa, eni_vdpa->queues);
> >   if (ret) {
> >   ENI_ERR(pdev, "failed to register to vdpa bus\n");
> > - goto err;
> > + goto err_remove_vp_legacy;
> >   }
> >
> >   return 0;
> >
> > +err_remove_vp_legacy:
> > + vp_legacy_remove(&eni_vdpa->ldev);
> >   err:
> >   put_device(&eni_vdpa->vdpa.dev);
> >   return ret;
>
> Polite reminder on a (very) old patch.

Acked-by: Jason Wang 

Thanks

>
> CJ
>




Re: [PATCH] virtio_pmem: support feature SHMEM_REGION

2023-12-19 Thread Jason Wang
On Tue, Dec 19, 2023 at 3:19 PM Changyuan Lyu  wrote:
>
> As per virtio spec 1.2 section 5.19.5.2, if the feature
> VIRTIO_PMEM_F_SHMEM_REGION has been negotiated, the driver MUST query
> shared memory ID 0 for the physical address ranges.
>
> Signed-off-by: Changyuan Lyu 
> ---
>  drivers/nvdimm/virtio_pmem.c | 29 +
>  include/uapi/linux/virtio_pmem.h |  8 
>  2 files changed, 33 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/nvdimm/virtio_pmem.c b/drivers/nvdimm/virtio_pmem.c
> index a92eb172f0e7..5b28d543728b 100644
> --- a/drivers/nvdimm/virtio_pmem.c
> +++ b/drivers/nvdimm/virtio_pmem.c
> @@ -35,6 +35,8 @@ static int virtio_pmem_probe(struct virtio_device *vdev)
> struct nd_region *nd_region;
> struct virtio_pmem *vpmem;
> struct resource res;
> +   struct virtio_shm_region shm_reg;
> +   bool have_shm;
> int err = 0;
>
> if (!vdev->config->get) {
> @@ -57,10 +59,23 @@ static int virtio_pmem_probe(struct virtio_device *vdev)
> goto out_err;
> }
>
> -   virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
> -   start, &vpmem->start);
> -   virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
> -   size, &vpmem->size);
> +   if (virtio_has_feature(vdev, VIRTIO_PMEM_F_SHMEM_REGION)) {
> +   have_shm = virtio_get_shm_region(vdev, &shm_reg,
> +   (u8)VIRTIO_PMEM_SHMCAP_ID);
> +   if (!have_shm) {
> +   dev_err(>dev, "failed to get shared memory 
> region %d\n",
> +   VIRTIO_PMEM_SHMCAP_ID);
> +   return -EINVAL;
> +   }
> +   vpmem->start = shm_reg.addr;
> +   vpmem->size = shm_reg.len;
> +   } else {
> +   virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
> +   start, &vpmem->start);
> +   virtio_cread_le(vpmem->vdev, struct virtio_pmem_config,
> +   size, &vpmem->size);
> +   }
> +
>
> res.start = vpmem->start;
> res.end   = vpmem->start + vpmem->size - 1;
> @@ -122,7 +137,13 @@ static void virtio_pmem_remove(struct virtio_device 
> *vdev)
> virtio_reset_device(vdev);
>  }
>
> +static unsigned int features[] = {
> +   VIRTIO_PMEM_F_SHMEM_REGION,
> +};
> +
>  static struct virtio_driver virtio_pmem_driver = {
> +   .feature_table  = features,
> +   .feature_table_size = ARRAY_SIZE(features),
> .driver.name= KBUILD_MODNAME,
> .driver.owner   = THIS_MODULE,
> .id_table   = id_table,
> diff --git a/include/uapi/linux/virtio_pmem.h 
> b/include/uapi/linux/virtio_pmem.h
> index d676b3620383..025174f6eacf 100644
> --- a/include/uapi/linux/virtio_pmem.h
> +++ b/include/uapi/linux/virtio_pmem.h
> @@ -14,6 +14,14 @@
>  #include 
>  #include 
>
> +/* Feature bits */
> +#define VIRTIO_PMEM_F_SHMEM_REGION 0   /* guest physical address range will 
> be
> +* indicated as shared memory region 0
> +*/
> +
> +/* shmid of the shared memory region corresponding to the pmem */
> +#define VIRTIO_PMEM_SHMCAP_ID 0

NIT: not a native speaker, but any reason for "CAP" here? Would it be
better to use SHMMEM_REGION_ID?

Thanks

> +
>  struct virtio_pmem_config {
> __le64 start;
> __le64 size;
> --
> 2.43.0.472.g3155946c3a-goog
>




Re: [PATCH] vdpa: Remove usage of the deprecated ida_simple_xx() API

2023-12-19 Thread Jason Wang
On Mon, Dec 11, 2023 at 1:52 AM Christophe JAILLET
 wrote:
>
> ida_alloc() and ida_free() should be preferred to the deprecated
> ida_simple_get() and ida_simple_remove().
>
> This is less verbose.
>
> Signed-off-by: Christophe JAILLET 

Acked-by: Jason Wang 

Thanks

> ---
>  drivers/vdpa/vdpa.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/vdpa/vdpa.c b/drivers/vdpa/vdpa.c
> index a7612e0783b3..d0695680b282 100644
> --- a/drivers/vdpa/vdpa.c
> +++ b/drivers/vdpa/vdpa.c
> @@ -131,7 +131,7 @@ static void vdpa_release_dev(struct device *d)
> if (ops->free)
> ops->free(vdev);
>
> -   ida_simple_remove(&vdpa_index_ida, vdev->index);
> +   ida_free(&vdpa_index_ida, vdev->index);
> kfree(vdev->driver_override);
> kfree(vdev);
>  }
> @@ -205,7 +205,7 @@ struct vdpa_device *__vdpa_alloc_device(struct device 
> *parent,
> return vdev;
>
>  err_name:
> -   ida_simple_remove(&vdpa_index_ida, vdev->index);
> +   ida_free(&vdpa_index_ida, vdev->index);
>  err_ida:
> kfree(vdev);
>  err:
> --
> 2.34.1
>




Re: [PATCH v5 2/4] vduse: Temporarily disable control queue features

2023-12-19 Thread Jason Wang
On Mon, Dec 18, 2023 at 5:21 PM Maxime Coquelin
 wrote:
>
>
>
> On 12/18/23 03:50, Jason Wang wrote:
> > On Wed, Dec 13, 2023 at 7:23 PM Maxime Coquelin
> >  wrote:
> >>
> >> Hi Jason,
> >>
> >> On 12/13/23 05:52, Jason Wang wrote:
> >>> On Tue, Dec 12, 2023 at 9:17 PM Maxime Coquelin
> >>>  wrote:
> >>>>
> >>>> Virtio-net driver control queue implementation is not safe
> >>>> when used with VDUSE. If the VDUSE application does not
> >>>> reply to control queue messages, it currently ends up
> >>>> hanging the kernel thread sending this command.
> >>>>
> >>>> Some work is on-going to make the control queue
> >>>> implementation robust with VDUSE. Until it is completed,
> >>>> let's disable control virtqueue and features that depend on
> >>>> it.
> >>>>
> >>>> Signed-off-by: Maxime Coquelin 
> >>>
> >>> I wonder if it's better to fail instead of a mask as a start.
> >>
> >> I think it is better to use a mask and not fail, so that we can in the
> >> future use a recent VDUSE application with an older kernel.
> >
> > It may confuse the userspace unless userspace can do post check after
> > CREATE_DEV.
> >
> > And for blk we fail when WCE is set in features_is_valid():
> >
> > static bool features_is_valid(u64 features)
> > {
> >  if (!(features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)))
> >  return false;
> >
> >  /* Now we only support read-only configuration space */
> >  if (features & (1ULL << VIRTIO_BLK_F_CONFIG_WCE))
> >  return false;
> >
> >  return true;
> > }
>
> Ok, consistency with other devices types is indeed better.
>
> But should I fail if any of the feature advertised by the application is
> not listed by the VDUSE driver, or just fail if control queue is being
> advertised by the application?

Maybe it's better to fail for any other of the features that depend on
the control vq.

Thanks

>
> Thanks,
> Maxime
>
> > Thanks
> >
> >>
> >> Why would it be better to fail than negotiating?
> >>
> >> Thanks,
> >> Maxime
> >>
> >
>




Re: [PATCH RFC 0/4] virtio-net: add tx-hash, rx-tstamp, tx-tstamp and tx-time

2023-12-19 Thread Jason Wang
On Tue, Dec 19, 2023 at 12:36 AM Willem de Bruijn
 wrote:
>
> Steffen Trumtrar wrote:
> > This series tries to pick up the work on the virtio-net timestamping
> > feature from Willem de Bruijn.
> >
> > Original series
> > Message-Id: 20210208185558.995292-1-willemdebruijn.ker...@gmail.com
> > Subject: [PATCH RFC v2 0/4] virtio-net: add tx-hash, rx-tstamp,
> > tx-tstamp and tx-time
> > From: Willem de Bruijn 
> >
> > RFC for four new features to the virtio network device:
> >
> > 1. pass tx flow state to host, for routing + telemetry
> > 2. pass rx tstamp to guest, for better RTT estimation
> > 3. pass tx tstamp to guest, idem
> > 3. pass tx delivery time to host, for accurate pacing
> >
> > All would introduce an extension to the virtio spec.
> >
> > The original series consisted of a hack around the DMA API, which should
> > be fixed in this series.
> >
> > The changes in this series are to the driver side. For the changes to qemu 
> > see:
> > https://github.com/strumtrar/qemu/tree/v8.1.1/virtio-net-ptp
> >
> > Currently only virtio-net is supported. The original series used
> > vhost-net as backend. However, the path through tun via sendmsg doesn't
> > allow us to write data back to the driver side without any hacks.
> > Therefore use the way via plain virtio-net without vhost albeit better
> > performance.
> >
> > Signed-off-by: Steffen Trumtrar 
>
> Thanks for picking this back up, Steffen. Nice to see that the code still
> applies mostly cleanly.
>
> For context: I dropped the work only because I had no real device
> implementation. The referenced patch series to qemu changes that.
>
> I suppose the main issue is the virtio API changes that this introduces,
> which will have to be accepted to the spec.
>
> One small comment to patch 4: there I just assumed the virtual device
> time is CLOCK_TAI. There is a concurrent feature under review for HW
> pacing offload with AF_XDP sockets. The clock issue comes up a bit. In
> general, for hardware we cannot assume a clock.

Any reason for this? E.g some modern NIC have PTP support.

> For virtio, perhaps
> assuming the same monotonic hardware clock in guest and host can be
> assumed.

Note that virtio can be implemented in hardware now. So we can assume
things like the kvm ptp clock.

> But this clock alignment needs some thought.
>

Thanks




Re: [PATCH vhost v4 14/15] vdpa/mlx5: Introduce reference counting to mrs

2023-12-19 Thread Jason Wang
On Wed, Dec 20, 2023 at 2:10 AM Dragos Tatulea  wrote:
>
> Deleting the old mr during mr update (.set_map) and then modifying the
> vqs with the new mr is not a good flow for firmware. The firmware
> expects that mkeys are deleted after there are no more vqs referencing
> them.
>
> Introduce reference counting for mrs to fix this. It is the only way to
> make sure that mkeys are not in use by vqs.
>
> An mr reference is taken when the mr is associated to the mr asid table
> and when the mr is linked to the vq on create/modify. The reference is
> released when the mkey is unlinked from the vq (trough modify/destroy)
> and from the mr asid table.
>
> To make things consistent, get rid of mlx5_vdpa_destroy_mr and use
> get/put semantics everywhere.
>
> Reviewed-by: Gal Pressman 
> Acked-by: Eugenio Pérez 
> Signed-off-by: Dragos Tatulea 
> ---

Acked-by: Jason Wang 

Thanks




Re: [PATCH vhost v4 13/15] vdpa/mlx5: Use vq suspend/resume during .set_map

2023-12-19 Thread Jason Wang
On Wed, Dec 20, 2023 at 2:10 AM Dragos Tatulea  wrote:
>
> Instead of tearing down and setting up vq resources, use vq
> suspend/resume during .set_map to speed things up a bit.
>
> The vq mr is updated with the new mapping while the vqs are suspended.
>
> If the device doesn't support resumable vqs, do the old teardown and
> setup dance.
>
> Reviewed-by: Gal Pressman 
> Acked-by: Eugenio Pérez 
> Signed-off-by: Dragos Tatulea 
> ---

Acked-by: Jason Wang 

Thanks




Re: [PATCH vhost v4 12/15] vdpa/mlx5: Mark vq state for modification in hw vq

2023-12-19 Thread Jason Wang
On Wed, Dec 20, 2023 at 2:10 AM Dragos Tatulea  wrote:
>
> .set_vq_state will set the indices and mark the fields to be modified in
> the hw vq.
>
> Advertise that the device supports changing the vq state when the device
> is in DRIVER_OK state and suspended.
>
> Reviewed-by: Gal Pressman 
> Signed-off-by: Dragos Tatulea 
> ---

Acked-by: Jason Wang 

Thanks




Re: [PATCH vhost v4 10/15] vdpa/mlx5: Introduce per vq and device resume

2023-12-19 Thread Jason Wang
On Wed, Dec 20, 2023 at 2:10 AM Dragos Tatulea  wrote:
>
> Implement vdpa vq and device resume if capability detected. Add support
> for suspend -> ready state change.
>
> Reviewed-by: Gal Pressman 
> Acked-by: Eugenio Pérez 
> Signed-off-by: Dragos Tatulea 

Acked-by: Jason Wang 

Thanks




Re: [PATCH vhost v4 09/15] vdpa/mlx5: Allow modifying multiple vq fields in one modify command

2023-12-19 Thread Jason Wang
On Wed, Dec 20, 2023 at 2:10 AM Dragos Tatulea  wrote:
>
> Add a bitmask variable that tracks hw vq field changes that
> are supposed to be modified on next hw vq change command.
>
> This will be useful to set multiple vq fields when resuming the vq.
>
> Reviewed-by: Gal Pressman 
> Acked-by: Eugenio Pérez 
> Signed-off-by: Dragos Tatulea 

Acked-by: Jason Wang 

Thanks




Re: [PATCH vhost v4 06/15] vdpa: Track device suspended state

2023-12-19 Thread Jason Wang
On Wed, Dec 20, 2023 at 2:09 AM Dragos Tatulea  wrote:
>
> Set vdpa device suspended state on successful suspend. Clear it on
> successful resume and reset.
>
> The state will be locked by the vhost_vdpa mutex. The mutex is taken
> during suspend, resume and reset in vhost_vdpa_unlocked_ioctl. The
> exception is vhost_vdpa_open which does a device reset but that should
> be safe because it can only happen before the other ops.
>
> Signed-off-by: Dragos Tatulea 
> Suggested-by: Eugenio Pérez 
> ---
>  drivers/vhost/vdpa.c | 17 +++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index b4e8ddf86485..00b4fa8e89f2 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -59,6 +59,7 @@ struct vhost_vdpa {
> int in_batch;
> struct vdpa_iova_range range;
> u32 batch_asid;
> +   bool suspended;

Any reason why we don't do it in the core vDPA device but here?

Thanks




Re: [PATCH vhost v4 02/15] vdpa: Add VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND flag

2023-12-19 Thread Jason Wang
On Wed, Dec 20, 2023 at 2:09 AM Dragos Tatulea  wrote:
>
> The virtio spec doesn't allow changing virtqueue addresses after
> DRIVER_OK. Some devices do support this operation when the device is
> suspended. The VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND flag
> advertises this support as a backend features.

There's an ongoing effort in virtio spec to introduce the suspend state.

So I wonder if it's better to just allow such behaviour?

Thanks


>
> Signed-off-by: Dragos Tatulea 
> Suggested-by: Eugenio Pérez 
> ---
>  include/uapi/linux/vhost_types.h | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/include/uapi/linux/vhost_types.h 
> b/include/uapi/linux/vhost_types.h
> index d7656908f730..aacd067afc89 100644
> --- a/include/uapi/linux/vhost_types.h
> +++ b/include/uapi/linux/vhost_types.h
> @@ -192,5 +192,9 @@ struct vhost_vdpa_iova_range {
>  #define VHOST_BACKEND_F_DESC_ASID0x7
>  /* IOTLB don't flush memory mapping across device reset */
>  #define VHOST_BACKEND_F_IOTLB_PERSIST  0x8
> +/* Device supports changing virtqueue addresses when device is suspended
> + * and is in state DRIVER_OK.
> + */
> +#define VHOST_BACKEND_F_CHANGEABLE_VQ_ADDR_IN_SUSPEND  0x9
>
>  #endif
> --
> 2.43.0
>




Re: [PATCH mlx5-vhost v4 01/15] vdpa/mlx5: Expose resumable vq capability

2023-12-19 Thread Jason Wang
On Wed, Dec 20, 2023 at 2:09 AM Dragos Tatulea  wrote:
>
> Necessary for checking if resumable vqs are supported by the hardware.
> Actual support will be added in a downstream patch.
>
> Reviewed-by: Gal Pressman 
> Acked-by: Eugenio Pérez 
> Signed-off-by: Dragos Tatulea 

Acked-by: Jason Wang 

Thanks


> ---
>  include/linux/mlx5/mlx5_ifc.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
> index 6f3631425f38..9eaceaf6bcb0 100644
> --- a/include/linux/mlx5/mlx5_ifc.h
> +++ b/include/linux/mlx5/mlx5_ifc.h
> @@ -1236,7 +1236,8 @@ struct mlx5_ifc_virtio_emulation_cap_bits {
>
> u8 reserved_at_c0[0x13];
> u8 desc_group_mkey_supported[0x1];
> -   u8 reserved_at_d4[0xc];
> +   u8 freeze_to_rdy_supported[0x1];
> +   u8 reserved_at_d5[0xb];
>
> u8 reserved_at_e0[0x20];
>
> --
> 2.43.0
>




Re: [PATCH v5 2/4] vduse: Temporarily disable control queue features

2023-12-17 Thread Jason Wang
On Wed, Dec 13, 2023 at 7:23 PM Maxime Coquelin
 wrote:
>
> Hi Jason,
>
> On 12/13/23 05:52, Jason Wang wrote:
> > On Tue, Dec 12, 2023 at 9:17 PM Maxime Coquelin
> >  wrote:
> >>
> >> Virtio-net driver control queue implementation is not safe
> >> when used with VDUSE. If the VDUSE application does not
> >> reply to control queue messages, it currently ends up
> >> hanging the kernel thread sending this command.
> >>
> >> Some work is on-going to make the control queue
> >> implementation robust with VDUSE. Until it is completed,
> >> let's disable control virtqueue and features that depend on
> >> it.
> >>
> >> Signed-off-by: Maxime Coquelin 
> >
> > I wonder if it's better to fail instead of a mask as a start.
>
> I think it is better to use a mask and not fail, so that we can in the
> future use a recent VDUSE application with an older kernel.

It may confuse the userspace unless userspace can do post check after
CREATE_DEV.

And for blk we fail when WCE is set in features_is_valid():

static bool features_is_valid(u64 features)
{
if (!(features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)))
return false;

/* Now we only support read-only configuration space */
if (features & (1ULL << VIRTIO_BLK_F_CONFIG_WCE))
return false;

return true;
}

Thanks

>
> Why would it be better to fail than negotiating?
>
> Thanks,
> Maxime
>




Re: [PATCH v5 2/4] vduse: Temporarily disable control queue features

2023-12-12 Thread Jason Wang
On Tue, Dec 12, 2023 at 9:17 PM Maxime Coquelin
 wrote:
>
> Virtio-net driver control queue implementation is not safe
> when used with VDUSE. If the VDUSE application does not
> reply to control queue messages, it currently ends up
> hanging the kernel thread sending this command.
>
> Some work is on-going to make the control queue
> implementation robust with VDUSE. Until it is completed,
> let's disable control virtqueue and features that depend on
> it.
>
> Signed-off-by: Maxime Coquelin 

I wonder if it's better to fail instead of a mask as a start.

Thanks

> ---
>  drivers/vdpa/vdpa_user/vduse_dev.c | 37 ++
>  1 file changed, 37 insertions(+)
>
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> b/drivers/vdpa/vdpa_user/vduse_dev.c
> index 0486ff672408..fe4b5c8203fd 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -28,6 +28,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>
>  #include "iova_domain.h"
> @@ -46,6 +47,30 @@
>
>  #define IRQ_UNBOUND -1
>
> +#define VDUSE_NET_VALID_FEATURES_MASK   \
> +   (BIT_ULL(VIRTIO_NET_F_CSUM) |   \
> +BIT_ULL(VIRTIO_NET_F_GUEST_CSUM) | \
> +BIT_ULL(VIRTIO_NET_F_MTU) |\
> +BIT_ULL(VIRTIO_NET_F_MAC) |\
> +BIT_ULL(VIRTIO_NET_F_GUEST_TSO4) | \
> +BIT_ULL(VIRTIO_NET_F_GUEST_TSO6) | \
> +BIT_ULL(VIRTIO_NET_F_GUEST_ECN) |  \
> +BIT_ULL(VIRTIO_NET_F_GUEST_UFO) |  \
> +BIT_ULL(VIRTIO_NET_F_HOST_TSO4) |  \
> +BIT_ULL(VIRTIO_NET_F_HOST_TSO6) |  \
> +BIT_ULL(VIRTIO_NET_F_HOST_ECN) |   \
> +BIT_ULL(VIRTIO_NET_F_HOST_UFO) |   \
> +BIT_ULL(VIRTIO_NET_F_MRG_RXBUF) |  \
> +BIT_ULL(VIRTIO_NET_F_STATUS) | \
> +BIT_ULL(VIRTIO_NET_F_HOST_USO) |   \
> +BIT_ULL(VIRTIO_F_ANY_LAYOUT) | \
> +BIT_ULL(VIRTIO_RING_F_INDIRECT_DESC) | \
> +BIT_ULL(VIRTIO_RING_F_EVENT_IDX) |  \
> +BIT_ULL(VIRTIO_F_VERSION_1) |  \
> +BIT_ULL(VIRTIO_F_ACCESS_PLATFORM) | \
> +BIT_ULL(VIRTIO_F_RING_PACKED) |\
> +BIT_ULL(VIRTIO_F_IN_ORDER))
> +
>  struct vduse_virtqueue {
> u16 index;
> u16 num_max;
> @@ -1782,6 +1807,16 @@ static struct attribute *vduse_dev_attrs[] = {
>
>  ATTRIBUTE_GROUPS(vduse_dev);
>
> +static void vduse_dev_features_filter(struct vduse_dev_config *config)
> +{
> +   /*
> +* Temporarily filter out virtio-net's control virtqueue and features
> +* that depend on it while CVQ is being made more robust for VDUSE.
> +*/
> +   if (config->device_id == VIRTIO_ID_NET)
> +   config->features &= VDUSE_NET_VALID_FEATURES_MASK;
> +}
> +
>  static int vduse_create_dev(struct vduse_dev_config *config,
> void *config_buf, u64 api_version)
>  {
> @@ -1797,6 +1832,8 @@ static int vduse_create_dev(struct vduse_dev_config 
> *config,
> if (!dev)
> goto err;
>
> +   vduse_dev_features_filter(config);
> +
> dev->api_version = api_version;
> dev->device_features = config->features;
> dev->device_id = config->device_id;
> --
> 2.43.0
>




  1   2   3   4   5   6   7   8   9   10   >