Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-10 Thread Samudrala, Sridhar

On 4/10/2018 11:03 PM, Jiri Pirko wrote:

Tue, Apr 10, 2018 at 05:59:02PM CEST, sridhar.samudr...@intel.com wrote:

On 4/10/2018 8:43 AM, Jiri Pirko wrote:

Tue, Apr 10, 2018 at 05:27:48PM CEST, sridhar.samudr...@intel.com wrote:

On 4/10/2018 8:22 AM, Jiri Pirko wrote:

Tue, Apr 10, 2018 at 05:13:40PM CEST, sridhar.samudr...@intel.com wrote:

On 4/10/2018 3:55 AM, Jiri Pirko wrote:

Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudr...@intel.com wrote:

On 4/9/2018 1:07 AM, Jiri Pirko wrote:

Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudr...@intel.com wrote:

On 4/6/2018 5:48 AM, Jiri Pirko wrote:

Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudr...@intel.com wrote:

[...]


+static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
+struct net_device *child_netdev)
+{
+   struct virtnet_bypass_info *vbi;
+   bool backup;
+
+   vbi = netdev_priv(bypass_netdev);
+   backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
+   if (backup ? rtnl_dereference(vbi->backup_netdev) :
+   rtnl_dereference(vbi->active_netdev)) {
+   netdev_info(bypass_netdev,
+   "%s attempting to join bypass dev when %s already 
present\n",
+   child_netdev->name, backup ? "backup" : "active");

Bypass module should check if there is already some other netdev
enslaved and refuse right there.

This will work for virtio-net with 3 netdev model, but this check has to be 
done by netvsc
as its bypass_netdev is same as the backup_netdev.
Will add a flag while registering with the bypass module to indicate if the 
driver is doing
a 2 netdev or 3 netdev model and based on that flag this check can be done in 
bypass module
for 3 netdev scenario.

Just let me undestand it clearly. What I expect the difference would be
between 2netdev and3 netdev model is this:
2netdev:
   bypass_master
  /
 /
VF_slave

3netdev:
   bypass_master
  / \
 /   \
VF_slave   backup_slave

Is that correct? If not, how does it look like?



Looks correct.
VF_slave and backup_slave are the original netdevs and are present in both the 
models.
In the 3 netdev model, bypass_master netdev is created and VF_slave and 
backup_slave are
marked as the 2 slaves of this new netdev.

You say it looks correct and in another sentence you provide completely
different description. Could you please look again?


To be exact, 2 netdev model with netvsc looks like this.

  netvsc_netdev
/
   /
VF_slave

With virtio_net, 3 netdev model

bypass_netdev
/ \
   /   \
VF_slave   virtio_net netdev

Could you also mark the original netdev which is there now? is it
bypass_netdev or virtio_net_netdev ?

bypass_netdev
  / \
 /   \
VF_slave   virtio_net netdev (original)

That does not make sense.
1) You diverge from the behaviour of the netvsc, where the original
 netdev is a master of the VF
2) If the original netdev is a slave, you cannot have any IP address
 configured on it (well you could, but the rx_handler would eat every
 incoming packet). So you will break the user bacause he would have to
 move the configuration to the new master device.
This only makes sense that the original netdev becomes the master for both
netvsc and virtio_net.

Forgot to mention that bypass_netdev takes over the name of the original
netdev and
virtio_net netdev will get the backup name.

What do you mean by "name"?


bypass_netdev also is associated with the same pci device as the original 
virtio_net
netdev via SET_NETDEV_DEV().  Also, we added ndo_get_phys_port_name() to 
virtio_net
that will return _bkup when BACKUP feature is enabled.

So for ex: if virtio_net inteface was getting 'ens12' as the name assigned by 
udev
without BACKUP feature,  when BACKUP feature is enabled,  the  bypass_netdev 
will be
named 'ens12' and the original virtio_net will get named as ens12n_bkup.





So the userspace network configuration doesn't need to change.




___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-10 Thread Jiri Pirko
Tue, Apr 10, 2018 at 05:59:02PM CEST, sridhar.samudr...@intel.com wrote:
>On 4/10/2018 8:43 AM, Jiri Pirko wrote:
>> Tue, Apr 10, 2018 at 05:27:48PM CEST, sridhar.samudr...@intel.com wrote:
>> > On 4/10/2018 8:22 AM, Jiri Pirko wrote:
>> > > Tue, Apr 10, 2018 at 05:13:40PM CEST, sridhar.samudr...@intel.com wrote:
>> > > > On 4/10/2018 3:55 AM, Jiri Pirko wrote:
>> > > > > Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudr...@intel.com 
>> > > > > wrote:
>> > > > > > On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>> > > > > > > Sat, Apr 07, 2018 at 12:59:14AM CEST, 
>> > > > > > > sridhar.samudr...@intel.com wrote:
>> > > > > > > > On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>> > > > > > > > > Thu, Apr 05, 2018 at 11:08:22PM CEST, 
>> > > > > > > > > sridhar.samudr...@intel.com wrote:
>> > > > > > > [...]
>> > > > > > > 
>> > > > > > > > > > +static int virtnet_bypass_join_child(struct net_device 
>> > > > > > > > > > *bypass_netdev,
>> > > > > > > > > > +   struct net_device 
>> > > > > > > > > > *child_netdev)
>> > > > > > > > > > +{
>> > > > > > > > > > +  struct virtnet_bypass_info *vbi;
>> > > > > > > > > > +  bool backup;
>> > > > > > > > > > +
>> > > > > > > > > > +  vbi = netdev_priv(bypass_netdev);
>> > > > > > > > > > +  backup = (child_netdev->dev.parent == 
>> > > > > > > > > > bypass_netdev->dev.parent);
>> > > > > > > > > > +  if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> > > > > > > > > > +  rtnl_dereference(vbi->active_netdev)) {
>> > > > > > > > > > +  netdev_info(bypass_netdev,
>> > > > > > > > > > +  "%s attempting to join bypass dev 
>> > > > > > > > > > when %s already present\n",
>> > > > > > > > > > +  child_netdev->name, backup ? 
>> > > > > > > > > > "backup" : "active");
>> > > > > > > > > Bypass module should check if there is already some other 
>> > > > > > > > > netdev
>> > > > > > > > > enslaved and refuse right there.
>> > > > > > > > This will work for virtio-net with 3 netdev model, but this 
>> > > > > > > > check has to be done by netvsc
>> > > > > > > > as its bypass_netdev is same as the backup_netdev.
>> > > > > > > > Will add a flag while registering with the bypass module to 
>> > > > > > > > indicate if the driver is doing
>> > > > > > > > a 2 netdev or 3 netdev model and based on that flag this check 
>> > > > > > > > can be done in bypass module
>> > > > > > > > for 3 netdev scenario.
>> > > > > > > Just let me undestand it clearly. What I expect the difference 
>> > > > > > > would be
>> > > > > > > between 2netdev and3 netdev model is this:
>> > > > > > > 2netdev:
>> > > > > > >   bypass_master
>> > > > > > >  /
>> > > > > > > /
>> > > > > > > VF_slave
>> > > > > > > 
>> > > > > > > 3netdev:
>> > > > > > >   bypass_master
>> > > > > > >  / \
>> > > > > > > /   \
>> > > > > > > VF_slave   backup_slave
>> > > > > > > 
>> > > > > > > Is that correct? If not, how does it look like?
>> > > > > > > 
>> > > > > > > 
>> > > > > > Looks correct.
>> > > > > > VF_slave and backup_slave are the original netdevs and are present 
>> > > > > > in both the models.
>> > > > > > In the 3 netdev model, bypass_master netdev is created and 
>> > > > > > VF_slave and backup_slave are
>> > > > > > marked as the 2 slaves of this new netdev.
>> > > > > You say it looks correct and in another sentence you provide 
>> > > > > completely
>> > > > > different description. Could you please look again?
>> > > > > 
>> > > > To be exact, 2 netdev model with netvsc looks like this.
>> > > > 
>> > > >  netvsc_netdev
>> > > >/
>> > > >   /
>> > > > VF_slave
>> > > > 
>> > > > With virtio_net, 3 netdev model
>> > > > 
>> > > >bypass_netdev
>> > > >/ \
>> > > >   /   \
>> > > > VF_slave   virtio_net netdev
>> > > Could you also mark the original netdev which is there now? is it
>> > > bypass_netdev or virtio_net_netdev ?
>> > bypass_netdev
>> >  / \
>> > /   \
>> > VF_slave   virtio_net netdev (original)
>> That does not make sense.
>> 1) You diverge from the behaviour of the netvsc, where the original
>> netdev is a master of the VF
>> 2) If the original netdev is a slave, you cannot have any IP address
>> configured on it (well you could, but the rx_handler would eat every
>> incoming packet). So you will break the user bacause he would have to
>> move the configuration to the new master device.
>> This only makes sense that the original netdev becomes the master for both
>> netvsc and virtio_net.
>Forgot to mention that bypass_netdev takes over the name of the original
>netdev and
>virtio_net netdev will get the backup name.

What do you mean by "name"?


>So the userspace network configuration doesn't need to change.
>
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/vi

Re: [PATCH v3 0/2] vhost: fix vhost_vq_access_ok() log check

2018-04-10 Thread Jason Wang



On 2018年04月11日 10:35, Stefan Hajnoczi wrote:

v3:
  * Rebased onto net/master and resolved conflict [DaveM]

v2:
  * Rewrote the conditional to make the vq access check clearer [Linus]
  * Added Patch 2 to make the return type consistent and harder to misuse 
[Linus]

The first patch fixes the vhost virtqueue access check which was recently
broken.  The second patch replaces the int return type with bool to prevent
future bugs.

Stefan Hajnoczi (2):
   vhost: fix vhost_vq_access_ok() log check
   vhost: return bool from *_access_ok() functions

  drivers/vhost/vhost.h |  4 +--
  drivers/vhost/vhost.c | 70 ++-
  2 files changed, 38 insertions(+), 36 deletions(-)



Acked-by: Jason Wang 

Thanks
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [virtio-dev] Re: [PATCH v2] virtio_balloon: export hugetlb page allocation counts

2018-04-10 Thread Jason Wang



On 2018年04月10日 05:11, Jonathan Helman wrote:



On 03/22/2018 07:38 PM, Jason Wang wrote:



On 2018年03月22日 11:10, Michael S. Tsirkin wrote:

On Thu, Mar 22, 2018 at 09:52:18AM +0800, Jason Wang wrote:

On 2018年03月20日 12:26, Jonathan Helman wrote:

On Mar 19, 2018, at 7:31 PM, Jason Wang  wrote:



On 2018年03月20日 06:14, Jonathan Helman wrote:

Export the number of successful and failed hugetlb page
allocations via the virtio balloon driver. These 2 counts
come directly from the vm_events HTLB_BUDDY_PGALLOC and
HTLB_BUDDY_PGALLOC_FAIL.

Signed-off-by: Jonathan Helman

Reviewed-by: Jason Wang

Thanks.


---
   drivers/virtio/virtio_balloon.c | 6 ++
   include/uapi/linux/virtio_balloon.h | 4 +++-
   2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c 
b/drivers/virtio/virtio_balloon.c

index dfe5684..6b237e3 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -272,6 +272,12 @@ static unsigned int 
update_balloon_stats(struct virtio_balloon *vb)

   pages_to_bytes(events[PSWPOUT]));
   update_stat(vb, idx++, VIRTIO_BALLOON_S_MAJFLT, 
events[PGMAJFAULT]);
   update_stat(vb, idx++, VIRTIO_BALLOON_S_MINFLT, 
events[PGFAULT]);

+#ifdef CONFIG_HUGETLB_PAGE
+    update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGALLOC,
+    events[HTLB_BUDDY_PGALLOC]);
+    update_stat(vb, idx++, VIRTIO_BALLOON_S_HTLB_PGFAIL,
+    events[HTLB_BUDDY_PGALLOC_FAIL]);
+#endif
   #endif
   update_stat(vb, idx++, VIRTIO_BALLOON_S_MEMFREE,
   pages_to_bytes(i.freeram));
diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h

index 4e8b830..40297a3 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -53,7 +53,9 @@ struct virtio_balloon_config {
   #define VIRTIO_BALLOON_S_MEMTOT   5   /* Total amount of 
memory */
   #define VIRTIO_BALLOON_S_AVAIL    6   /* Available memory as 
in /proc */

   #define VIRTIO_BALLOON_S_CACHES   7   /* Disk caches */
-#define VIRTIO_BALLOON_S_NR   8
+#define VIRTIO_BALLOON_S_HTLB_PGALLOC  8  /* Hugetlb page 
allocations */
+#define VIRTIO_BALLOON_S_HTLB_PGFAIL   9  /* Hugetlb page 
allocation failures */

+#define VIRTIO_BALLOON_S_NR   10
 /*
    * Memory statistics structure.
Not for this patch, but it looks to me that exporting such nr 
through uapi is fragile.

Sorry, can you explain what you mean here?

Jon
Spec said "Within an output buffer submitted to the statsq, the 
device MUST
ignore entries with tag values that it does not recognize". So 
exporting
VIRTIO_BALLOON_S_NR seems useless and device implementation can not 
depend

on such number in uapi.

Thanks

Suggestions? I don't like to break build for people ...



Didn't have a good idea. But maybe we should keep VIRTIO_BALLOON_S_NR 
unchanged, and add a comment here.


Thanks


I think Jason's comment is for a future patch. Didn't see this patch 
get applied, so wondering if it could be.


Thanks,
Jon


Hi Jon:

Have you tested new driver with old qemu?

Thanks


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

[PATCH v3 1/2] vhost: fix vhost_vq_access_ok() log check

2018-04-10 Thread Stefan Hajnoczi
Commit d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log
when IOTLB is enabled") introduced a regression.  The logic was
originally:

  if (vq->iotlb)
  return 1;
  return A && B;

After the patch the short-circuit logic for A was inverted:

  if (A || vq->iotlb)
  return A;
  return B;

This patch fixes the regression by rewriting the checks in the obvious
way, no longer returning A when vq->iotlb is non-NULL (which is hard to
understand).

Reported-by: syzbot+65a84dde0214b0387...@syzkaller.appspotmail.com
Cc: Jason Wang 
Signed-off-by: Stefan Hajnoczi 
---
 drivers/vhost/vhost.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index bec722e41f58..fc805b7fad9d 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1244,10 +1244,12 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq,
 /* Caller should have vq mutex and device mutex */
 int vhost_vq_access_ok(struct vhost_virtqueue *vq)
 {
-   int ret = vq_log_access_ok(vq, vq->log_base);
+   if (!vq_log_access_ok(vq, vq->log_base))
+   return 0;
 
-   if (ret || vq->iotlb)
-   return ret;
+   /* Access validation occurs at prefetch time with IOTLB */
+   if (vq->iotlb)
+   return 1;
 
return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used);
 }
-- 
2.14.3

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[PATCH v3 2/2] vhost: return bool from *_access_ok() functions

2018-04-10 Thread Stefan Hajnoczi
Currently vhost *_access_ok() functions return int.  This is error-prone
because there are two popular conventions:

1. 0 means failure, 1 means success
2. -errno means failure, 0 means success

Although vhost mostly uses #1, it does not do so consistently.
umem_access_ok() uses #2.

This patch changes the return type from int to bool so that false means
failure and true means success.  This eliminates a potential source of
errors.

Suggested-by: Linus Torvalds 
Signed-off-by: Stefan Hajnoczi 
---
 drivers/vhost/vhost.h |  4 ++--
 drivers/vhost/vhost.c | 66 +--
 2 files changed, 35 insertions(+), 35 deletions(-)

diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index d8ee85ae8fdc..6c844b90a168 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -178,8 +178,8 @@ void vhost_dev_cleanup(struct vhost_dev *);
 void vhost_dev_stop(struct vhost_dev *);
 long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, void __user 
*argp);
 long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user 
*argp);
-int vhost_vq_access_ok(struct vhost_virtqueue *vq);
-int vhost_log_access_ok(struct vhost_dev *);
+bool vhost_vq_access_ok(struct vhost_virtqueue *vq);
+bool vhost_log_access_ok(struct vhost_dev *);
 
 int vhost_get_vq_desc(struct vhost_virtqueue *,
  struct iovec iov[], unsigned int iov_count,
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index fc805b7fad9d..0fcb51a9940c 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -641,14 +641,14 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
 
-static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
+static bool log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
 {
u64 a = addr / VHOST_PAGE_SIZE / 8;
 
/* Make sure 64 bit math will not overflow. */
if (a > ULONG_MAX - (unsigned long)log_base ||
a + (unsigned long)log_base > ULONG_MAX)
-   return 0;
+   return false;
 
return access_ok(VERIFY_WRITE, log_base + a,
 (sz + VHOST_PAGE_SIZE * 8 - 1) / VHOST_PAGE_SIZE / 8);
@@ -661,30 +661,30 @@ static bool vhost_overflow(u64 uaddr, u64 size)
 }
 
 /* Caller should have vq mutex and device mutex. */
-static int vq_memory_access_ok(void __user *log_base, struct vhost_umem *umem,
-  int log_all)
+static bool vq_memory_access_ok(void __user *log_base, struct vhost_umem *umem,
+   int log_all)
 {
struct vhost_umem_node *node;
 
if (!umem)
-   return 0;
+   return false;
 
list_for_each_entry(node, &umem->umem_list, link) {
unsigned long a = node->userspace_addr;
 
if (vhost_overflow(node->userspace_addr, node->size))
-   return 0;
+   return false;
 
 
if (!access_ok(VERIFY_WRITE, (void __user *)a,
node->size))
-   return 0;
+   return false;
else if (log_all && !log_access_ok(log_base,
   node->start,
   node->size))
-   return 0;
+   return false;
}
-   return 1;
+   return true;
 }
 
 static inline void __user *vhost_vq_meta_fetch(struct vhost_virtqueue *vq,
@@ -701,13 +701,13 @@ static inline void __user *vhost_vq_meta_fetch(struct 
vhost_virtqueue *vq,
 
 /* Can we switch to this memory table? */
 /* Caller should have device mutex but not vq mutex */
-static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
-   int log_all)
+static bool memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
+int log_all)
 {
int i;
 
for (i = 0; i < d->nvqs; ++i) {
-   int ok;
+   bool ok;
bool log;
 
mutex_lock(&d->vqs[i]->mutex);
@@ -717,12 +717,12 @@ static int memory_access_ok(struct vhost_dev *d, struct 
vhost_umem *umem,
ok = vq_memory_access_ok(d->vqs[i]->log_base,
 umem, log);
else
-   ok = 1;
+   ok = true;
mutex_unlock(&d->vqs[i]->mutex);
if (!ok)
-   return 0;
+   return false;
}
-   return 1;
+   return true;
 }
 
 static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
@@ -959,21 +959,21 @@ static void vhost_iotlb_notify_vq(struct vhost_dev *d,
spin_unlock(&d->iotlb_lock);
 }
 
-static int umem_access_ok(u64 uaddr, u64 size, int access)
+static bool umem_access_ok(u64 uad

[PATCH v3 0/2] vhost: fix vhost_vq_access_ok() log check

2018-04-10 Thread Stefan Hajnoczi
v3:
 * Rebased onto net/master and resolved conflict [DaveM]

v2:
 * Rewrote the conditional to make the vq access check clearer [Linus]
 * Added Patch 2 to make the return type consistent and harder to misuse [Linus]

The first patch fixes the vhost virtqueue access check which was recently
broken.  The second patch replaces the int return type with bool to prevent
future bugs.

Stefan Hajnoczi (2):
  vhost: fix vhost_vq_access_ok() log check
  vhost: return bool from *_access_ok() functions

 drivers/vhost/vhost.h |  4 +--
 drivers/vhost/vhost.c | 70 ++-
 2 files changed, 38 insertions(+), 36 deletions(-)

-- 
2.14.3

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost backend

2018-04-10 Thread Jason Wang



On 2018年04月10日 22:23, Liang, Cunming wrote:

-Original Message-
From: Michael S. Tsirkin [mailto:m...@redhat.com]
Sent: Tuesday, April 10, 2018 9:36 PM
To: Liang, Cunming
Cc: Paolo Bonzini; Bie, Tiwei;
Jason Wang;alex.william...@redhat.com;
ddut...@redhat.com; Duyck, Alexander H;
virtio-...@lists.oasis-open.org;linux-ker...@vger.kernel.org;
k...@vger.kernel.org;virtualization@lists.linux-foundation.org;
net...@vger.kernel.org; Daly, Dan; Wang, Zhihong
; Tan, Jianfeng; Wang, Xiao
W
Subject: Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost
backend

On Tue, Apr 10, 2018 at 09:23:53AM +, Liang, Cunming wrote:

-Original Message-
From: Paolo Bonzini [mailto:pbonz...@redhat.com]
Sent: Tuesday, April 10, 2018 3:52 PM
To: Bie, Tiwei; Jason Wang

Cc:m...@redhat.com;alex.william...@redhat.com;ddut...@redhat.com;
Duyck, Alexander H;
virtio-dev@lists.oasis- open.org;linux-ker...@vger.kernel.org;
k...@vger.kernel.org;virtualization@lists.linux-foundation.org;
net...@vger.kernel.org; Daly, Dan; Liang,
Cunming; Wang, Zhihong
; Tan, Jianfeng;
Wang, Xiao W
Subject: Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based
hardware vhost backend

On 10/04/2018 06:57, Tiwei Bie wrote:

So you just move the abstraction layer from qemu to kernel, and
you still need different drivers in kernel for different device
interfaces of accelerators. This looks even more complex than
leaving it in qemu. As you said, another idea is to implement
userspace vhost backend for accelerators which seems easier and
could co-work with other parts of qemu without inventing new type of

messages.

I'm not quite sure. Do you think it's acceptable to add various
vendor specific hardware drivers in QEMU?

I think so.  We have vendor-specific quirks, and at some point there
was an idea of using quirks to implement (vendor-specific) live
migration support for assigned devices.

Vendor-specific quirks of accessing VGA is a small portion. Other major portions

are still handled by guest driver.

While in this case, when saying various vendor specific drivers in QEMU, it says

QEMU takes over and provides the entire user space device drivers. Some parts
are even not relevant to vhost, they're basic device function enabling. 
Moreover,
it could be different kinds of devices(network/block/...) under vhost. No matter
# of vendors or # of types, total LOC is not small.

The idea is to avoid introducing these extra complexity out of QEMU, keeping

vhost adapter simple. As vhost protocol is de factor standard, it leverages 
kernel
device driver to provide the diversity. Changing once in QEMU, then it supports
multi-vendor devices whose drivers naturally providing kernel driver there.

If QEMU is going to build a user space driver framework there, we're open mind

on that, even leveraging DPDK as the underlay library. Looking forward to more
others' comments from community.

Steve

Dependency on a kernel driver is fine IMHO. It's the dependency on a DPDK
backend that makes people unhappy, since the functionality in question is setup-
time only.

Agreed, we don't see dependency on kernel driver is a problem.


At engineering level, kernel driver is harder than userspace driver.



mdev based vhost backend (this patch set) is independent with vhost-user 
extension patch set. In fact, there're a few vhost-user providers, DPDK 
librte_vhost is one of them. FD.IO/VPP and snabbswitch have their own 
vhost-user providers. So I can't agree on vhost-user extension patch depends on 
DPDK backend. But anyway, that's the topic of another mail thread.



Well we can treat mdev as another kind of transport of vhost-user. And 
technically we can even implement a relay mdev than forward vhost-user 
messages to dpdk.


Thanks
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost backend

2018-04-10 Thread Jason Wang



On 2018年04月10日 17:23, Liang, Cunming wrote:



-Original Message-
From: Paolo Bonzini [mailto:pbonz...@redhat.com]
Sent: Tuesday, April 10, 2018 3:52 PM
To: Bie, Tiwei ; Jason Wang 
Cc: m...@redhat.com; alex.william...@redhat.com; ddut...@redhat.com;
Duyck, Alexander H ; virtio-dev@lists.oasis-
open.org; linux-ker...@vger.kernel.org; k...@vger.kernel.org;
virtualization@lists.linux-foundation.org; net...@vger.kernel.org; Daly, Dan
; Liang, Cunming ; Wang,
Zhihong ; Tan, Jianfeng ;
Wang, Xiao W 
Subject: Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware
vhost backend

On 10/04/2018 06:57, Tiwei Bie wrote:

So you just move the abstraction layer from qemu to kernel, and you
still need different drivers in kernel for different device
interfaces of accelerators. This looks even more complex than leaving
it in qemu. As you said, another idea is to implement userspace vhost
backend for accelerators which seems easier and could co-work with
other parts of qemu without inventing new type of messages.

I'm not quite sure. Do you think it's acceptable to add various vendor
specific hardware drivers in QEMU?

I think so.  We have vendor-specific quirks, and at some point there was an
idea of using quirks to implement (vendor-specific) live migration support for
assigned devices.

Vendor-specific quirks of accessing VGA is a small portion. Other major 
portions are still handled by guest driver.

While in this case, when saying various vendor specific drivers in QEMU, it 
says QEMU takes over and provides the entire user space device drivers. Some 
parts are even not relevant to vhost, they're basic device function enabling. 
Moreover, it could be different kinds of devices(network/block/...) under 
vhost. No matter # of vendors or # of types, total LOC is not small.

The idea is to avoid introducing these extra complexity out of QEMU, keeping 
vhost adapter simple. As vhost protocol is de factor standard, it leverages 
kernel device driver to provide the diversity. Changing once in QEMU, then it 
supports multi-vendor devices whose drivers naturally providing kernel driver 
there.


Let me clarify my question, it's not qemu vs kenrel but userspace vs 
kernel. It could be a library which could be linked to qemu. Doing it in 
userspace have the following obvious advantages:


- attack surface was limited to userspace
- easier to be maintained (compared to kernel driver)
- easier to be extended without introducing new userspace/kernel interfaces
- not tied to a specific operating system

If we want to do it in the kernel, need to consider to unify code 
between mdev device driver and generic driver. For net driver, maybe we 
can even consider to do it on top of exist drivers.




If QEMU is going to build a user space driver framework there, we're open mind 
on that, even leveraging DPDK as the underlay library. Looking forward to more 
others' comments from community.


I'm doing this now by implementing vhost inside qemu IOThreads. Hope I 
can post RFC in few months.


Thanks


Steve


Paolo


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost backend

2018-04-10 Thread Stefan Hajnoczi
On Tue, Apr 10, 2018 at 09:23:53AM +, Liang, Cunming wrote:
> If QEMU is going to build a user space driver framework there, we're open 
> mind on that, even leveraging DPDK as the underlay library. Looking forward 
> to more others' comments from community.

There is already an NVMe VFIO driver in QEMU (see block/nvme.c).  So in
principle there's no reason against userspace drivers in QEMU.

Stefan


signature.asc
Description: PGP signature
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

RE: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost backend

2018-04-10 Thread Tian, Kevin
> From: Liang, Cunming
> Sent: Tuesday, April 10, 2018 10:24 PM
> 
[...]
> >
> > As others said, we do not need to go overeboard. A couple of small
> vendor-
> > specific quirks in qemu isn't a big deal.
> 
> It's quite challenge to identify it's small or not, there's no uniform metric.
> 
> It's only dependent on QEMU itself, that's the obvious benefit. Tradeoff is
> to build the entire device driver. We don't object to do that in QEMU, but
> wanna make sure to understand the boundary size clearly.
> 

It might be helpful if you can post some sample code using proposed 
framework - then people have a clear feeling about what size is talked 
about here and whether it falls into the concept of 'small quirks'.

Thanks
Kevin
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework

2018-04-10 Thread Michael S. Tsirkin
On Tue, Apr 10, 2018 at 02:26:08PM -0700, Stephen Hemminger wrote:
> On Tue, 10 Apr 2018 11:59:50 -0700
> Sridhar Samudrala  wrote:
> 
> > Use the registration/notification framework supported by the generic
> > bypass infrastructure.
> > 
> > Signed-off-by: Sridhar Samudrala 
> > ---
> 
> Thanks for doing this.  Your current version has couple show stopper
> issues.
> 
> First, the slave device is instantly taking over the slave.
> This doesn't allow udev/systemd to do its device rename of the slave
> device. Netvsc uses a delayed work to workaround this.
> 
> Secondly, the select queue needs to call queue selection in VF.
> The bonding/teaming logic doesn't work well for UDP flows.
> Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
> fixed this performance problem.
> 
> Lastly, more indirection is bad in current climate.

Well right now netvsc does an indirect call to the PT device,
does it not? If you really want max performance when PT
is in use you need to do the reverse and have PT forward to netvsc.

> I am not completely adverse to this but it needs to be fast, simple
> and completely transparent.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 0/2] vhost: fix vhost_vq_access_ok() log check

2018-04-10 Thread Stefan Hajnoczi
On Tue, Apr 10, 2018 at 10:50:43AM -0400, David Miller wrote:
> From: Jason Wang 
> Date: Tue, 10 Apr 2018 14:40:10 +0800
> 
> > On 2018年04月10日 13:26, Stefan Hajnoczi wrote:
> >> v2:
> >>   * Rewrote the conditional to make the vq access check clearer [Linus]
> >>   * Added Patch 2 to make the return type consistent and harder to misuse
> >>   * [Linus]
> >>
> >> The first patch fixes the vhost virtqueue access check which was
> >> recently
> >> broken.  The second patch replaces the int return type with bool to
> >> prevent
> >> future bugs.
> >>
> >> Stefan Hajnoczi (2):
> >>vhost: fix vhost_vq_access_ok() log check
> >>vhost: return bool from *_access_ok() functions
> >>
> >>   drivers/vhost/vhost.h |  4 +--
> >>   drivers/vhost/vhost.c | 70
> >>   ++-
> >>   2 files changed, 38 insertions(+), 36 deletions(-)
> >>
> > 
> > Acked-by: Jason Wang 
> 
> This sereis doesn't apply cleanly to the net tree, please respin
> and add the appropriate Acks and Fixes: tags that Michael and Jason
> have provided.

Sorry, will fix.

Stefan


signature.asc
Description: PGP signature
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH v29 1/4] mm: support reporting free page blocks

2018-04-10 Thread Wei Wang

On 04/11/2018 07:25 AM, Michael S. Tsirkin wrote:

On Tue, Apr 10, 2018 at 01:54:29PM -0700, Andrew Morton wrote:

On Tue, 10 Apr 2018 21:19:31 +0300 "Michael S. Tsirkin"  wrote:


Andrew, were your questions answered? If yes could I bother you for an ack on 
this?


Still not very happy that readers are told that "this function may
sleep" when it clearly doesn't do so.  If we wish to be able to change
it to sleep in the future then that should be mentioned.  And even put a
might_sleep() in there, to catch people who didn't read the comments...

Otherwise it looks OK.

Oh, might_sleep with a comment explaining it's for the future sounds
good to me. I queued this - Wei, could you post a patch on top pls?



I'm just thinking if it would be necessary to add another might_sleep, 
because we've had a cond_resched there which has wrapped a __might_sleep.


Best,
Wei
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework

2018-04-10 Thread Stephen Hemminger
On Tue, 10 Apr 2018 16:44:47 -0700
Siwei Liu  wrote:

> On Tue, Apr 10, 2018 at 4:28 PM, Michael S. Tsirkin  wrote:
> > On Tue, Apr 10, 2018 at 02:26:08PM -0700, Stephen Hemminger wrote:  
> >> On Tue, 10 Apr 2018 11:59:50 -0700
> >> Sridhar Samudrala  wrote:
> >>  
> >> > Use the registration/notification framework supported by the generic
> >> > bypass infrastructure.
> >> >
> >> > Signed-off-by: Sridhar Samudrala 
> >> > ---  
> >>
> >> Thanks for doing this.  Your current version has couple show stopper
> >> issues.
> >>
> >> First, the slave device is instantly taking over the slave.
> >> This doesn't allow udev/systemd to do its device rename of the slave
> >> device. Netvsc uses a delayed work to workaround this.  
> >
> > Interesting. Does this mean udev must act within a specific time window
> > then?  
> 
> Sighs, lots of hacks. Why propgating this from driver to a common
> module. We really need a clean solution.
> 

I had a patch to wait for udev to do the rename and go from there
but davem rejected it.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework

2018-04-10 Thread Siwei Liu
On Tue, Apr 10, 2018 at 4:28 PM, Michael S. Tsirkin  wrote:
> On Tue, Apr 10, 2018 at 02:26:08PM -0700, Stephen Hemminger wrote:
>> On Tue, 10 Apr 2018 11:59:50 -0700
>> Sridhar Samudrala  wrote:
>>
>> > Use the registration/notification framework supported by the generic
>> > bypass infrastructure.
>> >
>> > Signed-off-by: Sridhar Samudrala 
>> > ---
>>
>> Thanks for doing this.  Your current version has couple show stopper
>> issues.
>>
>> First, the slave device is instantly taking over the slave.
>> This doesn't allow udev/systemd to do its device rename of the slave
>> device. Netvsc uses a delayed work to workaround this.
>
> Interesting. Does this mean udev must act within a specific time window
> then?

Sighs, lots of hacks. Why propgating this from driver to a common
module. We really need a clean solution.

-Siwei


>
>> Secondly, the select queue needs to call queue selection in VF.
>> The bonding/teaming logic doesn't work well for UDP flows.
>> Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
>> fixed this performance problem.
>>
>> Lastly, more indirection is bad in current climate.
>>
>> I am not completely adverse to this but it needs to be fast, simple
>> and completely transparent.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework

2018-04-10 Thread Michael S. Tsirkin
On Tue, Apr 10, 2018 at 02:26:08PM -0700, Stephen Hemminger wrote:
> On Tue, 10 Apr 2018 11:59:50 -0700
> Sridhar Samudrala  wrote:
> 
> > Use the registration/notification framework supported by the generic
> > bypass infrastructure.
> > 
> > Signed-off-by: Sridhar Samudrala 
> > ---
> 
> Thanks for doing this.  Your current version has couple show stopper
> issues.
> 
> First, the slave device is instantly taking over the slave.
> This doesn't allow udev/systemd to do its device rename of the slave
> device. Netvsc uses a delayed work to workaround this.

Interesting. Does this mean udev must act within a specific time window
then?

> Secondly, the select queue needs to call queue selection in VF.
> The bonding/teaming logic doesn't work well for UDP flows.
> Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
> fixed this performance problem.
> 
> Lastly, more indirection is bad in current climate.
> 
> I am not completely adverse to this but it needs to be fast, simple
> and completely transparent.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v29 1/4] mm: support reporting free page blocks

2018-04-10 Thread Michael S. Tsirkin
On Tue, Apr 10, 2018 at 01:54:29PM -0700, Andrew Morton wrote:
> On Tue, 10 Apr 2018 21:19:31 +0300 "Michael S. Tsirkin"  
> wrote:
> 
> > 
> > Andrew, were your questions answered? If yes could I bother you for an ack 
> > on this?
> > 
> 
> Still not very happy that readers are told that "this function may
> sleep" when it clearly doesn't do so.  If we wish to be able to change
> it to sleep in the future then that should be mentioned.  And even put a
> might_sleep() in there, to catch people who didn't read the comments...
> 
> Otherwise it looks OK.

Oh, might_sleep with a comment explaining it's for the future sounds
good to me. I queued this - Wei, could you post a patch on top pls?

-- 
MST
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework

2018-04-10 Thread Samudrala, Sridhar

On 4/10/2018 2:26 PM, Stephen Hemminger wrote:

On Tue, 10 Apr 2018 11:59:50 -0700
Sridhar Samudrala  wrote:


Use the registration/notification framework supported by the generic
bypass infrastructure.

Signed-off-by: Sridhar Samudrala 
---

Thanks for doing this.  Your current version has couple show stopper
issues.

First, the slave device is instantly taking over the slave.
This doesn't allow udev/systemd to do its device rename of the slave
device. Netvsc uses a delayed work to workaround this.


OK. I guess you are referring to the dev_set_mtu() and dev_open() calls that are
made in bypass_slave_register() and you want to defer them to be done after
a delay.  I could avoid these calls in case of netvsc based on bypass_ops.




Secondly, the select queue needs to call queue selection in VF.
The bonding/teaming logic doesn't work well for UDP flows.
Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
fixed this performance problem.


netvsc should not be using bypass_select_queue() as  that ndo op gets used
only with 3-netdev model.
Anyway, will look into updating bypass_select_queue() based on your fix.



Lastly, more indirection is bad in current climate.

I am not completely adverse to this but it needs to be fast, simple
and completely transparent.


Not sure we can avoid this indirection if we want to commonize the code,  but 
use
different models for virtio-net and netvsc.

On the other hand, these patches avoid calls to get_netvsc_bymac() and
get_netvsc_by_ref() that go through all the devices for all the netdev events.
netvsc lookups should be much faster.

Thanks
Sridhar

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework

2018-04-10 Thread Stephen Hemminger
On Tue, 10 Apr 2018 11:59:50 -0700
Sridhar Samudrala  wrote:

> Use the registration/notification framework supported by the generic
> bypass infrastructure.
> 
> Signed-off-by: Sridhar Samudrala 
> ---

Thanks for doing this.  Your current version has couple show stopper
issues.

First, the slave device is instantly taking over the slave.
This doesn't allow udev/systemd to do its device rename of the slave
device. Netvsc uses a delayed work to workaround this.

Secondly, the select queue needs to call queue selection in VF.
The bonding/teaming logic doesn't work well for UDP flows.
Commit b3bf5666a510 ("hv_netvsc: defer queue selection to VF")
fixed this performance problem.

Lastly, more indirection is bad in current climate.

I am not completely adverse to this but it needs to be fast, simple
and completely transparent.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v29 1/4] mm: support reporting free page blocks

2018-04-10 Thread Andrew Morton
On Tue, 10 Apr 2018 21:19:31 +0300 "Michael S. Tsirkin"  wrote:

> 
> Andrew, were your questions answered? If yes could I bother you for an ack on 
> this?
> 

Still not very happy that readers are told that "this function may
sleep" when it clearly doesn't do so.  If we wish to be able to change
it to sleep in the future then that should be mentioned.  And even put a
might_sleep() in there, to catch people who didn't read the comments...

Otherwise it looks OK.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[RFC PATCH net-next v6 2/4] net: Introduce generic bypass module

2018-04-10 Thread Sridhar Samudrala
This provides a generic interface for paravirtual drivers to listen
for netdev register/unregister/link change events from pci ethernet
devices with the same MAC and takeover their datapath. The notifier and
event handling code is based on the existing netvsc implementation.

It exposes 2 sets of interfaces to the paravirtual drivers.
1. existing netvsc driver that uses 2 netdev model. In this model, no
master netdev is created. The paravirtual driver registers each bypass
instance along with a set of ops to manage the slave events.
 bypass_master_register()
 bypass_master_unregister()
2. new virtio_net based solution that uses 3 netdev model. In this model,
the bypass module provides interfaces to create/destroy additional master
netdev and all the slave events are managed internally.
  bypass_master_create()
  bypass_master_destroy()

Signed-off-by: Sridhar Samudrala 
---
 include/linux/netdevice.h |  14 +
 include/net/bypass.h  |  96 ++
 net/Kconfig   |  18 +
 net/core/Makefile |   1 +
 net/core/bypass.c | 844 ++
 5 files changed, 973 insertions(+)
 create mode 100644 include/net/bypass.h
 create mode 100644 net/core/bypass.c

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cf44503ea81a..587293728f70 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1430,6 +1430,8 @@ enum netdev_priv_flags {
IFF_PHONY_HEADROOM  = 1<<24,
IFF_MACSEC  = 1<<25,
IFF_NO_RX_HANDLER   = 1<<26,
+   IFF_BYPASS  = 1 << 27,
+   IFF_BYPASS_SLAVE= 1 << 28,
 };
 
 #define IFF_802_1Q_VLANIFF_802_1Q_VLAN
@@ -1458,6 +1460,8 @@ enum netdev_priv_flags {
 #define IFF_RXFH_CONFIGUREDIFF_RXFH_CONFIGURED
 #define IFF_MACSEC IFF_MACSEC
 #define IFF_NO_RX_HANDLER  IFF_NO_RX_HANDLER
+#define IFF_BYPASS IFF_BYPASS
+#define IFF_BYPASS_SLAVE   IFF_BYPASS_SLAVE
 
 /**
  * struct net_device - The DEVICE structure.
@@ -4308,6 +4312,16 @@ static inline bool netif_is_rxfh_configured(const struct 
net_device *dev)
return dev->priv_flags & IFF_RXFH_CONFIGURED;
 }
 
+static inline bool netif_is_bypass_master(const struct net_device *dev)
+{
+   return dev->priv_flags & IFF_BYPASS;
+}
+
+static inline bool netif_is_bypass_slave(const struct net_device *dev)
+{
+   return dev->priv_flags & IFF_BYPASS_SLAVE;
+}
+
 /* This device needs to keep skb dst for qdisc enqueue or ndo_start_xmit() */
 static inline void netif_keep_dst(struct net_device *dev)
 {
diff --git a/include/net/bypass.h b/include/net/bypass.h
new file mode 100644
index ..86b02cb894cf
--- /dev/null
+++ b/include/net/bypass.h
@@ -0,0 +1,96 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2018, Intel Corporation. */
+
+#ifndef _NET_BYPASS_H
+#define _NET_BYPASS_H
+
+#include 
+
+struct bypass_ops {
+   int (*slave_pre_register)(struct net_device *slave_netdev,
+ struct net_device *bypass_netdev);
+   int (*slave_join)(struct net_device *slave_netdev,
+ struct net_device *bypass_netdev);
+   int (*slave_pre_unregister)(struct net_device *slave_netdev,
+   struct net_device *bypass_netdev);
+   int (*slave_release)(struct net_device *slave_netdev,
+struct net_device *bypass_netdev);
+   int (*slave_link_change)(struct net_device *slave_netdev,
+struct net_device *bypass_netdev);
+   rx_handler_result_t (*handle_frame)(struct sk_buff **pskb);
+};
+
+struct bypass_master {
+   struct list_head list;
+   struct net_device __rcu *bypass_netdev;
+   struct bypass_ops __rcu *ops;
+};
+
+/* bypass state */
+struct bypass_info {
+   /* passthru netdev with same MAC */
+   struct net_device __rcu *active_netdev;
+
+   /* virtio_net netdev */
+   struct net_device __rcu *backup_netdev;
+
+   /* active netdev stats */
+   struct rtnl_link_stats64 active_stats;
+
+   /* backup netdev stats */
+   struct rtnl_link_stats64 backup_stats;
+
+   /* aggregated stats */
+   struct rtnl_link_stats64 bypass_stats;
+
+   /* spinlock while updating stats */
+   spinlock_t stats_lock;
+};
+
+#if IS_ENABLED(CONFIG_NET_BYPASS)
+
+int bypass_master_create(struct net_device *backup_netdev,
+struct bypass_master **pbypass_master);
+void bypass_master_destroy(struct bypass_master *bypass_master);
+
+int bypass_master_register(struct net_device *dev, struct bypass_ops *ops,
+  struct bypass_master **pbypass_master);
+void bypass_master_unregister(struct bypass_master *bypass_master);
+
+int bypass_slave_unregister(struct net_device *slave_netdev);
+
+#else
+
+stati

[RFC PATCH net-next v6 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-10 Thread Sridhar Samudrala
This patch enables virtio_net to switch over to a VF datapath when a VF
netdev is present with the same MAC address. It allows live migration
of a VM with a direct attached VF without the need to setup a bond/team
between a VF and virtio net device in the guest.

The hypervisor needs to enable only one datapath at any time so that
packets don't get looped back to the VM over the other datapath. When a VF
is plugged, the virtio datapath link state can be marked as down. The
hypervisor needs to unplug the VF device from the guest on the source host
and reset the MAC filter of the VF to initiate failover of datapath to
virtio before starting the migration. After the migration is completed,
the destination hypervisor sets the MAC filter on the VF and plugs it back
to the guest to switch over to VF datapath.

It uses the generic bypass framework that provides 2 functions to create
and destroy a master bypass netdev. When BACKUP feature is enabled, an
additional netdev(bypass netdev) is created that acts as a master device
and tracks the state of the 2 lower netdevs. The original virtio_net netdev
is marked as 'backup' netdev and a passthru device with the same MAC is
registered as 'active' netdev.

This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2

Signed-off-by: Sridhar Samudrala 
---
 drivers/net/Kconfig  |  1 +
 drivers/net/virtio_net.c | 36 +++-
 2 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 891846655000..9e2cf61fd1c1 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -331,6 +331,7 @@ config VETH
 config VIRTIO_NET
tristate "Virtio network driver"
depends on VIRTIO
+   depends on MAY_USE_BYPASS
---help---
  This is the virtual network driver for virtio.  It can be used with
  QEMU based VMMs (like KVM or Xen).  Say Y or M.
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index befb5944f3fd..99aa52d5ac9b 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -30,8 +30,11 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 #include 
 #include 
+#include 
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
@@ -206,6 +209,9 @@ struct virtnet_info {
u32 speed;
 
unsigned long guest_offloads;
+
+   /* bypass_master created when BACKUP feature enabled */
+   struct bypass_master *bypass_master;
 };
 
 struct padded_vnet_hdr {
@@ -2275,6 +2281,22 @@ static int virtnet_xdp(struct net_device *dev, struct 
netdev_bpf *xdp)
}
 }
 
+static int virtnet_get_phys_port_name(struct net_device *dev, char *buf,
+ size_t len)
+{
+   struct virtnet_info *vi = netdev_priv(dev);
+   int ret;
+
+   if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
+   return -EOPNOTSUPP;
+
+   ret = snprintf(buf, len, "_bkup");
+   if (ret >= len)
+   return -EOPNOTSUPP;
+
+   return 0;
+}
+
 static const struct net_device_ops virtnet_netdev = {
.ndo_open= virtnet_open,
.ndo_stop= virtnet_close,
@@ -2292,6 +2314,7 @@ static const struct net_device_ops virtnet_netdev = {
.ndo_xdp_xmit   = virtnet_xdp_xmit,
.ndo_xdp_flush  = virtnet_xdp_flush,
.ndo_features_check = passthru_features_check,
+   .ndo_get_phys_port_name = virtnet_get_phys_port_name,
 };
 
 static void virtnet_config_changed_work(struct work_struct *work)
@@ -2839,10 +2862,16 @@ static int virtnet_probe(struct virtio_device *vdev)
 
virtnet_init_settings(dev);
 
+   if (virtio_has_feature(vdev, VIRTIO_NET_F_BACKUP)) {
+   err = bypass_master_create(vi->dev, &vi->bypass_master);
+   if (err)
+   goto free_vqs;
+   }
+
err = register_netdev(dev);
if (err) {
pr_debug("virtio_net: registering device failed\n");
-   goto free_vqs;
+   goto free_bypass;
}
 
virtio_device_ready(vdev);
@@ -2879,6 +2908,8 @@ static int virtnet_probe(struct virtio_device *vdev)
vi->vdev->config->reset(vdev);
 
unregister_netdev(dev);
+free_bypass:
+   bypass_master_destroy(vi->bypass_master);
 free_vqs:
cancel_delayed_work_sync(&vi->refill);
free_receive_page_frags(vi);
@@ -2913,6 +2944,8 @@ static void virtnet_remove(struct virtio_device *vdev)
 
unregister_netdev(vi->dev);
 
+   bypass_master_destroy(vi->bypass_master);
+
remove_vq_common(vi);
 
free_netdev(vi->dev);
@@ -3010,6 +3043,7 @@ static __init int virtio_net_driver_init(void)
 ret = register_virtio_driver(&virtio_net_driver);
if (ret)
goto err_virtio;
+
return 0;
 err_virtio:
cpuhp_remove_multi_sta

[RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework

2018-04-10 Thread Sridhar Samudrala
Use the registration/notification framework supported by the generic
bypass infrastructure.

Signed-off-by: Sridhar Samudrala 
---
 drivers/net/hyperv/Kconfig  |   1 +
 drivers/net/hyperv/hyperv_net.h |   2 +
 drivers/net/hyperv/netvsc_drv.c | 208 ++--
 3 files changed, 55 insertions(+), 156 deletions(-)

diff --git a/drivers/net/hyperv/Kconfig b/drivers/net/hyperv/Kconfig
index 936968d23559..cc3a721baa18 100644
--- a/drivers/net/hyperv/Kconfig
+++ b/drivers/net/hyperv/Kconfig
@@ -1,5 +1,6 @@
 config HYPERV_NET
tristate "Microsoft Hyper-V virtual network driver"
depends on HYPERV
+   depends on MAY_USE_BYPASS
help
  Select this option to enable the Hyper-V virtual network driver.
diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 960f06141472..5f8137bc5c1c 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -768,6 +768,8 @@ struct net_device_context {
u32 vf_alloc;
/* Serial number of the VF to team with */
u32 vf_serial;
+
+   struct bypass_master *bypass_master;
 };
 
 /* Per channel data */
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index ecc84954c511..87c2a276e62f 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -43,6 +43,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "hyperv_net.h"
 
@@ -1763,46 +1764,6 @@ static void netvsc_link_change(struct work_struct *w)
rtnl_unlock();
 }
 
-static struct net_device *get_netvsc_bymac(const u8 *mac)
-{
-   struct net_device *dev;
-
-   ASSERT_RTNL();
-
-   for_each_netdev(&init_net, dev) {
-   if (dev->netdev_ops != &device_ops)
-   continue;   /* not a netvsc device */
-
-   if (ether_addr_equal(mac, dev->perm_addr))
-   return dev;
-   }
-
-   return NULL;
-}
-
-static struct net_device *get_netvsc_byref(struct net_device *vf_netdev)
-{
-   struct net_device *dev;
-
-   ASSERT_RTNL();
-
-   for_each_netdev(&init_net, dev) {
-   struct net_device_context *net_device_ctx;
-
-   if (dev->netdev_ops != &device_ops)
-   continue;   /* not a netvsc device */
-
-   net_device_ctx = netdev_priv(dev);
-   if (!rtnl_dereference(net_device_ctx->nvdev))
-   continue;   /* device is removed */
-
-   if (rtnl_dereference(net_device_ctx->vf_netdev) == vf_netdev)
-   return dev; /* a match */
-   }
-
-   return NULL;
-}
-
 /* Called when VF is injecting data into network stack.
  * Change the associated network device from VF to netvsc.
  * note: already called with rcu_read_lock
@@ -1829,39 +1790,15 @@ static int netvsc_vf_join(struct net_device *vf_netdev,
  struct net_device *ndev)
 {
struct net_device_context *ndev_ctx = netdev_priv(ndev);
-   int ret;
-
-   ret = netdev_rx_handler_register(vf_netdev,
-netvsc_vf_handle_frame, ndev);
-   if (ret != 0) {
-   netdev_err(vf_netdev,
-  "can not register netvsc VF receive handler (err = 
%d)\n",
-  ret);
-   goto rx_handler_failed;
-   }
-
-   ret = netdev_upper_dev_link(vf_netdev, ndev, NULL);
-   if (ret != 0) {
-   netdev_err(vf_netdev,
-  "can not set master device %s (err = %d)\n",
-  ndev->name, ret);
-   goto upper_link_failed;
-   }
-
-   /* set slave flag before open to prevent IPv6 addrconf */
-   vf_netdev->flags |= IFF_SLAVE;
 
schedule_delayed_work(&ndev_ctx->vf_takeover, VF_TAKEOVER_INT);
 
-   call_netdevice_notifiers(NETDEV_JOIN, vf_netdev);
-
netdev_info(vf_netdev, "joined to %s\n", ndev->name);
-   return 0;
 
-upper_link_failed:
-   netdev_rx_handler_unregister(vf_netdev);
-rx_handler_failed:
-   return ret;
+   dev_hold(vf_netdev);
+   rcu_assign_pointer(ndev_ctx->vf_netdev, vf_netdev);
+
+   return 0;
 }
 
 static void __netvsc_vf_setup(struct net_device *ndev,
@@ -1914,85 +1851,82 @@ static void netvsc_vf_setup(struct work_struct *w)
rtnl_unlock();
 }
 
-static int netvsc_register_vf(struct net_device *vf_netdev)
+static int netvsc_vf_pre_register(struct net_device *vf_netdev,
+ struct net_device *ndev)
 {
-   struct net_device *ndev;
struct net_device_context *net_device_ctx;
struct netvsc_device *netvsc_dev;
 
-   if (vf_netdev->addr_len != ETH_ALEN)
-   return NOTIFY_DONE;
-
-   /*
-* We will use the MAC address to locate the synthetic interface to
-* associate with the VF interface. If we don't find a matching
-* s

[RFC PATCH net-next v6 1/4] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit

2018-04-10 Thread Sridhar Samudrala
This feature bit can be used by hypervisor to indicate virtio_net device to
act as a backup for another device with the same MAC address.

VIRTIO_NET_F_BACKUP is defined as bit 62 as it is a device feature bit.

Signed-off-by: Sridhar Samudrala 
---
 drivers/net/virtio_net.c| 2 +-
 include/uapi/linux/virtio_net.h | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7b187ec7411e..befb5944f3fd 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2962,7 +2962,7 @@ static struct virtio_device_id id_table[] = {
VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
VIRTIO_NET_F_CTRL_MAC_ADDR, \
VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
-   VIRTIO_NET_F_SPEED_DUPLEX
+   VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_BACKUP
 
 static unsigned int features[] = {
VIRTNET_FEATURES,
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
index 5de6ed37695b..c7c35fd1a5ed 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -57,6 +57,9 @@
 * Steering */
 #define VIRTIO_NET_F_CTRL_MAC_ADDR 23  /* Set MAC address */
 
+#define VIRTIO_NET_F_BACKUP  62/* Act as backup for another device
+* with the same MAC.
+*/
 #define VIRTIO_NET_F_SPEED_DUPLEX 63   /* Device set linkspeed and duplex */
 
 #ifndef VIRTIO_NET_NO_LEGACY
-- 
2.14.3

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


[RFC PATCH net-next v6 0/4] Enable virtio_net to act as a backup for a passthru device

2018-04-10 Thread Sridhar Samudrala
The main motivation for this patch is to enable cloud service providers
to provide an accelerated datapath to virtio-net enabled VMs in a 
transparent manner with no/minimal guest userspace changes. This also
enables hypervisor controlled live migration to be supported with VMs that
have direct attached SR-IOV VF devices.

Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
used by hypervisor to indicate that virtio_net interface should act as
a backup for another device with the same MAC address.

Patch 2 introduces a bypass module that provides a generic interface for 
paravirtual drivers to listen for netdev register/unregister/link change
events from pci ethernet devices with the same MAC and takeover their
datapath. The notifier and event handling code is based on the existing
netvsc implementation. It provides 2 sets of interfaces to paravirtual 
drivers to support 2-netdev(netvsc) and 3-netdev(virtio_net) models.

Patch 3 extends virtio_net to use alternate datapath when available and
registered. When BACKUP feature is enabled, virtio_net driver creates
an additional 'bypass' netdev that acts as a master device and controls
2 slave devices.  The original virtio_net netdev is registered as
'backup' netdev and a passthru/vf device with the same MAC gets
registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
associated with the same 'pci' device.  The user accesses the network
interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
as default for transmits when it is available with link up and running.

Patch 4 refactors netvsc to use the registration/notification framework
supported by bypass module.

As this patch series is initially focusing on usecases where hypervisor 
fully controls the VM networking and the guest is not expected to directly 
configure any hardware settings, it doesn't expose all the ndo/ethtool ops
that are supported by virtio_net at this time. To support additional usecases,
it should be possible to enable additional ops later by caching the state
in virtio netdev and replaying when the 'active' netdev gets registered. 
 
The hypervisor needs to enable only one datapath at any time so that packets
don't get looped back to the VM over the other datapath. When a VF is
plugged, the virtio datapath link state can be marked as down.
At the time of live migration, the hypervisor needs to unplug the VF device
from the guest on the source host and reset the MAC filter of the VF to
initiate failover of datapath to virtio before starting the migration. After
the migration is completed, the destination hypervisor sets the MAC filter
on the VF and plugs it back to the guest to switch over to VF datapath.

This patch is based on the discussion initiated by Jesse on this thread.
https://marc.info/?l=linux-virtualization&m=151189725224231&w=2

v6 RFC:
  Simplified virtio_net changes by moving all the ndo_ops of the 
  bypass_netdev and create/destroy of bypass_netdev to 'bypass' module.
  avoided 2 phase registration(driver + instances).
  introduced IFF_BYPASS/IFF_BYPASS_SLAVE dev->priv_flags 
  replaced mutex with a spinlock

v5 RFC:
  Based on Jiri's comments, moved the common functionality to a 'bypass'
  module so that the same notifier and event handlers to handle child
  register/unregister/link change events can be shared between virtio_net
  and netvsc.
  Improved error handling based on Siwei's comments.
v4:
- Based on the review comments on the v3 version of the RFC patch and
  Jakub's suggestion for the naming issue with 3 netdev solution,
  proposed 3 netdev in-driver bonding solution for virtio-net.
v3 RFC:
- Introduced 3 netdev model and pointed out a couple of issues with
  that model and proposed 2 netdev model to avoid these issues.
- Removed broadcast/multicast optimization and only use virtio as
  backup path when VF is unplugged.
v2 RFC:
- Changed VIRTIO_NET_F_MASTER to VIRTIO_NET_F_BACKUP (mst)
- made a small change to the virtio-net xmit path to only use VF datapath
  for unicasts. Broadcasts/multicasts use virtio datapath. This avoids
  east-west broadcasts to go over the PCI link.
- added suppport for the feature bit in qemu

Sridhar Samudrala (4):
  virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
  net: Introduce generic bypass module
  virtio_net: Extend virtio to use VF datapath when available
  netvsc: refactor notifier/event handling code to use the bypass
framework

 drivers/net/Kconfig |   1 +
 drivers/net/hyperv/Kconfig  |   1 +
 drivers/net/hyperv/netvsc_drv.c | 219 --
 drivers/net/virtio_net.c| 614 +++-
 include/net/bypass.h|  80 ++
 include/uapi/linux/virtio_net.h |   3 +
 net/Kconfig |  18 ++
 net/core/Makefile   |   1 +
 net/core/bypass.c   | 406 ++
 9 files changed, 1184 insertions(+), 159 deletions(-)
 create mode 100644 include/net/bypass.

Re: [PATCH v29 1/4] mm: support reporting free page blocks

2018-04-10 Thread Michael S. Tsirkin
On Mon, Mar 26, 2018 at 02:22:54PM -0700, Andrew Morton wrote:
> On Mon, 26 Mar 2018 10:39:51 +0800 Wei Wang  wrote:
> 
> > This patch adds support to walk through the free page blocks in the
> > system and report them via a callback function. Some page blocks may
> > leave the free list after zone->lock is released, so it is the caller's
> > responsibility to either detect or prevent the use of such pages.
> > 
> > One use example of this patch is to accelerate live migration by skipping
> > the transfer of free pages reported from the guest. A popular method used
> > by the hypervisor to track which part of memory is written during live
> > migration is to write-protect all the guest memory. So, those pages that
> > are reported as free pages but are written after the report function
> > returns will be captured by the hypervisor, and they will be added to the
> > next round of memory transfer.
> > 
> > ...
> >
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -4912,6 +4912,102 @@ void show_free_areas(unsigned int filter, 
> > nodemask_t *nodemask)
> > show_swap_cache_info();
> >  }
> >  
> > +/*
> > + * Walk through a free page list and report the found pfn range via the
> > + * callback.
> > + *
> > + * Return 0 if it completes the reporting. Otherwise, return the non-zero
> > + * value returned from the callback.
> > + */
> > +static int walk_free_page_list(void *opaque,
> > +  struct zone *zone,
> > +  int order,
> > +  enum migratetype mt,
> > +  int (*report_pfn_range)(void *,
> > +  unsigned long,
> > +  unsigned long))
> > +{
> > +   struct page *page;
> > +   struct list_head *list;
> > +   unsigned long pfn, flags;
> > +   int ret = 0;
> > +
> > +   spin_lock_irqsave(&zone->lock, flags);
> > +   list = &zone->free_area[order].free_list[mt];
> > +   list_for_each_entry(page, list, lru) {
> > +   pfn = page_to_pfn(page);
> > +   ret = report_pfn_range(opaque, pfn, 1 << order);
> > +   if (ret)
> > +   break;
> > +   }
> > +   spin_unlock_irqrestore(&zone->lock, flags);
> > +
> > +   return ret;
> > +}
> > +
> > +/**
> > + * walk_free_mem_block - Walk through the free page blocks in the system
> > + * @opaque: the context passed from the caller
> > + * @min_order: the minimum order of free lists to check
> > + * @report_pfn_range: the callback to report the pfn range of the free 
> > pages
> > + *
> > + * If the callback returns a non-zero value, stop iterating the list of 
> > free
> > + * page blocks. Otherwise, continue to report.
> > + *
> > + * Please note that there are no locking guarantees for the callback and
> > + * that the reported pfn range might be freed or disappear after the
> > + * callback returns so the caller has to be very careful how it is used.
> > + *
> > + * The callback itself must not sleep or perform any operations which would
> > + * require any memory allocations directly (not even GFP_NOWAIT/GFP_ATOMIC)
> > + * or via any lock dependency. It is generally advisable to implement
> > + * the callback as simple as possible and defer any heavy lifting to a
> > + * different context.
> > + *
> > + * There is no guarantee that each free range will be reported only once
> > + * during one walk_free_mem_block invocation.
> > + *
> > + * pfn_to_page on the given range is strongly discouraged and if there is
> > + * an absolute need for that make sure to contact MM people to discuss
> > + * potential problems.
> > + *
> > + * The function itself might sleep so it cannot be called from atomic
> > + * contexts.
> 
> I don't see how walk_free_mem_block() can sleep.
> 
> > + * In general low orders tend to be very volatile and so it makes more
> > + * sense to query larger ones first for various optimizations which like
> > + * ballooning etc... This will reduce the overhead as well.
> > + *
> > + * Return 0 if it completes the reporting. Otherwise, return the non-zero
> > + * value returned from the callback.
> > + */
> > +int walk_free_mem_block(void *opaque,
> > +   int min_order,
> > +   int (*report_pfn_range)(void *opaque,
> > +   unsigned long pfn,
> > +   unsigned long num))
> > +{
> > +   struct zone *zone;
> > +   int order;
> > +   enum migratetype mt;
> > +   int ret;
> > +
> > +   for_each_populated_zone(zone) {
> > +   for (order = MAX_ORDER - 1; order >= min_order; order--) {
> > +   for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> > +   ret = walk_free_page_list(opaque, zone,
> > + order, mt,
> > + report_pfn_range);
> > +   if (ret)
> > +   return ret;
> > +  

Re: [virtio-dev] [RFC] virtio-iommu version 0.6

2018-04-10 Thread Jean-Philippe Brucker
On 22/03/18 09:44, Tian, Kevin wrote:
>> From: Jean-Philippe Brucker [mailto:jean-philippe.bruc...@arm.com]
>> Sent: Wednesday, March 21, 2018 9:14 PM
>>
>> Hi Kevin,
>>
>> Thanks for the comments
>>
>> On 19/03/18 10:03, Tian, Kevin wrote:
>>> BYPASS feature bit is not covered in "2.3.1/2.3.2/2.3.3"". Is it
>>> intended?
>>
>> In my opinion BYPASS is a bit different from the other features: while the
>> others are needed for correctness, this one is optional and even if the
>> guest supports BYPASS, it should be allowed not to accept it. For security
>> reasons it may not want to let endpoints access the whole guest-physical
>> address space.
> 
> ok, possibly because I'm not familiar with virtio spec convention.
> My original feeling is that each feature bit will have a behavior
> description regarding to whether device reports and whether 
> driver accepts. If no need to cover optional feature, then it's fine
> to me. :-)

I think the virtio spec allows the reader to choose how to implement any
behavior that isn't explicitly described. Whenever a SHOULD or MUST
isn't stated in a requirement section, the spec implies "feature X MAY
be enabled if offered".

Looking at the virtio-net specification, optional features such as
VIRTIO_NET_F_VLAN aren't mentioned in driver requirements. On the other
hand, features that the device needs in order to function properly are
explicitly stated, for example VIRTIO_NET_F_MAC.

>> [...]
>>> Then comes a question for VIRTIO_IOMMU_RESV_MEM_T_MSI.
>>> I know there were quite some discussion around this flag before,
>>> but my mental picture now still is a bit difficult to understand its
>>> usage based on examples in implementation notes:
>>>
>>> - for x86, you describe it as indicating MSI bypass for
>>> oxfeex. However guest doesn't need to know this fact. Only
>>> requirement is to treat it as reserved range (as on bare metal)
>>> then T_RESERVED is sufficient for this purpose>
>>> - for ARM, either let guest or let host to choose a virtual
>>> address for mapping to MSI doorbell. the former doesn't require
>>> a reserved range. for the latter also T_RESERVED is enough as
>>> the example in hardware device assignment section.
>>
>> It might be nicer to have the host decide it, but when the physical IOMMU
>> is ARM SMMU, nesting translation complicates things, because the guest
>> *has* to create a mapping:
> 
> confirm one thing first. v0.6 doesn't support binding to guest IOVA
> page table yet. So based on current map/unmap interface, there is 
> no such complicity right? stage-1 is just a shadowed translation (IOVA
> ->HPA) to guest side (IOVA->GPA) with nesting disabled. In that case
> the default behavior is host-reserved style.

Yes, with v0.6 the host chooses whether to translate or bypass the MSIs.

> Then comes nested scenario:
> 
>>
>> * The guest is in charge of stage-1. It creates IOVA->GPA mapping for the
>>   MSI doorbell. The GPA is mapped to the physical MSI doorbell at
>>   stage-2 by the host.
> 
> Is it a must that above GPA is mapped to physical MSI doorbell?
> 
> Ideally:
> 
> 1) Host reserves some IOVA range for mapping MSI doorbell
> 2) the range is reported to user space
> 3) Qemu reported the range as reserved on endpoints, marked
> as T_IDENTITY (a new type to be introduced), meaning guest
> needs setup identity mapping in stage-1

I think it works, and is saner than my idea. I had one concern (thought
I had more but can't find any in my notes; they might resurface when
prototyping): the MSI doorbell has to be mapped with device attributes
(MMIO) to ensure proper interrupt delivery. I think it's fine because
the host maps the doorbell with the right attributes at stage-2, which
will take precedence over what the guest uses for stage-1 according to
the Arm architecture.

So I don't think T_IDENTITY requires arch-specific attributes for now
but there should definitely be space for extension. I also noticed that
Intel and AMD drivers in Linux don't add any special attribute or
protection to identity mappings (RESV_DIRECT in Linux), apart from
READ/WRITE.

For reference the previous discussion about T_IDENTITY is here:
https://www.spinics.net/lists/kvm/msg155240.html

One problem with my idea was that if the device is behind multiple IRQ
chips, the host cannot know which IRQ chip the IOVA corresponds to, when
reading the MSI-X tables. Then again I don't know why multiple IRQ chips
per device would be desirable, but it's allowed.

> 4) Then device should be able to ring physical MSI doorbell
> 5) I'm not sure whether guest still needs to allocate its own
> IOVA and mapping to vGIC doorbell in such case...

I don't think it matters, because the host MSI infrastructure is
decoupled from the guest's. The host programs the physical MSI-X tables
and only needs the guest to create a stage-1 mappings for those, which
is done with T_IDENTITY. The guest doesn't know what the T_IDENTITY
mapping is for.

Then QEMU can choose whether to map or bypass 

Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-10 Thread Samudrala, Sridhar

On 4/10/2018 8:43 AM, Jiri Pirko wrote:

Tue, Apr 10, 2018 at 05:27:48PM CEST, sridhar.samudr...@intel.com wrote:

On 4/10/2018 8:22 AM, Jiri Pirko wrote:

Tue, Apr 10, 2018 at 05:13:40PM CEST, sridhar.samudr...@intel.com wrote:

On 4/10/2018 3:55 AM, Jiri Pirko wrote:

Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudr...@intel.com wrote:

On 4/9/2018 1:07 AM, Jiri Pirko wrote:

Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudr...@intel.com wrote:

On 4/6/2018 5:48 AM, Jiri Pirko wrote:

Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudr...@intel.com wrote:

[...]


+static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
+struct net_device *child_netdev)
+{
+   struct virtnet_bypass_info *vbi;
+   bool backup;
+
+   vbi = netdev_priv(bypass_netdev);
+   backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
+   if (backup ? rtnl_dereference(vbi->backup_netdev) :
+   rtnl_dereference(vbi->active_netdev)) {
+   netdev_info(bypass_netdev,
+   "%s attempting to join bypass dev when %s already 
present\n",
+   child_netdev->name, backup ? "backup" : "active");

Bypass module should check if there is already some other netdev
enslaved and refuse right there.

This will work for virtio-net with 3 netdev model, but this check has to be 
done by netvsc
as its bypass_netdev is same as the backup_netdev.
Will add a flag while registering with the bypass module to indicate if the 
driver is doing
a 2 netdev or 3 netdev model and based on that flag this check can be done in 
bypass module
for 3 netdev scenario.

Just let me undestand it clearly. What I expect the difference would be
between 2netdev and3 netdev model is this:
2netdev:
  bypass_master
 /
/
VF_slave

3netdev:
  bypass_master
 / \
/   \
VF_slave   backup_slave

Is that correct? If not, how does it look like?



Looks correct.
VF_slave and backup_slave are the original netdevs and are present in both the 
models.
In the 3 netdev model, bypass_master netdev is created and VF_slave and 
backup_slave are
marked as the 2 slaves of this new netdev.

You say it looks correct and in another sentence you provide completely
different description. Could you please look again?


To be exact, 2 netdev model with netvsc looks like this.

 netvsc_netdev
   /
  /
VF_slave

With virtio_net, 3 netdev model

   bypass_netdev
   / \
  /   \
VF_slave   virtio_net netdev

Could you also mark the original netdev which is there now? is it
bypass_netdev or virtio_net_netdev ?

bypass_netdev
 / \
/   \
VF_slave   virtio_net netdev (original)

That does not make sense.
1) You diverge from the behaviour of the netvsc, where the original
netdev is a master of the VF
2) If the original netdev is a slave, you cannot have any IP address
configured on it (well you could, but the rx_handler would eat every
incoming packet). So you will break the user bacause he would have to
move the configuration to the new master device.
This only makes sense that the original netdev becomes the master for both
netvsc and virtio_net.
Forgot to mention that bypass_netdev takes over the name of the original 
netdev and

virtio_net netdev will get the backup name.
So the userspace network configuration doesn't need to change.


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-10 Thread Siwei Liu
On Tue, Apr 10, 2018 at 8:43 AM, Jiri Pirko  wrote:
> Tue, Apr 10, 2018 at 05:27:48PM CEST, sridhar.samudr...@intel.com wrote:
>>On 4/10/2018 8:22 AM, Jiri Pirko wrote:
>>> Tue, Apr 10, 2018 at 05:13:40PM CEST, sridhar.samudr...@intel.com wrote:
>>> > On 4/10/2018 3:55 AM, Jiri Pirko wrote:
>>> > > Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudr...@intel.com wrote:
>>> > > > On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>>> > > > > Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudr...@intel.com 
>>> > > > > wrote:
>>> > > > > > On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>>> > > > > > > Thu, Apr 05, 2018 at 11:08:22PM CEST, 
>>> > > > > > > sridhar.samudr...@intel.com wrote:
>>> > > > > [...]
>>> > > > >
>>> > > > > > > > +static int virtnet_bypass_join_child(struct net_device 
>>> > > > > > > > *bypass_netdev,
>>> > > > > > > > +   struct net_device 
>>> > > > > > > > *child_netdev)
>>> > > > > > > > +{
>>> > > > > > > > +  struct virtnet_bypass_info *vbi;
>>> > > > > > > > +  bool backup;
>>> > > > > > > > +
>>> > > > > > > > +  vbi = netdev_priv(bypass_netdev);
>>> > > > > > > > +  backup = (child_netdev->dev.parent == 
>>> > > > > > > > bypass_netdev->dev.parent);
>>> > > > > > > > +  if (backup ? rtnl_dereference(vbi->backup_netdev) :
>>> > > > > > > > +  rtnl_dereference(vbi->active_netdev)) {
>>> > > > > > > > +  netdev_info(bypass_netdev,
>>> > > > > > > > +  "%s attempting to join bypass dev 
>>> > > > > > > > when %s already present\n",
>>> > > > > > > > +  child_netdev->name, backup ? 
>>> > > > > > > > "backup" : "active");
>>> > > > > > > Bypass module should check if there is already some other netdev
>>> > > > > > > enslaved and refuse right there.
>>> > > > > > This will work for virtio-net with 3 netdev model, but this check 
>>> > > > > > has to be done by netvsc
>>> > > > > > as its bypass_netdev is same as the backup_netdev.
>>> > > > > > Will add a flag while registering with the bypass module to 
>>> > > > > > indicate if the driver is doing
>>> > > > > > a 2 netdev or 3 netdev model and based on that flag this check 
>>> > > > > > can be done in bypass module
>>> > > > > > for 3 netdev scenario.
>>> > > > > Just let me undestand it clearly. What I expect the difference 
>>> > > > > would be
>>> > > > > between 2netdev and3 netdev model is this:
>>> > > > > 2netdev:
>>> > > > >  bypass_master
>>> > > > > /
>>> > > > >/
>>> > > > > VF_slave
>>> > > > >
>>> > > > > 3netdev:
>>> > > > >  bypass_master
>>> > > > > / \
>>> > > > >/   \
>>> > > > > VF_slave   backup_slave
>>> > > > >
>>> > > > > Is that correct? If not, how does it look like?
>>> > > > >
>>> > > > >
>>> > > > Looks correct.
>>> > > > VF_slave and backup_slave are the original netdevs and are present in 
>>> > > > both the models.
>>> > > > In the 3 netdev model, bypass_master netdev is created and VF_slave 
>>> > > > and backup_slave are
>>> > > > marked as the 2 slaves of this new netdev.
>>> > > You say it looks correct and in another sentence you provide completely
>>> > > different description. Could you please look again?
>>> > >
>>> > To be exact, 2 netdev model with netvsc looks like this.
>>> >
>>> > netvsc_netdev
>>> >   /
>>> >  /
>>> > VF_slave
>>> >
>>> > With virtio_net, 3 netdev model
>>> >
>>> >   bypass_netdev
>>> >   / \
>>> >  /   \
>>> > VF_slave   virtio_net netdev
>>> Could you also mark the original netdev which is there now? is it
>>> bypass_netdev or virtio_net_netdev ?
>>
>> bypass_netdev
>> / \
>>/   \
>>VF_slave   virtio_net netdev (original)
>
> That does not make sense.
> 1) You diverge from the behaviour of the netvsc, where the original
>netdev is a master of the VF
> 2) If the original netdev is a slave, you cannot have any IP address
>configured on it (well you could, but the rx_handler would eat every
>incoming packet). So you will break the user bacause he would have to
>move the configuration to the new master device.

That's exactly the point why I need to hide the lower netdev slaves
and trying the align the naming of the bypass with where IP was
configured on the original netdev. The current 3-netdev model is not
"transparent" by any means.

-Siwei

> This only makes sense that the original netdev becomes the master for both
> netvsc and virtio_net.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-10 Thread Jiri Pirko
Tue, Apr 10, 2018 at 05:27:48PM CEST, sridhar.samudr...@intel.com wrote:
>On 4/10/2018 8:22 AM, Jiri Pirko wrote:
>> Tue, Apr 10, 2018 at 05:13:40PM CEST, sridhar.samudr...@intel.com wrote:
>> > On 4/10/2018 3:55 AM, Jiri Pirko wrote:
>> > > Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudr...@intel.com wrote:
>> > > > On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>> > > > > Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudr...@intel.com 
>> > > > > wrote:
>> > > > > > On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>> > > > > > > Thu, Apr 05, 2018 at 11:08:22PM CEST, 
>> > > > > > > sridhar.samudr...@intel.com wrote:
>> > > > > [...]
>> > > > > 
>> > > > > > > > +static int virtnet_bypass_join_child(struct net_device 
>> > > > > > > > *bypass_netdev,
>> > > > > > > > +   struct net_device 
>> > > > > > > > *child_netdev)
>> > > > > > > > +{
>> > > > > > > > +  struct virtnet_bypass_info *vbi;
>> > > > > > > > +  bool backup;
>> > > > > > > > +
>> > > > > > > > +  vbi = netdev_priv(bypass_netdev);
>> > > > > > > > +  backup = (child_netdev->dev.parent == 
>> > > > > > > > bypass_netdev->dev.parent);
>> > > > > > > > +  if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> > > > > > > > +  rtnl_dereference(vbi->active_netdev)) {
>> > > > > > > > +  netdev_info(bypass_netdev,
>> > > > > > > > +  "%s attempting to join bypass dev 
>> > > > > > > > when %s already present\n",
>> > > > > > > > +  child_netdev->name, backup ? 
>> > > > > > > > "backup" : "active");
>> > > > > > > Bypass module should check if there is already some other netdev
>> > > > > > > enslaved and refuse right there.
>> > > > > > This will work for virtio-net with 3 netdev model, but this check 
>> > > > > > has to be done by netvsc
>> > > > > > as its bypass_netdev is same as the backup_netdev.
>> > > > > > Will add a flag while registering with the bypass module to 
>> > > > > > indicate if the driver is doing
>> > > > > > a 2 netdev or 3 netdev model and based on that flag this check can 
>> > > > > > be done in bypass module
>> > > > > > for 3 netdev scenario.
>> > > > > Just let me undestand it clearly. What I expect the difference would 
>> > > > > be
>> > > > > between 2netdev and3 netdev model is this:
>> > > > > 2netdev:
>> > > > >  bypass_master
>> > > > > /
>> > > > >/
>> > > > > VF_slave
>> > > > > 
>> > > > > 3netdev:
>> > > > >  bypass_master
>> > > > > / \
>> > > > >/   \
>> > > > > VF_slave   backup_slave
>> > > > > 
>> > > > > Is that correct? If not, how does it look like?
>> > > > > 
>> > > > > 
>> > > > Looks correct.
>> > > > VF_slave and backup_slave are the original netdevs and are present in 
>> > > > both the models.
>> > > > In the 3 netdev model, bypass_master netdev is created and VF_slave 
>> > > > and backup_slave are
>> > > > marked as the 2 slaves of this new netdev.
>> > > You say it looks correct and in another sentence you provide completely
>> > > different description. Could you please look again?
>> > > 
>> > To be exact, 2 netdev model with netvsc looks like this.
>> > 
>> > netvsc_netdev
>> >   /
>> >  /
>> > VF_slave
>> > 
>> > With virtio_net, 3 netdev model
>> > 
>> >   bypass_netdev
>> >   / \
>> >  /   \
>> > VF_slave   virtio_net netdev
>> Could you also mark the original netdev which is there now? is it
>> bypass_netdev or virtio_net_netdev ?
>
> bypass_netdev
> / \
>/   \
>VF_slave   virtio_net netdev (original)

That does not make sense.
1) You diverge from the behaviour of the netvsc, where the original
   netdev is a master of the VF
2) If the original netdev is a slave, you cannot have any IP address
   configured on it (well you could, but the rx_handler would eat every
   incoming packet). So you will break the user bacause he would have to
   move the configuration to the new master device.
This only makes sense that the original netdev becomes the master for both
netvsc and virtio_net.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-10 Thread Samudrala, Sridhar

On 4/10/2018 8:22 AM, Jiri Pirko wrote:

Tue, Apr 10, 2018 at 05:13:40PM CEST, sridhar.samudr...@intel.com wrote:

On 4/10/2018 3:55 AM, Jiri Pirko wrote:

Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudr...@intel.com wrote:

On 4/9/2018 1:07 AM, Jiri Pirko wrote:

Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudr...@intel.com wrote:

On 4/6/2018 5:48 AM, Jiri Pirko wrote:

Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudr...@intel.com wrote:

[...]


+static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
+struct net_device *child_netdev)
+{
+   struct virtnet_bypass_info *vbi;
+   bool backup;
+
+   vbi = netdev_priv(bypass_netdev);
+   backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
+   if (backup ? rtnl_dereference(vbi->backup_netdev) :
+   rtnl_dereference(vbi->active_netdev)) {
+   netdev_info(bypass_netdev,
+   "%s attempting to join bypass dev when %s already 
present\n",
+   child_netdev->name, backup ? "backup" : "active");

Bypass module should check if there is already some other netdev
enslaved and refuse right there.

This will work for virtio-net with 3 netdev model, but this check has to be 
done by netvsc
as its bypass_netdev is same as the backup_netdev.
Will add a flag while registering with the bypass module to indicate if the 
driver is doing
a 2 netdev or 3 netdev model and based on that flag this check can be done in 
bypass module
for 3 netdev scenario.

Just let me undestand it clearly. What I expect the difference would be
between 2netdev and3 netdev model is this:
2netdev:
 bypass_master
/
   /
VF_slave

3netdev:
 bypass_master
/ \
   /   \
VF_slave   backup_slave

Is that correct? If not, how does it look like?



Looks correct.
VF_slave and backup_slave are the original netdevs and are present in both the 
models.
In the 3 netdev model, bypass_master netdev is created and VF_slave and 
backup_slave are
marked as the 2 slaves of this new netdev.

You say it looks correct and in another sentence you provide completely
different description. Could you please look again?


To be exact, 2 netdev model with netvsc looks like this.

netvsc_netdev
  /
 /
VF_slave

With virtio_net, 3 netdev model

  bypass_netdev
  / \
 /   \
VF_slave   virtio_net netdev

Could you also mark the original netdev which is there now? is it
bypass_netdev or virtio_net_netdev ?


 bypass_netdev
 / \
/   \
VF_slave   virtio_net netdev (original)




___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-10 Thread Jiri Pirko
Tue, Apr 10, 2018 at 05:13:40PM CEST, sridhar.samudr...@intel.com wrote:
>On 4/10/2018 3:55 AM, Jiri Pirko wrote:
>> Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudr...@intel.com wrote:
>> > On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>> > > Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudr...@intel.com wrote:
>> > > > On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>> > > > > Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudr...@intel.com 
>> > > > > wrote:
>> > > [...]
>> > > 
>> > > > > > +static int virtnet_bypass_join_child(struct net_device 
>> > > > > > *bypass_netdev,
>> > > > > > +   struct net_device *child_netdev)
>> > > > > > +{
>> > > > > > +  struct virtnet_bypass_info *vbi;
>> > > > > > +  bool backup;
>> > > > > > +
>> > > > > > +  vbi = netdev_priv(bypass_netdev);
>> > > > > > +  backup = (child_netdev->dev.parent == 
>> > > > > > bypass_netdev->dev.parent);
>> > > > > > +  if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> > > > > > +  rtnl_dereference(vbi->active_netdev)) {
>> > > > > > +  netdev_info(bypass_netdev,
>> > > > > > +  "%s attempting to join bypass dev when %s 
>> > > > > > already present\n",
>> > > > > > +  child_netdev->name, backup ? "backup" : 
>> > > > > > "active");
>> > > > > Bypass module should check if there is already some other netdev
>> > > > > enslaved and refuse right there.
>> > > > This will work for virtio-net with 3 netdev model, but this check has 
>> > > > to be done by netvsc
>> > > > as its bypass_netdev is same as the backup_netdev.
>> > > > Will add a flag while registering with the bypass module to indicate 
>> > > > if the driver is doing
>> > > > a 2 netdev or 3 netdev model and based on that flag this check can be 
>> > > > done in bypass module
>> > > > for 3 netdev scenario.
>> > > Just let me undestand it clearly. What I expect the difference would be
>> > > between 2netdev and3 netdev model is this:
>> > > 2netdev:
>> > > bypass_master
>> > >/
>> > >   /
>> > > VF_slave
>> > > 
>> > > 3netdev:
>> > > bypass_master
>> > >/ \
>> > >   /   \
>> > > VF_slave   backup_slave
>> > > 
>> > > Is that correct? If not, how does it look like?
>> > > 
>> > > 
>> > Looks correct.
>> > VF_slave and backup_slave are the original netdevs and are present in both 
>> > the models.
>> > In the 3 netdev model, bypass_master netdev is created and VF_slave and 
>> > backup_slave are
>> > marked as the 2 slaves of this new netdev.
>> You say it looks correct and in another sentence you provide completely
>> different description. Could you please look again?
>> 
>To be exact, 2 netdev model with netvsc looks like this.
>
>netvsc_netdev
>  /
> /
> VF_slave
>
>With virtio_net, 3 netdev model
>
>  bypass_netdev
>  / \
> /   \
>VF_slave   virtio_net netdev

Could you also mark the original netdev which is there now? is it
bypass_netdev or virtio_net_netdev ?
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-10 Thread Samudrala, Sridhar

On 4/10/2018 3:55 AM, Jiri Pirko wrote:

Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudr...@intel.com wrote:

On 4/9/2018 1:07 AM, Jiri Pirko wrote:

Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudr...@intel.com wrote:

On 4/6/2018 5:48 AM, Jiri Pirko wrote:

Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudr...@intel.com wrote:

[...]


+static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
+struct net_device *child_netdev)
+{
+   struct virtnet_bypass_info *vbi;
+   bool backup;
+
+   vbi = netdev_priv(bypass_netdev);
+   backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
+   if (backup ? rtnl_dereference(vbi->backup_netdev) :
+   rtnl_dereference(vbi->active_netdev)) {
+   netdev_info(bypass_netdev,
+   "%s attempting to join bypass dev when %s already 
present\n",
+   child_netdev->name, backup ? "backup" : "active");

Bypass module should check if there is already some other netdev
enslaved and refuse right there.

This will work for virtio-net with 3 netdev model, but this check has to be 
done by netvsc
as its bypass_netdev is same as the backup_netdev.
Will add a flag while registering with the bypass module to indicate if the 
driver is doing
a 2 netdev or 3 netdev model and based on that flag this check can be done in 
bypass module
for 3 netdev scenario.

Just let me undestand it clearly. What I expect the difference would be
between 2netdev and3 netdev model is this:
2netdev:
bypass_master
   /
  /
VF_slave

3netdev:
bypass_master
   / \
  /   \
VF_slave   backup_slave

Is that correct? If not, how does it look like?



Looks correct.
VF_slave and backup_slave are the original netdevs and are present in both the 
models.
In the 3 netdev model, bypass_master netdev is created and VF_slave and 
backup_slave are
marked as the 2 slaves of this new netdev.

You say it looks correct and in another sentence you provide completely
different description. Could you please look again?


To be exact, 2 netdev model with netvsc looks like this.

netvsc_netdev
  /
 /
 VF_slave

With virtio_net, 3 netdev model

  bypass_netdev
  / \
 /   \
VF_slave   virtio_net netdev

 



___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 0/2] vhost: fix vhost_vq_access_ok() log check

2018-04-10 Thread David Miller
From: Jason Wang 
Date: Tue, 10 Apr 2018 14:40:10 +0800

> On 2018年04月10日 13:26, Stefan Hajnoczi wrote:
>> v2:
>>   * Rewrote the conditional to make the vq access check clearer [Linus]
>>   * Added Patch 2 to make the return type consistent and harder to misuse
>>   * [Linus]
>>
>> The first patch fixes the vhost virtqueue access check which was
>> recently
>> broken.  The second patch replaces the int return type with bool to
>> prevent
>> future bugs.
>>
>> Stefan Hajnoczi (2):
>>vhost: fix vhost_vq_access_ok() log check
>>vhost: return bool from *_access_ok() functions
>>
>>   drivers/vhost/vhost.h |  4 +--
>>   drivers/vhost/vhost.c | 70
>>   ++-
>>   2 files changed, 38 insertions(+), 36 deletions(-)
>>
> 
> Acked-by: Jason Wang 

This sereis doesn't apply cleanly to the net tree, please respin
and add the appropriate Acks and Fixes: tags that Michael and Jason
have provided.

Thank you.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


RE: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost backend

2018-04-10 Thread Liang, Cunming


> -Original Message-
> From: Michael S. Tsirkin [mailto:m...@redhat.com]
> Sent: Tuesday, April 10, 2018 9:36 PM
> To: Liang, Cunming 
> Cc: Paolo Bonzini ; Bie, Tiwei ;
> Jason Wang ; alex.william...@redhat.com;
> ddut...@redhat.com; Duyck, Alexander H ;
> virtio-...@lists.oasis-open.org; linux-ker...@vger.kernel.org;
> k...@vger.kernel.org; virtualization@lists.linux-foundation.org;
> net...@vger.kernel.org; Daly, Dan ; Wang, Zhihong
> ; Tan, Jianfeng ; Wang, Xiao
> W 
> Subject: Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost
> backend
> 
> On Tue, Apr 10, 2018 at 09:23:53AM +, Liang, Cunming wrote:
> >
> >
> > > -Original Message-
> > > From: Paolo Bonzini [mailto:pbonz...@redhat.com]
> > > Sent: Tuesday, April 10, 2018 3:52 PM
> > > To: Bie, Tiwei ; Jason Wang
> > > 
> > > Cc: m...@redhat.com; alex.william...@redhat.com; ddut...@redhat.com;
> > > Duyck, Alexander H ;
> > > virtio-dev@lists.oasis- open.org; linux-ker...@vger.kernel.org;
> > > k...@vger.kernel.org; virtualization@lists.linux-foundation.org;
> > > net...@vger.kernel.org; Daly, Dan ; Liang,
> > > Cunming ; Wang, Zhihong
> > > ; Tan, Jianfeng ;
> > > Wang, Xiao W 
> > > Subject: Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based
> > > hardware vhost backend
> > >
> > > On 10/04/2018 06:57, Tiwei Bie wrote:
> > > >> So you just move the abstraction layer from qemu to kernel, and
> > > >> you still need different drivers in kernel for different device
> > > >> interfaces of accelerators. This looks even more complex than
> > > >> leaving it in qemu. As you said, another idea is to implement
> > > >> userspace vhost backend for accelerators which seems easier and
> > > >> could co-work with other parts of qemu without inventing new type of
> messages.
> > > >
> > > > I'm not quite sure. Do you think it's acceptable to add various
> > > > vendor specific hardware drivers in QEMU?
> > >
> > > I think so.  We have vendor-specific quirks, and at some point there
> > > was an idea of using quirks to implement (vendor-specific) live
> > > migration support for assigned devices.
> >
> > Vendor-specific quirks of accessing VGA is a small portion. Other major 
> > portions
> are still handled by guest driver.
> >
> > While in this case, when saying various vendor specific drivers in QEMU, it 
> > says
> QEMU takes over and provides the entire user space device drivers. Some parts
> are even not relevant to vhost, they're basic device function enabling. 
> Moreover,
> it could be different kinds of devices(network/block/...) under vhost. No 
> matter
> # of vendors or # of types, total LOC is not small.
> >
> > The idea is to avoid introducing these extra complexity out of QEMU, keeping
> vhost adapter simple. As vhost protocol is de factor standard, it leverages 
> kernel
> device driver to provide the diversity. Changing once in QEMU, then it 
> supports
> multi-vendor devices whose drivers naturally providing kernel driver there.
> >
> > If QEMU is going to build a user space driver framework there, we're open 
> > mind
> on that, even leveraging DPDK as the underlay library. Looking forward to more
> others' comments from community.
> >
> > Steve
> 
> Dependency on a kernel driver is fine IMHO. It's the dependency on a DPDK
> backend that makes people unhappy, since the functionality in question is 
> setup-
> time only.

Agreed, we don't see dependency on kernel driver is a problem.

mdev based vhost backend (this patch set) is independent with vhost-user 
extension patch set. In fact, there're a few vhost-user providers, DPDK 
librte_vhost is one of them. FD.IO/VPP and snabbswitch have their own 
vhost-user providers. So I can't agree on vhost-user extension patch depends on 
DPDK backend. But anyway, that's the topic of another mail thread.

> 
> As others said, we do not need to go overeboard. A couple of small vendor-
> specific quirks in qemu isn't a big deal.

It's quite challenge to identify it's small or not, there's no uniform metric.

It's only dependent on QEMU itself, that's the obvious benefit. Tradeoff is to 
build the entire device driver. We don't object to do that in QEMU, but wanna 
make sure to understand the boundary size clearly.

> 
> > >
> > > Paolo
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost backend

2018-04-10 Thread Michael S. Tsirkin
On Tue, Apr 10, 2018 at 09:23:53AM +, Liang, Cunming wrote:
> 
> 
> > -Original Message-
> > From: Paolo Bonzini [mailto:pbonz...@redhat.com]
> > Sent: Tuesday, April 10, 2018 3:52 PM
> > To: Bie, Tiwei ; Jason Wang 
> > Cc: m...@redhat.com; alex.william...@redhat.com; ddut...@redhat.com;
> > Duyck, Alexander H ; virtio-dev@lists.oasis-
> > open.org; linux-ker...@vger.kernel.org; k...@vger.kernel.org;
> > virtualization@lists.linux-foundation.org; net...@vger.kernel.org; Daly, Dan
> > ; Liang, Cunming ; Wang,
> > Zhihong ; Tan, Jianfeng ;
> > Wang, Xiao W 
> > Subject: Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware
> > vhost backend
> > 
> > On 10/04/2018 06:57, Tiwei Bie wrote:
> > >> So you just move the abstraction layer from qemu to kernel, and you
> > >> still need different drivers in kernel for different device
> > >> interfaces of accelerators. This looks even more complex than leaving
> > >> it in qemu. As you said, another idea is to implement userspace vhost
> > >> backend for accelerators which seems easier and could co-work with
> > >> other parts of qemu without inventing new type of messages.
> > >
> > > I'm not quite sure. Do you think it's acceptable to add various vendor
> > > specific hardware drivers in QEMU?
> > 
> > I think so.  We have vendor-specific quirks, and at some point there was an
> > idea of using quirks to implement (vendor-specific) live migration support 
> > for
> > assigned devices.
> 
> Vendor-specific quirks of accessing VGA is a small portion. Other major 
> portions are still handled by guest driver.
> 
> While in this case, when saying various vendor specific drivers in QEMU, it 
> says QEMU takes over and provides the entire user space device drivers. Some 
> parts are even not relevant to vhost, they're basic device function enabling. 
> Moreover, it could be different kinds of devices(network/block/...) under 
> vhost. No matter # of vendors or # of types, total LOC is not small.
> 
> The idea is to avoid introducing these extra complexity out of QEMU, keeping 
> vhost adapter simple. As vhost protocol is de factor standard, it leverages 
> kernel device driver to provide the diversity. Changing once in QEMU, then it 
> supports multi-vendor devices whose drivers naturally providing kernel driver 
> there.
> 
> If QEMU is going to build a user space driver framework there, we're open 
> mind on that, even leveraging DPDK as the underlay library. Looking forward 
> to more others' comments from community.
> 
> Steve

Dependency on a kernel driver is fine IMHO. It's the dependency on a
DPDK backend that makes people unhappy, since the functionality
in question is setup-time only.

As others said, we do not need to go overeboard. A couple of small
vendor-specific quirks in qemu isn't a big deal.

> > 
> > Paolo
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 2/2] vhost: return bool from *_access_ok() functions

2018-04-10 Thread Michael S. Tsirkin
On Tue, Apr 10, 2018 at 01:26:30PM +0800, Stefan Hajnoczi wrote:
> Currently vhost *_access_ok() functions return int.  This is error-prone
> because there are two popular conventions:
> 
> 1. 0 means failure, 1 means success
> 2. -errno means failure, 0 means success
> 
> Although vhost mostly uses #1, it does not do so consistently.
> umem_access_ok() uses #2.
> 
> This patch changes the return type from int to bool so that false means
> failure and true means success.  This eliminates a potential source of
> errors.
> 
> Suggested-by: Linus Torvalds 
> Signed-off-by: Stefan Hajnoczi 

Acked-by: Michael S. Tsirkin 

> ---
>  drivers/vhost/vhost.h |  4 ++--
>  drivers/vhost/vhost.c | 66 
> +--
>  2 files changed, 35 insertions(+), 35 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index ac4b6056f19a..6e00fa57af09 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -178,8 +178,8 @@ void vhost_dev_cleanup(struct vhost_dev *);
>  void vhost_dev_stop(struct vhost_dev *);
>  long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, void __user 
> *argp);
>  long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp);
> -int vhost_vq_access_ok(struct vhost_virtqueue *vq);
> -int vhost_log_access_ok(struct vhost_dev *);
> +bool vhost_vq_access_ok(struct vhost_virtqueue *vq);
> +bool vhost_log_access_ok(struct vhost_dev *);
>  
>  int vhost_get_vq_desc(struct vhost_virtqueue *,
> struct iovec iov[], unsigned int iov_count,
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 93fd0c75b0d8..b6a082ef33dd 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -641,14 +641,14 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
>  
> -static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
> +static bool log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
>  {
>   u64 a = addr / VHOST_PAGE_SIZE / 8;
>  
>   /* Make sure 64 bit math will not overflow. */
>   if (a > ULONG_MAX - (unsigned long)log_base ||
>   a + (unsigned long)log_base > ULONG_MAX)
> - return 0;
> + return false;
>  
>   return access_ok(VERIFY_WRITE, log_base + a,
>(sz + VHOST_PAGE_SIZE * 8 - 1) / VHOST_PAGE_SIZE / 8);
> @@ -661,30 +661,30 @@ static bool vhost_overflow(u64 uaddr, u64 size)
>  }
>  
>  /* Caller should have vq mutex and device mutex. */
> -static int vq_memory_access_ok(void __user *log_base, struct vhost_umem 
> *umem,
> -int log_all)
> +static bool vq_memory_access_ok(void __user *log_base, struct vhost_umem 
> *umem,
> + int log_all)
>  {
>   struct vhost_umem_node *node;
>  
>   if (!umem)
> - return 0;
> + return false;
>  
>   list_for_each_entry(node, &umem->umem_list, link) {
>   unsigned long a = node->userspace_addr;
>  
>   if (vhost_overflow(node->userspace_addr, node->size))
> - return 0;
> + return false;
>  
>  
>   if (!access_ok(VERIFY_WRITE, (void __user *)a,
>   node->size))
> - return 0;
> + return false;
>   else if (log_all && !log_access_ok(log_base,
>  node->start,
>  node->size))
> - return 0;
> + return false;
>   }
> - return 1;
> + return true;
>  }
>  
>  static inline void __user *vhost_vq_meta_fetch(struct vhost_virtqueue *vq,
> @@ -701,13 +701,13 @@ static inline void __user *vhost_vq_meta_fetch(struct 
> vhost_virtqueue *vq,
>  
>  /* Can we switch to this memory table? */
>  /* Caller should have device mutex but not vq mutex */
> -static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
> - int log_all)
> +static bool memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
> +  int log_all)
>  {
>   int i;
>  
>   for (i = 0; i < d->nvqs; ++i) {
> - int ok;
> + bool ok;
>   bool log;
>  
>   mutex_lock(&d->vqs[i]->mutex);
> @@ -717,12 +717,12 @@ static int memory_access_ok(struct vhost_dev *d, struct 
> vhost_umem *umem,
>   ok = vq_memory_access_ok(d->vqs[i]->log_base,
>umem, log);
>   else
> - ok = 1;
> + ok = true;
>   mutex_unlock(&d->vqs[i]->mutex);
>   if (!ok)
> - return 0;
> + return false;
>   }
> - return 1;
> + return true;
>  }
>  
>  static int translate_desc(struct vh

Re: [PATCH v2 1/2] vhost: fix vhost_vq_access_ok() log check

2018-04-10 Thread Michael S. Tsirkin
On Tue, Apr 10, 2018 at 04:22:35PM +0300, Michael S. Tsirkin wrote:
> On Tue, Apr 10, 2018 at 01:26:29PM +0800, Stefan Hajnoczi wrote:
> > Commit d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log
> > when IOTLB is enabled") introduced a regression.  The logic was
> > originally:
> > 
> >   if (vq->iotlb)
> >   return 1;
> >   return A && B;
> > 
> > After the patch the short-circuit logic for A was inverted:
> > 
> >   if (A || vq->iotlb)
> >   return A;
> >   return B;
> > 
> > This patch fixes the regression by rewriting the checks in the obvious
> > way, no longer returning A when vq->iotlb is non-NULL (which is hard to
> > understand).
> > 
> > Reported-by: syzbot+65a84dde0214b0387...@syzkaller.appspotmail.com
> > Cc: Jason Wang 
> > Signed-off-by: Stefan Hajnoczi 
> 
> This patch only makes sense after patch 2/2 is applied.
> Otherwise the logic seems reversed below.
> 
> Can you pls squash these two?

Oh, in fact the patch is ok. This just goes to show
that patch 2/2 is useful. But squashing is not required.
Sorry about the noise.

Acked-by: Michael S. Tsirkin 
Fixes: d65026c6c ("vhost: validate log when IOTLB is enabled")

probably stable material since patch it fixes was queued to stable?


> > ---
> >  drivers/vhost/vhost.c | 8 +---
> >  1 file changed, 5 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 5320039671b7..93fd0c75b0d8 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -1244,10 +1244,12 @@ static int vq_log_access_ok(struct vhost_virtqueue 
> > *vq,
> >  /* Caller should have vq mutex and device mutex */
> >  int vhost_vq_access_ok(struct vhost_virtqueue *vq)
> >  {
> > -   int ret = vq_log_access_ok(vq, vq->log_base);
> > +   if (!vq_log_access_ok(vq, vq->log_base))
> > +   return 0;
> >  
> > -   if (ret || vq->iotlb)
> > -   return ret;
> > +   /* Access validation occurs at prefetch time with IOTLB */
> > +   if (vq->iotlb)
> > +   return 1;
> >  
> > return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used);
> >  }
> > -- 
> > 2.14.3
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2 1/2] vhost: fix vhost_vq_access_ok() log check

2018-04-10 Thread Michael S. Tsirkin
On Tue, Apr 10, 2018 at 01:26:29PM +0800, Stefan Hajnoczi wrote:
> Commit d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log
> when IOTLB is enabled") introduced a regression.  The logic was
> originally:
> 
>   if (vq->iotlb)
>   return 1;
>   return A && B;
> 
> After the patch the short-circuit logic for A was inverted:
> 
>   if (A || vq->iotlb)
>   return A;
>   return B;
> 
> This patch fixes the regression by rewriting the checks in the obvious
> way, no longer returning A when vq->iotlb is non-NULL (which is hard to
> understand).
> 
> Reported-by: syzbot+65a84dde0214b0387...@syzkaller.appspotmail.com
> Cc: Jason Wang 
> Signed-off-by: Stefan Hajnoczi 

This patch only makes sense after patch 2/2 is applied.
Otherwise the logic seems reversed below.

Can you pls squash these two?

> ---
>  drivers/vhost/vhost.c | 8 +---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 5320039671b7..93fd0c75b0d8 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1244,10 +1244,12 @@ static int vq_log_access_ok(struct vhost_virtqueue 
> *vq,
>  /* Caller should have vq mutex and device mutex */
>  int vhost_vq_access_ok(struct vhost_virtqueue *vq)
>  {
> - int ret = vq_log_access_ok(vq, vq->log_base);
> + if (!vq_log_access_ok(vq, vq->log_base))
> + return 0;
>  
> - if (ret || vq->iotlb)
> - return ret;
> + /* Access validation occurs at prefetch time with IOTLB */
> + if (vq->iotlb)
> + return 1;
>  
>   return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used);
>  }
> -- 
> 2.14.3
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-10 Thread Jiri Pirko
Mon, Apr 09, 2018 at 08:47:06PM CEST, sridhar.samudr...@intel.com wrote:
>On 4/9/2018 1:07 AM, Jiri Pirko wrote:
>> Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudr...@intel.com wrote:
>> > On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>> > > Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudr...@intel.com wrote:
>> [...]
>> 
>> > > > +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
>> > > > +   struct net_device *child_netdev)
>> > > > +{
>> > > > +  struct virtnet_bypass_info *vbi;
>> > > > +  bool backup;
>> > > > +
>> > > > +  vbi = netdev_priv(bypass_netdev);
>> > > > +  backup = (child_netdev->dev.parent == 
>> > > > bypass_netdev->dev.parent);
>> > > > +  if (backup ? rtnl_dereference(vbi->backup_netdev) :
>> > > > +  rtnl_dereference(vbi->active_netdev)) {
>> > > > +  netdev_info(bypass_netdev,
>> > > > +  "%s attempting to join bypass dev when %s 
>> > > > already present\n",
>> > > > +  child_netdev->name, backup ? "backup" : 
>> > > > "active");
>> > > Bypass module should check if there is already some other netdev
>> > > enslaved and refuse right there.
>> > This will work for virtio-net with 3 netdev model, but this check has to 
>> > be done by netvsc
>> > as its bypass_netdev is same as the backup_netdev.
>> > Will add a flag while registering with the bypass module to indicate if 
>> > the driver is doing
>> > a 2 netdev or 3 netdev model and based on that flag this check can be done 
>> > in bypass module
>> > for 3 netdev scenario.
>> Just let me undestand it clearly. What I expect the difference would be
>> between 2netdev and3 netdev model is this:
>> 2netdev:
>>bypass_master
>>   /
>>  /
>> VF_slave
>> 
>> 3netdev:
>>bypass_master
>>   / \
>>  /   \
>> VF_slave   backup_slave
>> 
>> Is that correct? If not, how does it look like?
>> 
>> 
>Looks correct.
>VF_slave and backup_slave are the original netdevs and are present in both the 
>models.
>In the 3 netdev model, bypass_master netdev is created and VF_slave and 
>backup_slave are
>marked as the 2 slaves of this new netdev.

You say it looks correct and in another sentence you provide completely
different description. Could you please look again?

[...]
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


RE: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost backend

2018-04-10 Thread Liang, Cunming


> -Original Message-
> From: Paolo Bonzini [mailto:pbonz...@redhat.com]
> Sent: Tuesday, April 10, 2018 3:52 PM
> To: Bie, Tiwei ; Jason Wang 
> Cc: m...@redhat.com; alex.william...@redhat.com; ddut...@redhat.com;
> Duyck, Alexander H ; virtio-dev@lists.oasis-
> open.org; linux-ker...@vger.kernel.org; k...@vger.kernel.org;
> virtualization@lists.linux-foundation.org; net...@vger.kernel.org; Daly, Dan
> ; Liang, Cunming ; Wang,
> Zhihong ; Tan, Jianfeng ;
> Wang, Xiao W 
> Subject: Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware
> vhost backend
> 
> On 10/04/2018 06:57, Tiwei Bie wrote:
> >> So you just move the abstraction layer from qemu to kernel, and you
> >> still need different drivers in kernel for different device
> >> interfaces of accelerators. This looks even more complex than leaving
> >> it in qemu. As you said, another idea is to implement userspace vhost
> >> backend for accelerators which seems easier and could co-work with
> >> other parts of qemu without inventing new type of messages.
> >
> > I'm not quite sure. Do you think it's acceptable to add various vendor
> > specific hardware drivers in QEMU?
> 
> I think so.  We have vendor-specific quirks, and at some point there was an
> idea of using quirks to implement (vendor-specific) live migration support for
> assigned devices.

Vendor-specific quirks of accessing VGA is a small portion. Other major 
portions are still handled by guest driver.

While in this case, when saying various vendor specific drivers in QEMU, it 
says QEMU takes over and provides the entire user space device drivers. Some 
parts are even not relevant to vhost, they're basic device function enabling. 
Moreover, it could be different kinds of devices(network/block/...) under 
vhost. No matter # of vendors or # of types, total LOC is not small.

The idea is to avoid introducing these extra complexity out of QEMU, keeping 
vhost adapter simple. As vhost protocol is de factor standard, it leverages 
kernel device driver to provide the diversity. Changing once in QEMU, then it 
supports multi-vendor devices whose drivers naturally providing kernel driver 
there.

If QEMU is going to build a user space driver framework there, we're open mind 
on that, even leveraging DPDK as the underlay library. Looking forward to more 
others' comments from community.

Steve

> 
> Paolo
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [virtio-dev] Re: [RFC] vhost: introduce mdev based hardware vhost backend

2018-04-10 Thread Paolo Bonzini
On 10/04/2018 06:57, Tiwei Bie wrote:
>> So you just move the abstraction layer from qemu to kernel, and you still
>> need different drivers in kernel for different device interfaces of
>> accelerators. This looks even more complex than leaving it in qemu. As you
>> said, another idea is to implement userspace vhost backend for accelerators
>> which seems easier and could co-work with other parts of qemu without
>> inventing new type of messages.
> 
> I'm not quite sure. Do you think it's acceptable to
> add various vendor specific hardware drivers in QEMU?

I think so.  We have vendor-specific quirks, and at some point there was
an idea of using quirks to implement (vendor-specific) live migration
support for assigned devices.

Paolo
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC] vhost: introduce mdev based hardware vhost backend

2018-04-10 Thread Jason Wang



On 2018年04月10日 12:57, Tiwei Bie wrote:

On Tue, Apr 10, 2018 at 10:52:52AM +0800, Jason Wang wrote:

On 2018年04月02日 23:23, Tiwei Bie wrote:

This patch introduces a mdev (mediated device) based hardware
vhost backend. This backend is an abstraction of the various
hardware vhost accelerators (potentially any device that uses
virtio ring can be used as a vhost accelerator). Some generic
mdev parent ops are provided for accelerator drivers to support
generating mdev instances.

What's this
===

The idea is that we can setup a virtio ring compatible device
with the messages available at the vhost-backend. Originally,
these messages are used to implement a software vhost backend,
but now we will use these messages to setup a virtio ring
compatible hardware device. Then the hardware device will be
able to work with the guest virtio driver in the VM just like
what the software backend does. That is to say, we can implement
a hardware based vhost backend in QEMU, and any virtio ring
compatible devices potentially can be used with this backend.
(We also call it vDPA -- vhost Data Path Acceleration).

One problem is that, different virtio ring compatible devices
may have different device interfaces. That is to say, we will
need different drivers in QEMU. It could be troublesome. And
that's what this patch trying to fix. The idea behind this
patch is very simple: mdev is a standard way to emulate device
in kernel.

So you just move the abstraction layer from qemu to kernel, and you still
need different drivers in kernel for different device interfaces of
accelerators. This looks even more complex than leaving it in qemu. As you
said, another idea is to implement userspace vhost backend for accelerators
which seems easier and could co-work with other parts of qemu without
inventing new type of messages.

I'm not quite sure. Do you think it's acceptable to
add various vendor specific hardware drivers in QEMU?



I don't object but we need to figure out the advantages of doing it in 
qemu too.


Thanks
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization