Re: [PATCH net-next 1/2] virtio_net: Fix short frame length check

2023-01-13 Thread Alexander Duyck
On Fri, Jan 13, 2023 at 4:23 PM Alexander Duyck
 wrote:
>
> On Fri, Jan 13, 2023 at 3:37 PM Parav Pandit  wrote:
> >
> >
> > > From: Alexander H Duyck 
> > > Sent: Friday, January 13, 2023 6:24 PM
> > >
> > > On Sat, 2023-01-14 at 00:36 +0200, Parav Pandit wrote:
> > > > The smallest Ethernet frame defined by IEEE 802.3 is 60 bytes,
> > > > excluding the preamble and CRC.
> > > >
> > > > The current code only checks for the minimal 14-byte Ethernet header length.
> > > > Correct it to check against the minimum Ethernet frame length.
> > > >
> > > > Fixes: 296f96fcfc16 ("Net driver using virtio")
> > > > Signed-off-by: Parav Pandit 
> > > > ---
> > > >  drivers/net/virtio_net.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index
> > > > 7723b2a49d8e..d45e140b6852 100644
> > > > --- a/drivers/net/virtio_net.c
> > > > +++ b/drivers/net/virtio_net.c
> > > > @@ -1248,7 +1248,7 @@ static void receive_buf(struct virtnet_info *vi,
> > > struct receive_queue *rq,
> > > > struct sk_buff *skb;
> > > > struct virtio_net_hdr_mrg_rxbuf *hdr;
> > > >
> > > > -   if (unlikely(len < vi->hdr_len + ETH_HLEN)) {
> > > > +   if (unlikely(len < vi->hdr_len + ETH_ZLEN)) {
> > > > pr_debug("%s: short packet %i\n", dev->name, len);
> > > > dev->stats.rx_length_errors++;
> > > > if (vi->mergeable_rx_bufs) {
> > >
> > > I'm not sure I agree with this change: packets are only 60B if they
> > > have gone across the wire, since they are usually padded out on the
> > > transmit side. There may be cases where software-routed packets are
> > > not 60B.
> > >
> > Do you mean Linux kernel software? Any link to it would be helpful.
>
> The problem is there are several software paths involved and that is
> why I am wanting to be cautious. As I recall this would impact Qemu
> itself, DPDK, the Linux Kernel and several others if I am not
> mistaken. That is why I am tending to err on the side of caution as
> this is a pretty significant change.
>
> > > As such rather than changing out ETH_HLEN for ETH_ZLEN I wonder if we
> > > should look at maybe making this a "<=" comparison instead since that is 
> > > the
> > > only case I can think of where the packet would end up being entirely 
> > > empty
> > > after eth_type_trans is called and we would be passing an skb with length 
> > > 0.
> >
> > I likely didn't understand your comment.
> > This driver check happens before creating the skb for the received packet.
> > So the purpose is to not even process the packet header or prepare the
> > skb if it is not a valid Ethernet frame.
> >
> > It would be interesting to know when we get a < 60B frame.
>
> If I recall, a UDPv4 frame can easily do it: the Ethernet header is 14B,
> the IPv4 header is 20B, and UDP is only 8B, which comes to just 42B.
> Similarly, a TCPv4 frame can be as small as 54B if you disable all the
> option headers.
>
> A quick and dirty test would be to run something like a netperf UDP_RR
> test. I know in the case of the network stack we see the transmits
> that go out are less than 60B until they are padded on xmit, usually
> by the device. My concern is wanting to make sure all those paths are
> covered before we assume that all the packets will be padded.

I was curious so I decided to try verifying things with a qemu w/ user
networking and virtio-net. From what I can tell it looks like it is
definitely not padding them out.

19:34:38.331376 IP (tos 0x0, ttl 64, id 31799, offset 0, flags [DF],
proto UDP (17), length 29)
localhost.localdomain.59579 > _gateway.52701: [udp sum ok] UDP, length 1
0x0000:  5255 0a00 0202 5254 0012 3456 0800 4500
0x0010:  001d 7c37 4000 4011 a688 0a00 020f 0a00
0x0020:  0202 e8bb cddd 0009 c331 6e
19:34:38.331431 IP (tos 0x0, ttl 64, id 45459, offset 0, flags [none],
proto UDP (17), length 29)
_gateway.52701 > localhost.localdomain.59579: [udp sum ok] UDP, length 1
0x0000:  5254 0012 3456 5255 0a00 0202 0800 4500
0x0010:  001d b193 0000 4011 b12c 0a00 0202 0a00
0x0020:  020f cddd e8bb 0009 c331 6e
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH 4/6] hugetlb: avoid allocation failed when page reporting is on going

2021-01-07 Thread Alexander Duyck
On Wed, Jan 6, 2021 at 7:57 PM Liang Li  wrote:
>
> > > Page reporting isolates free pages temporarily when reporting
> > > free page information. It will reduce the actual free pages
> > > and may cause application failures due to not enough available memory.
> > > This patch tries to solve this issue: when there is no free page
> > > and page reporting is ongoing, wait until it is done.
> > >
> > > Cc: Alexander Duyck 
> >
> > Please don't use this email address for me anymore. Either use
> > alexander.du...@gmail.com or alexanderdu...@fb.com. I am getting
> > bounces when I reply to this thread because of the old address.
>
> No problem.
>
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index eb533995cb49..0fccd5f96954 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -2320,6 +2320,12 @@ struct page *alloc_huge_page(struct vm_area_struct 
> > > *vma,
> > > goto out_uncharge_cgroup_reservation;
> > >
> > > spin_lock(&hugetlb_lock);
> > > +   while (h->free_huge_pages <= 1 && h->isolated_huge_pages) {
> > > +   spin_unlock(&hugetlb_lock);
> > > +   mutex_lock(&h->mtx_prezero);
> > > +   mutex_unlock(&h->mtx_prezero);
> > > +   spin_lock(&hugetlb_lock);
> > > +   }
> >
> > This seems like a bad idea. It kind of defeats the whole point of
> > doing the page zeroing outside of the hugetlb_lock. Also it is
> > operating on the assumption that the only way you might get a page is
> > from the page zeroing logic.
> >
> > With the page reporting code we wouldn't drop the count to zero. We
> > had checks that were going through and monitoring the watermarks and
> > if we started to hit the low watermark we would stop page reporting
> > and just assume there aren't enough pages to report. You might need to
> > look at doing something similar here so that you can avoid colliding
> > with the allocator.
>
> For hugetlb, things are a little different, just as Mike points out:
>  "On some systems, hugetlb pages are a precious resource and
>   the sysadmin carefully configures the number needed by
>   applications.  Removing a hugetlb page (even for a very short
>   period of time) could cause serious application failure."
>
> Just keeping some pages in the freelist is not enough to prevent that
> from happening, because these pages may be allocated while zeroing is
> ongoing, and an application may still run into a situation where no
> free pages are available.

I get what you are saying. However I don't know if it is acceptable
for the allocating thread to be put to sleep in this situation. There
are two scenarios where I can see this being problematic.

One is a setup where you put the allocating thread to sleep; while it
is sleeping, another thread frees a page, but your thread cannot
respond to that newly freed page because it is stuck waiting on the
zeroed page.

The second issue is that users may want a different option: breaking
the request up into smaller pages rather than waiting on the page
zeroing, or doing something else while waiting on the page. So instead
of sitting on the request and waiting, it might make more sense to
return an error pointer like -EAGAIN or -EBUSY to indicate that there
is a page there, but it is momentarily tied up.


Re: [PATCH 4/6] hugetlb: avoid allocation failed when page reporting is on going

2021-01-06 Thread Alexander Duyck
On Tue, Jan 5, 2021 at 7:50 PM Liang Li  wrote:
>
> Page reporting isolates free pages temporarily when reporting
> free page information. It will reduce the actual free pages
> and may cause application failures due to not enough available memory.
> This patch tries to solve this issue: when there is no free page
> and page reporting is ongoing, wait until it is done.
>
> Cc: Alexander Duyck 

Please don't use this email address for me anymore. Either use
alexander.du...@gmail.com or alexanderdu...@fb.com. I am getting
bounces when I reply to this thread because of the old address.

> Cc: Mel Gorman 
> Cc: Andrea Arcangeli 
> Cc: Dan Williams 
> Cc: Dave Hansen 
> Cc: David Hildenbrand 
> Cc: Michal Hocko 
> Cc: Andrew Morton 
> Cc: Alex Williamson 
> Cc: Michael S. Tsirkin 
> Cc: Liang Li 
> Signed-off-by: Liang Li 
> ---
>  include/linux/hugetlb.h | 2 ++
>  mm/hugetlb.c| 9 +
>  mm/page_reporting.c | 6 +-
>  3 files changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index d55e6a00b3dc..73b2934ba91c 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -490,6 +490,7 @@ struct hstate {
> unsigned long resv_huge_pages;
> unsigned long surplus_huge_pages;
> unsigned long nr_overcommit_huge_pages;
> +   unsigned long isolated_huge_pages;
> struct list_head hugepage_activelist;
> struct list_head hugepage_freelists[MAX_NUMNODES];
> unsigned int nr_huge_pages_node[MAX_NUMNODES];
> @@ -500,6 +501,7 @@ struct hstate {
> struct cftype cgroup_files_dfl[7];
> struct cftype cgroup_files_legacy[9];
>  #endif
> +   struct mutex mtx_prezero;
> char name[HSTATE_NAME_LEN];
>  };
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index eb533995cb49..0fccd5f96954 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2320,6 +2320,12 @@ struct page *alloc_huge_page(struct vm_area_struct 
> *vma,
> goto out_uncharge_cgroup_reservation;
>
> spin_lock(&hugetlb_lock);
> +   while (h->free_huge_pages <= 1 && h->isolated_huge_pages) {
> +   spin_unlock(&hugetlb_lock);
> +   mutex_lock(&h->mtx_prezero);
> +   mutex_unlock(&h->mtx_prezero);
> +   spin_lock(&hugetlb_lock);
> +   }

This seems like a bad idea. It kind of defeats the whole point of
doing the page zeroing outside of the hugetlb_lock. Also it is
operating on the assumption that the only way you might get a page is
from the page zeroing logic.

With the page reporting code we wouldn't drop the count to zero. We
had checks that were going through and monitoring the watermarks and
if we started to hit the low watermark we would stop page reporting
and just assume there aren't enough pages to report. You might need to
look at doing something similar here so that you can avoid colliding
with the allocator.


> /*
>  * glb_chg is passed to indicate whether or not a page must be taken
>  * from the global free pool (global change).  gbl_chg == 0 indicates
> @@ -3208,6 +3214,7 @@ void __init hugetlb_add_hstate(unsigned int order)
> INIT_LIST_HEAD(&h->hugepage_activelist);
> h->next_nid_to_alloc = first_memory_node;
> h->next_nid_to_free = first_memory_node;
> +   mutex_init(&h->mtx_prezero);
> snprintf(h->name, HSTATE_NAME_LEN, "hugepages-%lukB",
> huge_page_size(h)/1024);
>
> @@ -5541,6 +5548,7 @@ void isolate_free_huge_page(struct page *page, struct 
> hstate *h, int nid)
>
> list_move(&page->lru, &h->hugepage_activelist);
> set_page_refcounted(page);
> +   h->isolated_huge_pages++;
>  }
>
>  void putback_isolate_huge_page(struct hstate *h, struct page *page)
> @@ -5548,6 +5556,7 @@ void putback_isolate_huge_page(struct hstate *h, struct 
> page *page)
> int nid = page_to_nid(page);
>
> list_move(&page->lru, &h->hugepage_freelists[nid]);
> +   h->isolated_huge_pages--;
>  }
>
>  bool isolate_huge_page(struct page *page, struct list_head *list)
> diff --git a/mm/page_reporting.c b/mm/page_reporting.c
> index cc31696225bb..99e1e688d7c1 100644
> --- a/mm/page_reporting.c
> +++ b/mm/page_reporting.c
> @@ -272,12 +272,15 @@ hugepage_reporting_process_hstate(struct 
> page_reporting_dev_info *prdev,
> int ret = 0, nid;
>
> offset = max_items;
> +   mutex_lock(&h->mtx_prezero);
> for (nid = 0; nid < MAX_NUMNODES; nid++) {
> ret = hugepage_reporting_cycle(prdev, h, nid, sgl, &offset,
>  

Re: [PATCH 2/6] mm: let user decide page reporting option

2021-01-06 Thread Alexander Duyck
On Tue, Jan 5, 2021 at 7:48 PM Liang Li  wrote:
>
> Some key parameters for page reporting are currently hard coded;
> different users of the framework may have their own requirements. Make
> these parameters configurable and let the user decide them.
>
> Cc: Alexander Duyck 
> Cc: Mel Gorman 
> Cc: Andrea Arcangeli 
> Cc: Dan Williams 
> Cc: Dave Hansen 
> Cc: David Hildenbrand 
> Cc: Michal Hocko 
> Cc: Andrew Morton 
> Cc: Alex Williamson 
> Cc: Michael S. Tsirkin 
> Cc: Liang Li 
> Signed-off-by: Liang Li 
> ---
>  drivers/virtio/virtio_balloon.c |  3 +++
>  include/linux/page_reporting.h  |  3 +++
>  mm/page_reporting.c | 13 +
>  mm/page_reporting.h |  6 +++---
>  4 files changed, 18 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 8985fc2cea86..684bcc39ef5a 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -993,6 +993,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
> goto out_unregister_oom;
> }
>
> +   vb->pr_dev_info.mini_order = pageblock_order;
> +   vb->pr_dev_info.batch_size = 16 * 1024 * 1024; /* 16M */
> +   vb->pr_dev_info.delay_jiffies = 2 * HZ; /* 2 seconds */
> err = page_reporting_register(&vb->pr_dev_info);
> if (err)
> goto out_unregister_oom;
> diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
> index 3b99e0ec24f2..63e1e9fbcaa2 100644
> --- a/include/linux/page_reporting.h
> +++ b/include/linux/page_reporting.h
> @@ -13,6 +13,9 @@ struct page_reporting_dev_info {
> int (*report)(struct page_reporting_dev_info *prdev,
>   struct scatterlist *sg, unsigned int nents);
>
> +   unsigned long batch_size;
> +   unsigned long delay_jiffies;
> +   int mini_order;
> /* work struct for processing reports */
> struct delayed_work work;
>
> diff --git a/mm/page_reporting.c b/mm/page_reporting.c
> index 694df981ddd2..39bc6a9d7b73 100644
> --- a/mm/page_reporting.c
> +++ b/mm/page_reporting.c
> @@ -13,6 +13,7 @@
>  #define PAGE_REPORTING_DELAY   (2 * HZ)
>  static struct page_reporting_dev_info __rcu *pr_dev_info __read_mostly;
>  unsigned long page_report_batch_size  __read_mostly = 16 * 1024 * 1024UL;
> +int page_report_mini_order = pageblock_order;
>
>  enum {
> PAGE_REPORTING_IDLE = 0,
> @@ -44,7 +45,7 @@ __page_reporting_request(struct page_reporting_dev_info 
> *prdev)
>  * now we are limiting this to running no more than once every
>  * couple of seconds.
>  */
> -   schedule_delayed_work(&prdev->work, PAGE_REPORTING_DELAY);
> +   schedule_delayed_work(&prdev->work, prdev->delay_jiffies);
>  }
>

So this ends up being the reason why you needed to add the batch size
value. However, I don't really see it working as expected, since you
could essentially have 1 page freed 4M times and that alone could
trigger your page zeroing logic. For example, if a NIC is processing
frames and ends up freeing and then reallocating some small batch of
pages, this would be running often even though there aren't really
that many pages that need zeroing.

>  /* notify prdev of free page reporting request */
> @@ -230,7 +231,7 @@ page_reporting_process_zone(struct 
> page_reporting_dev_info *prdev,
>
> /* Generate minimum watermark to be able to guarantee progress */
> watermark = low_wmark_pages(zone) +
> -   (PAGE_REPORTING_CAPACITY << PAGE_REPORTING_MIN_ORDER);
> +   (PAGE_REPORTING_CAPACITY << prdev->mini_order);
>
> /*
>  * Cancel request if insufficient free memory or if we failed

With the page order being able to be greatly reduced, this could have a
significant impact on whether this code really has any value. Previously we
were able to guarantee a pretty significant number of higher order
pages free. With this we might only be guaranteeing something like 32
4K pages which is pretty small compared to what can end up being
pulled out at the higher end.

> @@ -240,7 +241,7 @@ page_reporting_process_zone(struct 
> page_reporting_dev_info *prdev,
> return err;
>
> /* Process each free list starting from lowest order/mt */
> -   for (order = PAGE_REPORTING_MIN_ORDER; order < MAX_ORDER; order++) {
> +   for (order = prdev->mini_order; order < MAX_ORDER; order++) {
> for (mt = 0; mt < MIGRATE_TYPES; mt++) {
> /* We do not pull pages from the isolate free

Re: [PATCH 1/6] mm: Add batch size for free page reporting

2021-01-06 Thread Alexander Duyck
On Tue, Jan 5, 2021 at 7:47 PM Liang Li  wrote:
>
> Using the page order as the only threshold for page reporting
> is not flexible and has some flaws. Because scanning a long free
> list is not cheap, it's better to wake up the page reporting
> worker when there are more pages; waking it up for a single page
> may not be worth it.
> This patch adds a batch size as another threshold to control the
> waking up of the reporting worker.
>
> Cc: Alexander Duyck 
> Cc: Mel Gorman 
> Cc: Andrea Arcangeli 
> Cc: Dan Williams 
> Cc: Dave Hansen 
> Cc: David Hildenbrand 
> Cc: Michal Hocko 
> Cc: Andrew Morton 
> Cc: Alex Williamson 
> Cc: Michael S. Tsirkin 
> Cc: Liang Li 
> Signed-off-by: Liang Li 

So you are going to need a lot more explanation for this. Page
reporting already had the concept of batching as you could only scan
once every 2 seconds as I recall. Thus the "PAGE_REPORTING_DELAY". The
change you are making doesn't make any sense without additional
context.

> ---
>  mm/page_reporting.c |  1 +
>  mm/page_reporting.h | 12 ++--
>  2 files changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page_reporting.c b/mm/page_reporting.c
> index cd8e13d41df4..694df981ddd2 100644
> --- a/mm/page_reporting.c
> +++ b/mm/page_reporting.c
> @@ -12,6 +12,7 @@
>
>  #define PAGE_REPORTING_DELAY   (2 * HZ)
>  static struct page_reporting_dev_info __rcu *pr_dev_info __read_mostly;
> +unsigned long page_report_batch_size  __read_mostly = 16 * 1024 * 1024UL;
>
>  enum {
> PAGE_REPORTING_IDLE = 0,
> diff --git a/mm/page_reporting.h b/mm/page_reporting.h
> index 2c385dd4ddbd..b8fb3bbb345f 100644
> --- a/mm/page_reporting.h
> +++ b/mm/page_reporting.h
> @@ -12,6 +12,8 @@
>
>  #define PAGE_REPORTING_MIN_ORDER   pageblock_order
>
> +extern unsigned long page_report_batch_size;
> +
>  #ifdef CONFIG_PAGE_REPORTING
>  DECLARE_STATIC_KEY_FALSE(page_reporting_enabled);
>  void __page_reporting_notify(void);
> @@ -33,6 +35,8 @@ static inline bool page_reported(struct page *page)
>   */
>  static inline void page_reporting_notify_free(unsigned int order)
>  {
> +   static long batch_size;
> +

I'm not sure it makes a ton of sense to place the value in an
inline function. It might make more sense to put this new code in
__page_reporting_notify so that all callers would be referring to the
same batch_size value and you don't have to bother with the export of
the page_report_batch_size value.

> /* Called from hot path in __free_one_page() */
> if (!static_branch_unlikely(&page_reporting_enabled))
> return;
> @@ -41,8 +45,12 @@ static inline void page_reporting_notify_free(unsigned int 
> order)
> if (order < PAGE_REPORTING_MIN_ORDER)
> return;
>
> -   /* This will add a few cycles, but should be called infrequently */
> -   __page_reporting_notify();
> +   batch_size += (1 << order) << PAGE_SHIFT;
> +   if (batch_size >= page_report_batch_size) {
> +   batch_size = 0;

I would probably run this in the opposite direction. Rather than
running batch_size to zero I would look at adding a "batch_remaining"
and then when it is < 0 you could then reset it back to
page_report_batch_size. Doing that you only have to read one variable
most of the time instead of doing a comparison against two.

> +   /* This adds a few cycles, but should be called infrequently */
> +   __page_reporting_notify();
> +   }
>  }
>  #else /* CONFIG_PAGE_REPORTING */
>  #define page_reported(_page)   false
> --
> 2.18.2
>
>


Re: [RFC PATCH 1/3] mm: support hugetlb free page reporting

2020-12-23 Thread Alexander Duyck
On Tue, Dec 22, 2020 at 7:39 PM Liang Li  wrote:
>
> > > +hugepage_reporting_cycle(struct page_reporting_dev_info *prdev,
> > > +struct hstate *h, unsigned int nid,
> > > +struct scatterlist *sgl, unsigned int *offset)
> > > +{
> > > +   struct list_head *list = &h->hugepage_freelists[nid];
> > > +   unsigned int page_len = PAGE_SIZE << h->order;
> > > +   struct page *page, *next;
> > > +   long budget;
> > > +   int ret = 0, scan_cnt = 0;
> > > +
> > > +   /*
> > > +* Perform early check, if free area is empty there is
> > > +* nothing to process so we can skip this free_list.
> > > +*/
> > > +   if (list_empty(list))
> > > +   return ret;
> > > +
> > > +   spin_lock_irq(&hugetlb_lock);
> > > +
> > > +   if (huge_page_order(h) > MAX_ORDER)
> > > +   budget = HUGEPAGE_REPORTING_CAPACITY;
> > > +   else
> > > +   budget = HUGEPAGE_REPORTING_CAPACITY * 32;
> >
> > Wouldn't huge_page_order always be more than MAX_ORDER? Seems like we
> > don't even really need budget since this should probably be pulling
> > out no more than one hugepage at a time.
>
> I want to distinguish between a 2M page and a 1GB page here. The order
> of a 1GB page is greater than MAX_ORDER while a 2M page's order is
> less than MAX_ORDER.

The budget here is broken. When I put the budget in page reporting it
was so that we wouldn't try to report all of the memory in a given
region. It is meant to hold us to no more than one pass through 1/16
of the free memory. So essentially we will be slowly processing all of
memory and it will take 16 calls (32 seconds) for us to process a
system that is sitting completely idle. It is meant to pace us so we
don't spend a ton of time doing work that will be undone, not to
prevent us from burying a CPU which is what seems to be implied here.

Using HUGEPAGE_REPORTING_CAPACITY makes no sense here. I was using it
in the original definition because it was how many pages we could
scoop out at a time, and then I was aiming for a 16th of that. Here you
are arbitrarily squaring HUGEPAGE_REPORTING_CAPACITY in terms of the
amount of work you will do, since you are using it as a multiplier
instead of a divisor.

> >
> > > +   /* loop through free list adding unreported pages to sg list */
> > > +   list_for_each_entry_safe(page, next, list, lru) {
> > > +   /* We are going to skip over the reported pages. */
> > > +   if (PageReported(page)) {
> > > +   if (++scan_cnt >= MAX_SCAN_NUM) {
> > > +   ret = scan_cnt;
> > > +   break;
> > > +   }
> > > +   continue;
> > > +   }
> > > +
> >
> > It would probably have been better to place this set before your new
> > set. I don't see your new set necessarily being the best use for page
> > reporting.
>
> I haven't really latched on to what you mean, could you explain it again?

It would be better for you to spend time understanding how this patch
set works before you go about expanding it to do other things.
Mistakes like the budget one above kind of point out the fact that you
don't understand how this code was supposed to work and just kind of
shoehorned you page zeroing code onto it.

It would be better to look at trying to understand this code first
before you extend it to support your zeroing use case. So adding huge
pages first might make more sense than trying to zero and push the
order down. The fact is the page reporting extension should be minimal
for huge pages since they are just passed as a scatterlist so you
should only need to add a small bit to page_reporting.c to extend it
to support this use case.

> >
> > > +   /*
> > > +* If we fully consumed our budget then update our
> > > +* state to indicate that we are requesting additional
> > > +* processing and exit this list.
> > > +*/
> > > +   if (budget < 0) {
> > > +   atomic_set(&prdev->state,
> > > PAGE_REPORTING_REQUESTED);
> > > +   next = page;
> > > +   break;
> > > +   }
> > > +
> >
> > If budget is only ever going to be 1 then we probably could just look
> > at making this the default case for any time we find a non-reported
> > page.
>
> and here again.

It comes down to the fact that the changes you made have a significant
impact on how this is supposed to function. Reducing the scatterlist
to a size of one makes the whole point of doing batching kind of
pointless. Basically the code should be rewritten with the assumption
that if you find a page you report it.

The old code would batch things up because there is significant
overhead to be addressed when going to the hypervisor to report said
memory. Your code doesn't seem to really take anything like that into
account.

Re: [RFC PATCH 1/3] mm: support hugetlb free page reporting

2020-12-22 Thread Alexander Duyck
On Mon, Dec 21, 2020 at 11:47 PM Liang Li  wrote:
>
> Free page reporting only supports buddy pages; it can't report the
> free pages reserved for the hugetlbfs case. On the other hand, hugetlbfs
> is a good choice for a system with a huge amount of RAM, because it
> can help to reduce the memory management overhead and improve system
> performance.
> This patch adds support for reporting hugepages in the free list
> of hugetlb; it can be used by the virtio_balloon driver for memory
> overcommit and to pre-zero free pages to speed up memory population.
>
> Cc: Alexander Duyck 
> Cc: Mel Gorman 
> Cc: Andrea Arcangeli 
> Cc: Dan Williams 
> Cc: Dave Hansen 
> Cc: David Hildenbrand 
> Cc: Michal Hocko 
> Cc: Andrew Morton 
> Cc: Alex Williamson 
> Cc: Michael S. Tsirkin 
> Cc: Jason Wang 
> Cc: Mike Kravetz 
> Cc: Liang Li 
> Signed-off-by: Liang Li 
> ---
>  include/linux/hugetlb.h|   3 +
>  include/linux/page_reporting.h |   5 +
>  mm/hugetlb.c   |  29 
>  mm/page_reporting.c| 287 +
>  mm/page_reporting.h|  34 
>  5 files changed, 358 insertions(+)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index ebca2ef02212..a72ad25501d3 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -11,6 +11,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  struct ctl_table;
>  struct user_struct;
> @@ -114,6 +115,8 @@ int hugetlb_treat_movable_handler(struct ctl_table *, 
> int, void *, size_t *,
>  int hugetlb_mempolicy_sysctl_handler(struct ctl_table *, int, void *, size_t 
> *,
> loff_t *);
>
> +bool isolate_free_huge_page(struct page *page, struct hstate *h, int nid);
> +void putback_isolate_huge_page(struct hstate *h, struct page *page);
>  int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct 
> vm_area_struct *);
>  long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
>  struct page **, struct vm_area_struct **,
> diff --git a/include/linux/page_reporting.h b/include/linux/page_reporting.h
> index 63e1e9fbcaa2..0da3d1a6f0cc 100644
> --- a/include/linux/page_reporting.h
> +++ b/include/linux/page_reporting.h
> @@ -7,6 +7,7 @@
>
>  /* This value should always be a power of 2, see page_reporting_cycle() */
>  #define PAGE_REPORTING_CAPACITY32
> +#define HUGEPAGE_REPORTING_CAPACITY1
>
>  struct page_reporting_dev_info {
> /* function that alters pages to make them "reported" */
> @@ -26,4 +27,8 @@ struct page_reporting_dev_info {
>  /* Tear-down and bring-up for page reporting devices */
>  void page_reporting_unregister(struct page_reporting_dev_info *prdev);
>  int page_reporting_register(struct page_reporting_dev_info *prdev);
> +
> +/* Tear-down and bring-up for hugepage reporting devices */
> +void hugepage_reporting_unregister(struct page_reporting_dev_info *prdev);
> +int hugepage_reporting_register(struct page_reporting_dev_info *prdev);
>  #endif /*_LINUX_PAGE_REPORTING_H */
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index cbf32d2824fd..de6ce147dfe2 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -41,6 +41,7 @@
>  #include 
>  #include 
>  #include 
> +#include "page_reporting.h"
>  #include "internal.h"
>
>  int hugetlb_max_hstate __read_mostly;
> @@ -1028,6 +1029,11 @@ static void enqueue_huge_page(struct hstate *h, struct 
> page *page)
> list_move(&page->lru, &h->hugepage_freelists[nid]);
> h->free_huge_pages++;
> h->free_huge_pages_node[nid]++;
> +   if (hugepage_reported(page)) {
> +   __ClearPageReported(page);
> +   pr_info("%s, free_huge_pages=%ld\n", __func__, 
> h->free_huge_pages);
> +   }
> +   hugepage_reporting_notify_free(h->order);
>  }
>
>  static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
> @@ -5531,6 +5537,29 @@ follow_huge_pgd(struct mm_struct *mm, unsigned long 
> address, pgd_t *pgd, int fla
> return pte_page(*(pte_t *)pgd) + ((address & ~PGDIR_MASK) >> 
> PAGE_SHIFT);
>  }
>
> +bool isolate_free_huge_page(struct page *page, struct hstate *h, int nid)
> +{
> +   bool ret = true;
> +
> +   VM_BUG_ON_PAGE(!PageHead(page), page);
> +
> +   list_move(&page->lru, &h->hugepage_activelist);
> +   set_page_refcounted(page);
> +   h->free_huge_pages--;
> +   h->free_huge_pages_node[nid]--;
> +
> +   return ret;
> +}
> +
> +void putback_isolate_huge_page(struct hstate *h, struct page *page)
> 

Re: [RFC v2 PATCH 0/4] speed up page allocation for __GFP_ZERO

2020-12-22 Thread Alexander Duyck
On Mon, Dec 21, 2020 at 8:25 AM Liang Li  wrote:
>
> The first version can be found at: https://lkml.org/lkml/2020/4/12/42
>
> Zeroing out page content usually happens when allocating pages with
> the __GFP_ZERO flag. This is a time-consuming operation, and it makes
> populating a large vma area very slow. This patch introduces a new
> feature to zero out pages before page allocation, which can help
> speed up page allocation with __GFP_ZERO.
>
> My original intention for adding this feature is to shorten VM
> creation time when an SR-IOV device is attached; it works well and the
> VM creation time is reduced by about 90%.
>
> Creating a VM [64G RAM, 32 CPUs] with GPU passthrough
> =
> QEMU uses 4K pages, THP is off
>                 round1  round2  round3
> w/o this patch: 23.5s   24.7s   24.6s
> w/ this patch:  10.2s   10.3s   11.2s
>
> QEMU uses 4K pages, THP is on
>                 round1  round2  round3
> w/o this patch: 17.9s   14.8s   14.9s
> w/ this patch:   1.9s    1.8s    1.9s
> =
>
> Obviously, it can do more than this. We can benefit from this feature
> in the following cases:

So I am not sure page reporting is the best thing to base this page
zeroing setup on. The idea with page reporting is to essentially act
as a leaky bucket and let the guest slowly drop memory it isn't using,
so that if it needs to reinflate it won't clash with the applications
that need memory. What you are doing here seems far more aggressive in
that you are going down to low-order pages and sleeping instead of
rescheduling for the next time interval.

Also I am not sure your SR-IOV creation time test is a good
justification for this extra overhead. With your patches applied all
you are doing is making use of the free time before the test to do the
page zeroing instead of doing it during your test. As such your CPU
overhead prior to running the test would be higher and you haven't
captured that information.

One thing I would be interested in seeing is what is the load this is
adding when you are running simple memory allocation/free type tests
on the system. For example it might be useful to see what the
will-it-scale page_fault1 tests look like with this patch applied
versus not applied. I suspect it would add a considerable amount of
overhead, as you have to spend a ton of time scanning all the pages.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] virtio_balloon: clear modern features under legacy

2020-07-14 Thread Alexander Duyck
On Tue, Jul 14, 2020 at 1:45 AM Michael S. Tsirkin  wrote:
>
> On Mon, Jul 13, 2020 at 08:10:14AM -0700, Alexander Duyck wrote:
> > On Sun, Jul 12, 2020 at 8:10 AM Michael S. Tsirkin  wrote:
> > >
> > > On Fri, Jul 10, 2020 at 09:13:41AM -0700, Alexander Duyck wrote:
> > > > On Fri, Jul 10, 2020 at 4:31 AM Michael S. Tsirkin  
> > > > wrote:
> > > > >



> > > As you say correctly the command id is actually assumed native endian:
> > >
> > >
> > > static u32 virtio_balloon_cmd_id_received(struct virtio_balloon *vb)
> > > {
> > > if (test_and_clear_bit(VIRTIO_BALLOON_CONFIG_READ_CMD_ID,
> > >&vb->config_read_bitmap))
> > > virtio_cread(vb->vdev, struct virtio_balloon_config,
> > >  free_page_hint_cmd_id,
> > >  &vb->cmd_id_received_cache);
> > >
> > > return vb->cmd_id_received_cache;
> > > }
> > >
> > >
> > > So guest assumes native, host assumes LE.
> >
> > This wasn't even the one I was talking about, but now that you point
> > it out, this is definitely a bug. The command ID I was talking about was
> > the one being passed via the descriptor ring. That one I believe is
> > native on both sides.
>
> Well qemu swaps it for modern devices:
>
> virtio_tswap32s(vdev, &id);
>
> guest swaps it too:
> vb->cmd_id_active = cpu_to_virtio32(vb->vdev,
> virtio_balloon_cmd_id_received(vb));
> sg_init_one(&sg, &vb->cmd_id_active, sizeof(vb->cmd_id_active));
> err = virtqueue_add_outbuf(vq, &sg, 1, &vb->cmd_id_active, GFP_KERNEL);
>
> So it's native for legacy.

Okay, that makes sense. I just wasn't familiar with the virtio32 type.

I guess that just means we need to fix the original issue you found
where the guest was assuming native for the command ID in the config.
Do you plan to patch that or should I?

> > >
> > >
> > >
> > > > > ---
> > > > >  drivers/virtio/virtio_balloon.c | 9 +
> > > > >  1 file changed, 9 insertions(+)
> > > > >
> > > > > diff --git a/drivers/virtio/virtio_balloon.c 
> > > > > b/drivers/virtio/virtio_balloon.c
> > > > > index 5d4b891bf84f..b9bc03345157 100644
> > > > > --- a/drivers/virtio/virtio_balloon.c
> > > > > +++ b/drivers/virtio/virtio_balloon.c
> > > > > @@ -1107,6 +1107,15 @@ static int virtballoon_restore(struct 
> > > > > virtio_device *vdev)
> > > > >
> > > > >  static int virtballoon_validate(struct virtio_device *vdev)
> > > > >  {
> > > > > +   /*
> > > > > +* Legacy devices never specified how modern features should 
> > > > > behave.
> > > > > +* E.g. which endian-ness to use? Better not to assume 
> > > > > anything.
> > > > > +*/
> > > > > +   if (!virtio_has_feature(vdev, VIRTIO_F_VERSION_1)) {
> > > > > +   __virtio_clear_bit(vdev, 
> > > > > VIRTIO_BALLOON_F_FREE_PAGE_HINT);
> > > > > +   __virtio_clear_bit(vdev, 
> > > > > VIRTIO_BALLOON_F_PAGE_POISON);
> > > > > +   __virtio_clear_bit(vdev, VIRTIO_BALLOON_F_REPORTING);
> > > > > +   }
> > > > > /*
> > > > >  * Inform the hypervisor that our pages are poisoned or
> > > > >  * initialized. If we cannot do that then we should disable
> > > >
> > > > The patch content itself I am fine with since odds are nobody would
> > > > expect to use these features with a legacy device.
> > > >
> > > > Acked-by: Alexander Duyck 
> > >
> > > Hmm so now you pointed out it's just cmd id, maybe I should just fix it
> > > instead? what do you say?
> >
> > So the config issues are bugs, but I don't think you saw the one I was
> > talking about. In the function send_cmd_id_start the cmd_id_active
> > value which is initialized as a virtio32 is added as a sg entry and
> > then sent as an outbuf to the device. I'm assuming virtio32 is a host
> > native byte ordering.
>
> IIUC it isn't :) virtio32 is guest native if device is legacy, and LE if
> device is modern.

Okay. So I should probably document that in the spec I have been
working on. It looks like there is an example of similar documentation
for the memory statistics, so it should be pretty straightforward.

Thanks.

- Alex


Re: [PATCH] virtio_balloon: clear modern features under legacy

2020-07-13 Thread Alexander Duyck
On Sun, Jul 12, 2020 at 8:10 AM Michael S. Tsirkin  wrote:
>
> On Fri, Jul 10, 2020 at 09:13:41AM -0700, Alexander Duyck wrote:
> > On Fri, Jul 10, 2020 at 4:31 AM Michael S. Tsirkin  wrote:
> > >
> > > Page reporting features were never supported by legacy hypervisors.
> > > Supporting them poses a problem: should we use native endian-ness (like
> > > current code assumes)? Or little endian-ness like the virtio spec says?
> > > Rather than try to figure out, and since results of
> > > incorrect endian-ness are dire, let's just block this configuration.
> > >
> > > Cc: sta...@vger.kernel.org
> > > Signed-off-by: Michael S. Tsirkin 
> >
> > So I am not sure about the patch description. In the case of page
> > poison and free page reporting I don't think we are defining anything
> > that doesn't already have a definition of how to use in legacy.
> > Specifically the virtio_balloon_config is already defined as having
> > all fields as little endian in legacy mode, and there is a definition
> > for all of the fields in a virtqueue and how they behave in legacy
> > mode.
> >
> > As far as I can see the only item that may be an issue is the command
> > ID being supplied via the virtqueue for free page hinting, which
> > appears to be in native endian-ness. Otherwise it would have fallen
> > into the same category since it is making use of virtio_balloon_config
> > and a virtqueue for supplying the page location and length.
>
>
>
> So as you point out correctly balloon spec says all fields are little
> endian.  Fair enough.
> Problem is when virtio 1 is not negotiated, then this is not what the
> driver assumes for any except a handlful of fields.
>
> But yes it mostly works out.
>
> For example:
>
>
> static void update_balloon_size(struct virtio_balloon *vb)
> {
> u32 actual = vb->num_pages;
>
> /* Legacy balloon config space is LE, unlike all other devices. */
> if (!virtio_has_feature(vb->vdev, VIRTIO_F_VERSION_1))
> actual = (__force u32)cpu_to_le32(actual);
>
> virtio_cwrite(vb->vdev, struct virtio_balloon_config, actual,
>   &actual);
> }
>
>
> this is LE even without VIRTIO_F_VERSION_1, so matches spec.
>
> /* Start with poison val of 0 representing general init */
> __u32 poison_val = 0;
>
> /*
>  * Let the hypervisor know that we are expecting a
>  * specific value to be written back in balloon pages.
>  */
> if (!want_init_on_free())
> memset(&poison_val, PAGE_POISON, sizeof(poison_val));
>
> virtio_cwrite(vb->vdev, struct virtio_balloon_config,
>   poison_val, &poison_val);
>
>
> actually this writes a native endian-ness value. All bytes happen to be
> the same though, and host only cares about 0 or non 0 ATM.

So we are safe assuming it is a repeating value, but for correctness
maybe we should make certain to cast this as a le32 value. I can
submit a patch to do that.

> As you say correctly the command id is actually assumed native endian:
>
>
> static u32 virtio_balloon_cmd_id_received(struct virtio_balloon *vb)
> {
> if (test_and_clear_bit(VIRTIO_BALLOON_CONFIG_READ_CMD_ID,
>&vb->config_read_bitmap))
> virtio_cread(vb->vdev, struct virtio_balloon_config,
>  free_page_hint_cmd_id,
>  &vb->cmd_id_received_cache);
>
> return vb->cmd_id_received_cache;
> }
>
>
> So guest assumes native, host assumes LE.

This wasn't even the one I was talking about, but now that you point
it out, this is definitely a bug. The command ID I was talking about was
the one being passed via the descriptor ring. That one I believe is
native on both sides.

>
>
>
> > > ---
> > >  drivers/virtio/virtio_balloon.c | 9 +
> > >  1 file changed, 9 insertions(+)
> > >
> > > diff --git a/drivers/virtio/virtio_balloon.c 
> > > b/drivers/virtio/virtio_balloon.c
> > > index 5d4b891bf84f..b9bc03345157 100644
> > > --- a/drivers/virtio/virtio_balloon.c
> > > +++ b/drivers/virtio/virtio_balloon.c
> > > @@ -1107,6 +1107,15 @@ static int virtballoon_restore(struct 
> > > virtio_device *vdev)
> > >
> > >  static int virtballoon_validate(struct virtio_device *vdev)
> > >  {
> > > +   /*
> > > +* Legacy devices never specified how modern features should 
> > 

Re: [PATCH] virtio_balloon: clear modern features under legacy

2020-07-10 Thread Alexander Duyck
On Fri, Jul 10, 2020 at 4:31 AM Michael S. Tsirkin  wrote:
>
> Page reporting features were never supported by legacy hypervisors.
> Supporting them poses a problem: should we use native endian-ness (like
> current code assumes)? Or little endian-ness like the virtio spec says?
> Rather than try to figure out, and since results of
> incorrect endian-ness are dire, let's just block this configuration.
>
> Cc: sta...@vger.kernel.org
> Signed-off-by: Michael S. Tsirkin 

So I am not sure about the patch description. In the case of page
poison and free page reporting I don't think we are defining anything
that doesn't already have a definition of how to use in legacy.
Specifically the virtio_balloon_config is already defined as having
all fields as little endian in legacy mode, and there is a definition
for all of the fields in a virtqueue and how they behave in legacy
mode.

As far as I can see the only item that may be an issue is the command
ID being supplied via the virtqueue for free page hinting, which
appears to be in native endian-ness. Otherwise it would have fallen
into the same category since it is making use of virtio_balloon_config
and a virtqueue for supplying the page location and length.

> ---
>  drivers/virtio/virtio_balloon.c | 9 +
>  1 file changed, 9 insertions(+)
>
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 5d4b891bf84f..b9bc03345157 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -1107,6 +1107,15 @@ static int virtballoon_restore(struct virtio_device 
> *vdev)
>
>  static int virtballoon_validate(struct virtio_device *vdev)
>  {
> +   /*
> +* Legacy devices never specified how modern features should behave.
> +* E.g. which endian-ness to use? Better not to assume anything.
> +*/
> +   if (!virtio_has_feature(vdev, VIRTIO_F_VERSION_1)) {
> +   __virtio_clear_bit(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT);
> +   __virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_POISON);
> +   __virtio_clear_bit(vdev, VIRTIO_BALLOON_F_REPORTING);
> +   }
> /*
>  * Inform the hypervisor that our pages are poisoned or
>  * initialized. If we cannot do that then we should disable

The patch content itself I am fine with since odds are nobody would
expect to use these features with a legacy device.

Acked-by: Alexander Duyck 


Re: [PATCH v2 resubmit] virtio-balloon: Disable free page reporting if page poison reporting is not enabled

2020-05-26 Thread Alexander Duyck
Do I need to resubmit this patch? It has been over two weeks now since
it was originally submitted, and a week and a half since I last sent
out an email following up. I'm just wondering if there is some list I
missed, as I am assuming the maintainers and lists I contacted here are
correct. It looks like we are at RC7 now, so I am worried this won't
make it in before Linus releases 5.7.

Thanks.

- Alex

On Fri, May 15, 2020 at 10:02 AM Alexander Duyck
 wrote:
>
> Just following up. It has been a week since I submitted this. I was
> hoping we could get it in for 5.7 since this affects free page
> reporting which will be introduced with that kernel release.
>
> Thanks.
>
> - Alex
>
> On Fri, May 8, 2020 at 10:40 AM Alexander Duyck
>  wrote:
> >
> > From: Alexander Duyck 
> >
> > We should disable free page reporting if page poisoning is enabled but we
> > cannot report it via the balloon interface. This way we can avoid the
> > possibility of corrupting guest memory. Normally the page poisoning feature
> > should always be present when free page reporting is enabled on the
> > hypervisor, however this allows us to correctly handle a case of the
> > virtio-balloon device being possibly misconfigured.
> >
> > Fixes: 5d757c8d518d ("virtio-balloon: add support for providing free page 
> > reports to host")
> > Acked-by: David Hildenbrand 
> > Signed-off-by: Alexander Duyck 
> > ---
> >
> > Changes since v1:
> > Originally this patch also modified free page hinting, that has been 
> > removed.
> > Updated patch title and description.
> > Added a comment explaining reasoning for disabling free page reporting.
> >
> > Resubmitting v2 w/ Ack from David Hildenbrand.
> >
> >  drivers/virtio/virtio_balloon.c |9 -
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/virtio/virtio_balloon.c 
> > b/drivers/virtio/virtio_balloon.c
> > index 51086a5afdd4..1f157d2f4952 100644
> > --- a/drivers/virtio/virtio_balloon.c
> > +++ b/drivers/virtio/virtio_balloon.c
> > @@ -1107,11 +1107,18 @@ static int virtballoon_restore(struct virtio_device 
> > *vdev)
> >
> >  static int virtballoon_validate(struct virtio_device *vdev)
> >  {
> > -   /* Tell the host whether we care about poisoned pages. */
> > +   /*
> > +* Inform the hypervisor that our pages are poisoned or
> > +* initialized. If we cannot do that then we should disable
> > +* page reporting as it could potentially change the contents
> > +* of our free pages.
> > +*/
> > if (!want_init_on_free() &&
> > (IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY) ||
> >  !page_poisoning_enabled()))
> > __virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_POISON);
> > +   else if (!virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON))
> > +   __virtio_clear_bit(vdev, VIRTIO_BALLOON_F_REPORTING);
> >
> > __virtio_clear_bit(vdev, VIRTIO_F_IOMMU_PLATFORM);
> > return 0;
> >


Re: [PATCH v2 resubmit] virtio-balloon: Disable free page reporting if page poison reporting is not enabled

2020-05-15 Thread Alexander Duyck
Just following up. It has been a week since I submitted this. I was
hoping we could get it in for 5.7 since this affects free page
reporting which will be introduced with that kernel release.

Thanks.

- Alex

On Fri, May 8, 2020 at 10:40 AM Alexander Duyck
 wrote:
>
> From: Alexander Duyck 
>
> We should disable free page reporting if page poisoning is enabled but we
> cannot report it via the balloon interface. This way we can avoid the
> possibility of corrupting guest memory. Normally the page poisoning feature
> should always be present when free page reporting is enabled on the
> hypervisor, however this allows us to correctly handle a case of the
> virtio-balloon device being possibly misconfigured.
>
> Fixes: 5d757c8d518d ("virtio-balloon: add support for providing free page 
> reports to host")
> Acked-by: David Hildenbrand 
> Signed-off-by: Alexander Duyck 
> ---
>
> Changes since v1:
> Originally this patch also modified free page hinting, that has been removed.
> Updated patch title and description.
> Added a comment explaining reasoning for disabling free page reporting.
>
> Resubmitting v2 w/ Ack from David Hildenbrand.
>
>  drivers/virtio/virtio_balloon.c |9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index 51086a5afdd4..1f157d2f4952 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -1107,11 +1107,18 @@ static int virtballoon_restore(struct virtio_device 
> *vdev)
>
>  static int virtballoon_validate(struct virtio_device *vdev)
>  {
> -   /* Tell the host whether we care about poisoned pages. */
> +   /*
> +* Inform the hypervisor that our pages are poisoned or
> +* initialized. If we cannot do that then we should disable
> +* page reporting as it could potentially change the contents
> +* of our free pages.
> +*/
> if (!want_init_on_free() &&
> (IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY) ||
>  !page_poisoning_enabled()))
> __virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_POISON);
> +   else if (!virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON))
> +   __virtio_clear_bit(vdev, VIRTIO_BALLOON_F_REPORTING);
>
> __virtio_clear_bit(vdev, VIRTIO_F_IOMMU_PLATFORM);
> return 0;
>


[PATCH v2 resubmit] virtio-balloon: Disable free page reporting if page poison reporting is not enabled

2020-05-08 Thread Alexander Duyck
From: Alexander Duyck 

We should disable free page reporting if page poisoning is enabled but we
cannot report it via the balloon interface. This way we can avoid the
possibility of corrupting guest memory. Normally the page poisoning feature
should always be present when free page reporting is enabled on the
hypervisor, however this allows us to correctly handle a case of the
virtio-balloon device being possibly misconfigured.

Fixes: 5d757c8d518d ("virtio-balloon: add support for providing free page 
reports to host")
Acked-by: David Hildenbrand 
Signed-off-by: Alexander Duyck 
---

Changes since v1:
Originally this patch also modified free page hinting, that has been removed.
Updated patch title and description.
Added a comment explaining reasoning for disabling free page reporting.

Resubmitting v2 w/ Ack from David Hildenbrand.

 drivers/virtio/virtio_balloon.c |9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 51086a5afdd4..1f157d2f4952 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -1107,11 +1107,18 @@ static int virtballoon_restore(struct virtio_device 
*vdev)
 
 static int virtballoon_validate(struct virtio_device *vdev)
 {
-   /* Tell the host whether we care about poisoned pages. */
+   /*
+* Inform the hypervisor that our pages are poisoned or
+* initialized. If we cannot do that then we should disable
+* page reporting as it could potentially change the contents
+* of our free pages.
+*/
if (!want_init_on_free() &&
(IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY) ||
 !page_poisoning_enabled()))
__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_POISON);
+   else if (!virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON))
+   __virtio_clear_bit(vdev, VIRTIO_BALLOON_F_REPORTING);
 
__virtio_clear_bit(vdev, VIRTIO_F_IOMMU_PLATFORM);
return 0;



Re: [PATCH v2] virtio-balloon: Disable free page reporting if page poison reporting is not enabled

2020-04-27 Thread Alexander Duyck
On Mon, Apr 27, 2020 at 8:09 AM David Hildenbrand  wrote:
>
> On 27.04.20 16:58, Alexander Duyck wrote:
> > On Mon, Apr 27, 2020 at 1:08 AM David Hildenbrand  wrote:
> >>
> >> On 24.04.20 18:24, Alexander Duyck wrote:
> >>> From: Alexander Duyck 
> >>>
> >>> We should disable free page reporting if page poisoning is enabled in the
> >>> kernel but we cannot report it via the balloon interface. This way we can
> >>> avoid the possibility of corrupting guest memory. Normally the page poison
> >>> reporting feature should always be present when free page reporting is
> >>> enabled on the hypervisor, however this allows us to correctly handle a
> >>> case of the virtio-balloon device being possibly misconfigured.
> >>>
> >>> Fixes: 5d757c8d518d ("virtio-balloon: add support for providing free page 
> >>> reports to host")
> >>> Signed-off-by: Alexander Duyck 
> >>> ---
> >>>
> >>> Changes since v1:
> >>> Originally this patch also modified free page hinting, that has been 
> >>> removed.
> >>> Updated patch title and description.
> >>> Added a comment explaining reasoning for disabling free page reporting.
> >>>
> >>>  drivers/virtio/virtio_balloon.c |9 -
> >>>  1 file changed, 8 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/drivers/virtio/virtio_balloon.c 
> >>> b/drivers/virtio/virtio_balloon.c
> >>> index 51086a5afdd4..1f157d2f4952 100644
> >>> --- a/drivers/virtio/virtio_balloon.c
> >>> +++ b/drivers/virtio/virtio_balloon.c
> >>> @@ -1107,11 +1107,18 @@ static int virtballoon_restore(struct 
> >>> virtio_device *vdev)
> >>>
> >>>  static int virtballoon_validate(struct virtio_device *vdev)
> >>>  {
> >>> - /* Tell the host whether we care about poisoned pages. */
> >>> + /*
> >>> +  * Inform the hypervisor that our pages are poisoned or
> >>> +  * initialized. If we cannot do that then we should disable
> >>> +  * page reporting as it could potentially change the contents
> >>> +  * of our free pages.
> >>> +  */
> >>>   if (!want_init_on_free() &&
> >>>   (IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY) ||
> >>>!page_poisoning_enabled()))
> >>>   __virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_POISON);
> >>> + else if (!virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON))
> >>> + __virtio_clear_bit(vdev, VIRTIO_BALLOON_F_REPORTING);
> >>>
> >>>   __virtio_clear_bit(vdev, VIRTIO_F_IOMMU_PLATFORM);
> >>>   return 0;
> >>>
> >>
> >> Did you see my feedback on v1?
> >>
> >> https://www.spinics.net/lists/linux-virtualization/msg42783.html
> >>
> >> In case of want_init_on_free(), we don't really need 
> >> VIRTIO_BALLOON_F_PAGE_POISON.
> >
> > I believe we do. We had discussions earlier in which Michael had
> > mentioned that the guest should not assume implementation details of
> > the host/hypervisor.
>
> We can simply document this, as I suggested already ("either the old
> page, or a page filled with zero"). Perfectly fine IMHO.
>
> >
> > The PAGE_POISON feature is being used to indicate that we expect the
> > page to contain a certain value when it is returned to us. With the
> > current implementation in QEMU if that value is zero we can work with
> > it because that is the effect that MADV_DONTNEED has. However if there
> > were some other effect being used by QEMU we would want to know that
> > the guest is expecting us to write a 0 page. So I believe it makes
> > sense to inform the hypervisor that we are doing something that
> > expects us to fill the page with zeros in the case of
>
> Informing makes sense, yes. And we inform it via the poison feature, if
> possible. This discussion is about "what happens if the poison feature
> is not there, but we do have want_init_on_free()."

My preference is to err on the side of caution. I view
want_init_on_free() as equivalent to page_poisoning_enabled() with
CONFIG_PAGE_POISONING_ZERO, and treat them the same way. So if we are
going to let one by, then we would want to let the other by as well.

> > want_init_on_free rather than not providing that information to QEMU.
> > This way, if in the future we decide to change the imple

Re: [PATCH v2] virtio-balloon: Disable free page reporting if page poison reporting is not enabled

2020-04-27 Thread Alexander Duyck
On Mon, Apr 27, 2020 at 1:08 AM David Hildenbrand  wrote:
>
> On 24.04.20 18:24, Alexander Duyck wrote:
> > From: Alexander Duyck 
> >
> > We should disable free page reporting if page poisoning is enabled in the
> > kernel but we cannot report it via the balloon interface. This way we can
> > avoid the possibility of corrupting guest memory. Normally the page poison
> > reporting feature should always be present when free page reporting is
> > enabled on the hypervisor, however this allows us to correctly handle a
> > case of the virtio-balloon device being possibly misconfigured.
> >
> > Fixes: 5d757c8d518d ("virtio-balloon: add support for providing free page 
> > reports to host")
> > Signed-off-by: Alexander Duyck 
> > ---
> >
> > Changes since v1:
> > Originally this patch also modified free page hinting, that has been 
> > removed.
> > Updated patch title and description.
> > Added a comment explaining reasoning for disabling free page reporting.
> >
> >  drivers/virtio/virtio_balloon.c |9 -
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/virtio/virtio_balloon.c 
> > b/drivers/virtio/virtio_balloon.c
> > index 51086a5afdd4..1f157d2f4952 100644
> > --- a/drivers/virtio/virtio_balloon.c
> > +++ b/drivers/virtio/virtio_balloon.c
> > @@ -1107,11 +1107,18 @@ static int virtballoon_restore(struct virtio_device 
> > *vdev)
> >
> >  static int virtballoon_validate(struct virtio_device *vdev)
> >  {
> > - /* Tell the host whether we care about poisoned pages. */
> > + /*
> > +  * Inform the hypervisor that our pages are poisoned or
> > +  * initialized. If we cannot do that then we should disable
> > +  * page reporting as it could potentially change the contents
> > +  * of our free pages.
> > +  */
> >   if (!want_init_on_free() &&
> >   (IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY) ||
> >!page_poisoning_enabled()))
> >   __virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_POISON);
> > + else if (!virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON))
> > + __virtio_clear_bit(vdev, VIRTIO_BALLOON_F_REPORTING);
> >
> >   __virtio_clear_bit(vdev, VIRTIO_F_IOMMU_PLATFORM);
> >   return 0;
> >
>
> Did you see my feedback on v1?
>
> https://www.spinics.net/lists/linux-virtualization/msg42783.html
>
> In case of want_init_on_free(), we don't really need 
> VIRTIO_BALLOON_F_PAGE_POISON.

I believe we do. We had discussions earlier in which Michael had
mentioned that the guest should not assume implementation details of
the host/hypervisor.

The PAGE_POISON feature is being used to indicate that we expect the
page to contain a certain value when it is returned to us. With the
current implementation in QEMU if that value is zero we can work with
it because that is the effect that MADV_DONTNEED has. However if there
were some other effect being used by QEMU we would want to know that
the guest is expecting us to write a 0 page. So I believe it makes
sense to inform the hypervisor that we are doing something that
expects us to fill the page with zeros in the case of
want_init_on_free rather than not providing that information to QEMU.
This way, if in the future we decide to change the implementation in
some way that might effect the value contained in the pages, we might
have the flexibility to identify the want_init_on_free case so we can
work around it.

In reality it isn't too different from informing QEMU that we are
performing poisoning with a value of 0 anyway, which, if I recall, is
what led me to add the want_init_on_free check, as they are all lumped
together in free_pages_prezeroed():
https://elixir.bootlin.com/linux/v5.7-rc3/source/mm/page_alloc.c#L2134

Thanks.

- Alex


[PATCH v2] virtio-balloon: Disable free page reporting if page poison reporting is not enabled

2020-04-24 Thread Alexander Duyck
From: Alexander Duyck 

We should disable free page reporting if page poisoning is enabled in the
kernel but we cannot report it via the balloon interface. This way we can
avoid the possibility of corrupting guest memory. Normally the page poison
reporting feature should always be present when free page reporting is
enabled on the hypervisor, however this allows us to correctly handle a
case of the virtio-balloon device being possibly misconfigured.

Fixes: 5d757c8d518d ("virtio-balloon: add support for providing free page 
reports to host")
Signed-off-by: Alexander Duyck 
---

Changes since v1:
Originally this patch also modified free page hinting, that has been removed.
Updated patch title and description.
Added a comment explaining reasoning for disabling free page reporting.

 drivers/virtio/virtio_balloon.c |9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 51086a5afdd4..1f157d2f4952 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -1107,11 +1107,18 @@ static int virtballoon_restore(struct virtio_device 
*vdev)
 
 static int virtballoon_validate(struct virtio_device *vdev)
 {
-   /* Tell the host whether we care about poisoned pages. */
+   /*
+* Inform the hypervisor that our pages are poisoned or
+* initialized. If we cannot do that then we should disable
+* page reporting as it could potentially change the contents
+* of our free pages.
+*/
if (!want_init_on_free() &&
(IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY) ||
 !page_poisoning_enabled()))
__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_POISON);
+   else if (!virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON))
+   __virtio_clear_bit(vdev, VIRTIO_BALLOON_F_REPORTING);
 
__virtio_clear_bit(vdev, VIRTIO_F_IOMMU_PLATFORM);
return 0;



Re: [virtio-dev] Re: [PATCH] virtio-balloon: Disable free page hinting/reporting if page poison is disabled

2020-04-22 Thread Alexander Duyck
On Wed, Apr 22, 2020 at 3:24 AM David Hildenbrand  wrote:
>
> >>> What is the expectation there? I assume we are saying either
> >>> poison_val or unmodified? If so I would think the inflate case makes
> >>> much more sense as that is where the madvise is called that will
> >>> discard the data. If so it would be pretty easy to just add a check
> >>> for the poison value to the same spot we check
> >>> qemu_balloon_is_inhibited.
> >>
> >> Okay, we have basically no idea what was the intention of
> >> VIRTIO_BALLOON_F_PAGE_POISON with basic deflation/inflation as well. So
> >> I think we can define what suits us.
> >>
> >> On the deflate path, we could always simply fill with poison_val. But
> >> there are nasty corner cases (esp. no VIRTIO_BALLOON_F_MUST_TELL_HOST).
> >>
> >> What would be your suggestion? Also don't care about
> >> VIRTIO_BALLOON_F_PAGE_POISON on ordinary inflation/deflation? At this
> >> point, I think this makes sense.
> >
> > That is kind of what I was thinking. The problem is that once again
> > the current implementation works when page poisoning is enabled. Us
> > disabling that wouldn't make much sense.
> >
> > The whole thing with the reporting is that we are essentially just
> > ballooning in place. What we may do at some point in the future would
> > be to add an additional feature bit to do that for the standard
> > balloon/hinting case. Then when that is set, and we know the contents
> > won't match we can then just skip the madvise or hinting calls. That
> > way it becomes an opt-in which is what the poison was supposed to be,
> > but wasn't because the QEMU side was never implemented.
>
> Yeah, introducing this later makes sense.
>
> So VIRTIO_BALLOON_F_PAGE_POISON really means:
> - poison_val in the config is unlocked
> - when active, the guest is using page poisoning/init on free with
>   poison_val ("for you information")
> - it only changes the semantic of free page reporting, nothing else.
>   (when reusing reported pages in the guest, they will either have the
>   old content, or will be filled with poison_val.)
>
> Makes sense? That should be easy to document.

Yep, makes sense. In theory, the old content and a page filled with
poison_val should amount to the same thing.

> >
> > In the meantime I still have to make more changes to my QEMU patch
> > set. The way the config_size logic is implemented is somewhat of a
> > pain when you factor in the way the host_features and poison were
> > handled.
>
> Okay, I'll wait for updated QEMU patches.

I got to the root cause of the issues I was seeing. The config size
being dependent on the page poison feature was somewhat problematic as
it affects where I can place the setting of the bit since I have to
have it done before we call virtio_init. I should be submitting the
patches this afternoon. I am just going through and making sure I have
my bases covered and testing for any corner cases.


Re: [virtio-dev] Re: [PATCH] virtio-balloon: Disable free page hinting/reporting if page poison is disabled

2020-04-21 Thread Alexander Duyck
On Tue, Apr 21, 2020 at 8:18 AM David Hildenbrand  wrote:
>
> >>
> >> That leaves us with either
> >>
> >> 1. "Pages hinted via VIRTIO_BALLOON_F_FREE_PAGE_HINT might get replaced
> >> by zero page. However, as soon as the page is written by the guest (even
> >> before the hinting request was processed by the host), the modified page
> >> will stay - whereby the unwritten parts might either be from the old, or
> >> from the zero page." - a QEMU BUG.
> >
> > How is this a bug? This is the behavior you would see with the current
> > balloon driver. When you put a page into a balloon it has the option
> > to either madvise it away, defer it, or just skip it because he
> > balloon is disabled. Is the "zero page" a typo? If it was
> > uninitialized data that would be a bug, but I don't see how a zero
> > page or the old data would be a bug.
>
> Sorry, I meant if this was the original design idea of hinting, then we
> would have a QEMU BUG as of now, as we might get uninitialized data.
> Makes sense?

Yes, that makes sense. So the bug is that we aren't doing this right now.

> >
> >>>
>  The current QEMU implementation would be to simply migrate all hinted
>  pages. In the future we could optimize. Not sure if it's worth the 
>  trouble.
> >>>
> >>> So the trick for me is how best to go about sorting this all out.
> >>> There are two ways I see of doing it.
> >>>
> >>> The first approach would be to just assume that hinting should be
> >>> disassociated from poisoning. If I go that route that would more
> >>> closely match up with what happened in QEMU. The downside is that it
> >>> locks in the unmodified/uninitialized behavior and would require pages
> >>> to be rewritten when they come out of the balloon. However this is the
> >>> behavior we have now so it would only require specification
> >>> documentation changes.
> >>
> >> Right now, for simplicity, I prefer this and define
> >> VIRTIO_BALLOON_F_PAGE_POISON only for explicit deflation (balloon
> >> deflation via the deflate queue) and implicit deflation
> >> (VIRTIO_BALLOON_F_REPORTING).
> >
> > I don't recall us talking about the explicit deflation case before.
>
> The interesting part is that drivers/virtio/virtio_balloon.c mentions:
>
> "Let the hypervisor know that we are expecting a specific value to be
> written back in balloon pages.".
>
> But I just realized that you introduced this comment, not the original
> VIRTIO_BALLOON_F_PAGE_POISON commit.
>
> Should this have been "in reported pages when implicitly deflating them
> by reusing them." or sth. like that?

Yeah, probably. I should have probably used "reported" instead of
"balloon" in the comment.

> > What is the expectation there? I assume we are saying either
> > poison_val or unmodified? If so I would think the inflate case makes
> > much more sense as that is where the madvise is called that will
> > discard the data. If so it would be pretty easy to just add a check
> > for the poison value to the same spot we check
> > qemu_balloon_is_inhibited.
>
> Okay, we have basically no idea what was the intention of
> VIRTIO_BALLOON_F_PAGE_POISON with basic deflation/inflation as well. So
> I think we can define what suits us.
>
> On the deflate path, we could always simply fill with poison_val. But
> there are nasty corner cases (esp. no VIRTIO_BALLOON_F_MUST_TELL_HOST).
>
> What would be your suggestion? Also don't care about
> VIRTIO_BALLOON_F_PAGE_POISON on ordinary inflation/deflation? At this
> point, I think this makes sense.

That is kind of what I was thinking. The problem is that once again
the current implementation works when page poisoning is enabled. Us
disabling that wouldn't make much sense.

The whole thing with the reporting is that we are essentially just
ballooning in place. What we may do at some point in the future would
be to add an additional feature bit to do that for the standard
balloon/hinting case. Then when that is set, and we know the contents
won't match we can then just skip the madvise or hinting calls. That
way it becomes an opt-in which is what the poison was supposed to be,
but wasn't because the QEMU side was never implemented.

In the meantime I still have to make more changes to my QEMU patch
set. The way the config_size logic is implemented is somewhat of a
pain when you factor in the way the host_features and poison were
handled.


Re: [virtio-dev] Re: [PATCH] virtio-balloon: Disable free page hinting/reporting if page poison is disabled

2020-04-21 Thread Alexander Duyck
On Tue, Apr 21, 2020 at 12:29 AM David Hildenbrand  wrote:
>
>  >>> 2. Can we assume that the guest will always rewrite the page after the
> >>> deflate in the case of init_on_free or poison?
> >>
> >> Depends on what we think is the right way to do - IOW if we think "some
> >> other content" as mentioned above is a BUG or not.
> >
> > So I wouldn't consider it a bug as the zero page probably doesn't
> > apply. We are basically just indicating we don't care about the
> > contents, we aren't giving it a value. At least that is how I see it
> > working based on how it was implemented.
> >
> >>> 3. Can we assume that free page hinting will always function as a
> >>> balloon setup, so no moving it over to a page reporting type setup?
> >>
> >> I think we have to define the valid semantics. That implies what would
> >> be valid to do with it. Unfortunately, we have to reverse-engineer here.
> >
> > So at this point a "hinted" page is a page that can essentially
> > transition to uninitialized while it is sitting in the balloon. I
> > suspect that is how we are going to have to treat it since that is the
> > behavior that it has right now.
>
> At least it's not what Michael initially thought should be done - IIRC.
>
> "We need to remember that the whole HINT thing is not a mandate for host
> to corrupt memory. It's just some info about page use guest gives host.
> If host corrupts memory it's broken ...").
>
> So the question remains: Does QEMU have a BUG or do we actually allow to
> corrupt guest memory.
>
> That leaves us with either
>
> 1. "Pages hinted via VIRTIO_BALLOON_F_FREE_PAGE_HINT might get replaced
> by zero page. However, as soon as the page is written by the guest (even
> before the hinting request was processed by the host), the modified page
> will stay - whereby the unwritten parts might either be from the old, or
> from the zero page." - a QEMU BUG.

How is this a bug? This is the behavior you would see with the current
balloon driver. When you put a page into a balloon it has the option
to either madvise it away, defer it, or just skip it because the
balloon is disabled. Is the "zero page" a typo? If it was
uninitialized data that would be a bug, but I don't see how a zero
page or the old data would be a bug.

The caveat for the hinting is that if the page is modified from the
point it is placed on the ring the dirty flag will be enforced for it
and will not be skipped as it will be captured in the next bitmap
sync.

> 2. "Pages hinted via VIRTIO_BALLOON_F_FREE_PAGE_HINT are considered
> unused and will contain an undefined/uninitialized content once hinted.
> As soon as the page is written by the guest (even before the hinting
> request was processed by the host), the modified page will stay. The
> guest should reinitialize the full page in case it cares about the
> actual content (e.g., page poisoning)."
>
>
> I tend to favor 2 - although it basically leaves no room for future
> improvement regarding skipping re-initialization in the guest after
> migration.

I agree. The main advantage would be that we get to keep all of the
existing functionality and wouldn't have to shut things down for
poison being enabled. However we are limited in that any future design
won't be able to skip over having to have the guest re-poison the
pages.

However making changes to behave more like 1 would break existing use
cases since right now page poisoning can be enabled and the guest can
make it work. If we refactored QEMU to make 1 work then we would lose
that.

> >>>
> >>> If we assume the above 3 items then there isn't any point in worrying
> >>> about poison when it comes to free page hinting. It doesn't make sense
> >>> to since they essentially negate it. As such I could go through this
> >>> patch and the QEMU patches and clean up any associations since the two
> >>> are not really tied together in any way.
> >>
> >> The big question is, if we want to support VIRTIO_BALLOON_F_PAGE_POISON
> >> with free page hinting. e.g.,:
> >>
> >> "Pages hinted via VIRTIO_BALLOON_F_FREE_PAGE_HINT might get replaced by
> >> a page full of X. However, as soon as the page is written by the guest
> >> (even before the hinting request was processed by the host), the
> >> modified page will stay - whereby the unwritten parts might either be
> >> from the old, or from a page filled with X. Without
> >> VIRTIO_BALLOON_F_PAGE_POISON, X is zero, otherwise it is poison_val."
>
> [...]
>
> >
> > With the VIRTIO_BALLOON_F_PAGE_POISON we could make it so that when
> > the page comes out of the balloon it is either unmodified or it is
> > poison_val. Without the VIRTIO_BALLOON_F_PAGE_POISON feature present
> > you cannot make that guarantee and are stuck with the page being
> > treated as either unmodified or uninitialized memory.
>
> While it would be possible, I doubt it will be easy to implement, and I
> still wonder if we should really care about optimizing an undocumented
> feature that takes three people to figure out the 

Re: [virtio-dev] Re: [PATCH] virtio-balloon: Disable free page hinting/reporting if page poison is disabled

2020-04-20 Thread Alexander Duyck
On Mon, Apr 20, 2020 at 6:28 AM David Hildenbrand  wrote:
>
> >>> Now we are talking about what's safe to do with the page.
> >>>
> >>> If POISON flag is set by hypervisor but clear by guest,
> >>> or poison_val is 0, then it's clear it's safe to blow
> >>> away the page if we can figure out it's unused.
> >>>
> >>> Otherwise, it's much less clear :)
> >>
> >> Hah! Agreed :D
> >
> > That isn't quite true. The problem is in the case of hinting it isn't
> > setting the page to 0. It is simply not migrating it. So if there is
> > data from an earlier pass it is stuck at that value. So the balloon
> > will see the poison/init on some pages cleared, however I suppose the
> > balloon doesn't care about the contents of the page. For the pages
> > that are leaked back out via the shrinker they will be dirtied so they
> > will end up being migrated with the correct value eventually.
>
> Right, I think current Linux guests are fine. The critical piece we are
> talking about is
>
> 1) Guest balloon allocates and hints a page
> 2) Hypervisor does not process hinting request
> 3) Guest frees the page and reuses it (especially, writes it).
> 4) Hypervisor processes the hinting request.
>
> AFAIU, as soon as the guest writes the page (e.g., zeroes it/poisons it
> in the buddy, or somebody who allocated it), the page *will* get
> migrated, even if 4) happens after 3). That's guaranteed by the 2-bitmap
> magic.

Yes. Basically the page will be flagged as dirty as soon as it is
written to, and that will be dealt with after the current batch of
hints are processed so we should be able to guarantee that the page
will have a coherent value stored in it.
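
The interplay described above can be sketched as a toy model. This is an
illustration only: the struct, field names, and bitmap layout here are my
own assumptions for exposition, not QEMU's actual dirty-tracking code.

```c
#include <assert.h>
#include <stdbool.h>

#define NPAGES 8

/* Toy model of the "2-bitmap magic": a hint only clears the migration
 * (dirty) bit for a page that has not been written since the current
 * hinting epoch began, so a guest write always wins over a late hint. */
struct migration_state {
	bool dirty[NPAGES];		/* pages still needing migration */
	bool dirtied_since[NPAGES];	/* written after the epoch started */
};

static void guest_writes(struct migration_state *s, int page)
{
	/* Any guest write re-marks the page in both bitmaps. */
	s->dirty[page] = true;
	s->dirtied_since[page] = true;
}

static void process_hint(struct migration_state *s, int page)
{
	/* Honor the hint only if the page stayed untouched since the
	 * epoch began; otherwise the page migrates with its new value. */
	if (!s->dirtied_since[page])
		s->dirty[page] = false;
}
```

In this model, steps 1-4 above come out safe: even if the hypervisor
processes the hint after the guest has reused the page, the write has
already re-set both bits, so the hint is ignored and the page migrates.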

> Now, assume the following happens (in some future Linux version) (due to
> your "simply not migrating it" comment):
>
> 1) Guest balloon allocates and hints a page. Assume the page is zero due
> to want_init_on_free().
> 2) Hypervisor processes the hinting request.
> 3) Guest frees the page. Assume we are implementing some magic to "skip"
> zeroing, as we assume it is still zero.
>
> Due to 2), the page won't get migrated. In 3) we expect the page to be
> 0. QEMU would have to make sure that we always get either the original,
> or a zero page on the destination. Otherwise, this smells like data
> corruption.

I agree, and that is my concern. From the time the page is hinted until
it is written to again it is in an indeterminate state. However, with
the current Linux guest implementation the page is rewritten, so
technically there isn't an issue. We would just need to make sure it
stays that way.

In addition it is already an exposed interface and this is the way it
works. I think if anything we are likely forced to document it as-is
and guarantee that the behavior is not changed.

> >
> >>> I'll have to come back and re-read the rest next week, this
> >>> is complex stuff and I'm too rushed with other things today.
> >>
> >> Yeah, I'm also loaded with other stuff. Maybe Alex has time to
> >> understand the details as well.
> >
> > So after looking it over again it makes a bit more sense why this
> > hasn't caused any issues so far, and I can see why the poison enabled
> > setup and hinting can work. The problem is I am left wondering what
> > assumptions we are allowed to leave in place.
> >
> > 1. Can we assume that we don't care about the contents in the pages in
> > the balloon changing?
>
> I think, we should define valid ways for the hypervisor to change it.
>
> "Pages hinted via VIRTIO_BALLOON_F_FREE_PAGE_HINT might get replaced by
> a zero page. However, as soon as the page is written by the guest (even
> before the hinting request was processed by the host), the modified page
> will stay - whereby the unwritten parts might either be from the old, or
> from the zero page."

Actually I don't think the zero page is accurate. A page that is
hinted basically implies we don't care about the content. I would
think that we could treat a hinted page as uninitialized memory.

> I think the debatable part is "whereby the unwritten parts might either
> be from the old, or from the zero page". AFAIU, you think it could
> happen in QEMU, that we have neither the old, nor the zero page, but
> instead some previous content. The question is if that's valid, or if
> that's a BUG in QEMU. If it's valid, we can do no optimizations in the
> guest (e.g., skip zeroing in the buddy). And I agree that this smells
> like "data corruption" as Michael said.

So if any part of the page is written to after it is hinted you will
be getting the modified page since it goes back to being "dirty". All
the hinting is doing is skipping the migration of dirty pages that are
"hinted" as long as they are not written to again. So the pages are
valid up until we migrate to the new system. At that point all of the
pages that are in the page hinting balloon will be stale data as we
skipped migrating them. Assuming something like page poisoning or
init_on_free are enabled they will be poisoned again when they 

Re: [virtio-dev] Re: [PATCH] virtio-balloon: Disable free page hinting/reporting if page poison is disabled

2020-04-17 Thread Alexander Duyck
On Fri, Apr 17, 2020 at 3:09 AM David Hildenbrand  wrote:
>
>  > What do you call "hinting ends" though? The fact we put
> > a page in the VQ is not a guarantee that it's been consumed
> > by the hypervisor.
> >
>
> I'd say hinting ends once the hypervisor sets FREE_PAGE_REPORT_S_DONE.

The key bit to this is that there are 4 states, and quasi-unlimited
command IDs, although I believe the first 2 are matched up to the
states. So the VIRTIO_BALLOON_CMD_ID_DONE is matched up with
FREE_PAGE_REPORT_S_DONE, and CMD_ID_STOP with S_STOP, but really all
it means is that we are done with the current epoch so we need to
flush the memory and move on. The state is more important to the
hypervisor as it will switch to "STOP" while it is syncing the dirty
bits, "REQUESTED" once that has been completed and it will increment
the command ID, "START" once the first hint is received with the
matching command ID, and "DONE" once the migration is complete. As
long as it is in the "START" state and the command ID in the hint
matches it will use the information to clear the dirty bits as it runs
in parallel with the migration task.
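
As a rough sketch of that handshake: the state names below loosely mirror
QEMU's FREE_PAGE_REPORT_S_* values, but the transition logic is a
simplified assumption for illustration, not the actual implementation.

```c
#include <assert.h>
#include <stdbool.h>

enum hint_state { S_STOP, S_REQUESTED, S_START, S_DONE };

struct hint_ctx {
	enum hint_state state;
	unsigned int cmd_id;	/* command ID for the current epoch */
};

/* Dirty-bit sync finished: bump the command ID and request fresh hints. */
static void request_hints(struct hint_ctx *c)
{
	c->state = S_REQUESTED;
	c->cmd_id++;		/* hints carrying an older ID are stale */
}

/* Returns true if a hint carrying hint_cmd_id may be used to clear
 * dirty bits, i.e. we are in (or just entered) the START state. */
static bool accept_hint(struct hint_ctx *c, unsigned int hint_cmd_id)
{
	if (hint_cmd_id != c->cmd_id)
		return false;	/* hint from a previous epoch */
	if (c->state == S_REQUESTED)
		c->state = S_START;	/* first matching hint starts epoch */
	return c->state == S_START;
}
```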

The piece I think I was missing was that the balloon is staying
(mostly) inflated until the migration is complete. If that is the case
then I suppose we could leave this enabled at least on the guest side,
assuming the balloon doesn't give back too many pages when it hits the
shrinker.

> >
> > I think a strict definition is this:
> > - hint includes a command ID
> > - hint implies "page was unused at some point after guest reading command 
> > ID"
> >
> >
> > Hypervisor can use dirty tracking tricks to get from that to
> > "page is unused at the moment".
> >
> >> Whereby X is
> >> currently assumed to be 0, correct?
> >
> >
> >
> > Now we are talking about what's safe to do with the page.
> >
> > If POISON flag is set by hypervisor but clear by guest,
> > or poison_val is 0, then it's clear it's safe to blow
> > away the page if we can figure out it's unused.
> >
> > Otherwise, it's much less clear :)
>
> Hah! Agreed :D

That isn't quite true. The problem is in the case of hinting it isn't
setting the page to 0. It is simply not migrating it. So if there is
data from an earlier pass it is stuck at that value. So the balloon
will see the poison/init on some pages cleared, however I suppose the
balloon doesn't care about the contents of the page. For the pages
that are leaked back out via the shrinker they will be dirtied so they
will end up being migrated with the correct value eventually.

> > I'll have to come back and re-read the rest next week, this
> > is complex stuff and I'm too rushed with other things today.
>
> Yeah, I'm also loaded with other stuff. Maybe Alex has time to
> understand the details as well.

So after looking it over again it makes a bit more sense why this
hasn't caused any issues so far, and I can see why the poison enabled
setup and hinting can work. The problem is I am left wondering what
assumptions we are allowed to leave in place.

1. Can we assume that we don't care about the contents in the pages in
the balloon changing?
2. Can we assume that the guest will always rewrite the page after the
deflate in the case of init_on_free or poison?
3. Can we assume that free page hinting will always function as a
balloon setup, so no moving it over to a page reporting type setup?

If we assume the above 3 items then there isn't any point in worrying
about poison when it comes to free page hinting. It doesn't make sense
to since they essentially negate it. As such I could go through this
patch and the QEMU patches and clean up any associations since the two
are not really tied together in any way.


Re: [virtio-dev] Re: [PATCH] virtio-balloon: Disable free page hinting/reporting if page poison is disabled

2020-04-16 Thread Alexander Duyck
On Thu, Apr 16, 2020 at 3:13 PM Michael S. Tsirkin  wrote:
>
> On Thu, Apr 16, 2020 at 12:30:38PM -0700, Alexander Duyck wrote:
> > From: Alexander Duyck 
> >
> > If we have free page hinting or page reporting enabled we should disable it
> > if the pages are poisoned or initialized on free and we cannot notify the
> > hypervisor.
> >
> > Fixes: 5d757c8d518d ("virtio-balloon: add support for providing free page reports to host")
> >
> > Signed-off-by: Alexander Duyck 
>
>
> Why not put this logic in the hypervisor?

I did that too. This is to cover the case where somebody is running
the code prior to my QEMU changes where the page poison feature wasn't
being enabled.

> We don't know what hypervisor uses the hints for.

I agree, but at the same time the way the feature was originally coded
it was only checked if the FREE_PAGE_HINT feature was enabled. The
assumption there is that if we have page poison data and want to use
hints we need to report it. In my mind if we ever want to switch over
to the page reporting style setup for page hinting in the future we
will need to have it behave in a sane manner. So disabling it if we
have a poison value we need to report, but have no mechanism to report
it makes sense to me.

The actual likelihood of us encountering this case should be pretty
low anyway since it is not that common to have page poisoning or
init_on_free enabled.

> Yes you can not just drop them but you can maybe do
> other things such as MADV_SOFT_OFFLINE.
>
> Finally, VIRTIO_BALLOON_F_FREE_PAGE_HINT does nothing
> at all unless guest gets the command from hypervisor,
> so there isn't even any overhead.

The problem is we cannot communicate the full situation to the
hypervisor without the page poison feature being present. As such I
would worry about that backfiring on us due to the hypervisor acting
on incomplete data.


[PATCH] virtio-balloon: Disable free page hinting/reporting if page poison is disabled

2020-04-16 Thread Alexander Duyck
From: Alexander Duyck 

If we have free page hinting or page reporting enabled we should disable it
if the pages are poisoned or initialized on free and we cannot notify the
hypervisor.

Fixes: 5d757c8d518d ("virtio-balloon: add support for providing free page reports to host")

Signed-off-by: Alexander Duyck 
---
 drivers/virtio/virtio_balloon.c |6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 95d9c2f0a7be..08bc86a6e468 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -1110,8 +1110,12 @@ static int virtballoon_validate(struct virtio_device 
*vdev)
/* Tell the host whether we care about poisoned pages. */
if (!want_init_on_free() &&
(IS_ENABLED(CONFIG_PAGE_POISONING_NO_SANITY) ||
-!page_poisoning_enabled()))
+!page_poisoning_enabled())) {
__virtio_clear_bit(vdev, VIRTIO_BALLOON_F_PAGE_POISON);
+   } else if (!virtio_has_feature(vdev, VIRTIO_BALLOON_F_PAGE_POISON)) {
+   __virtio_clear_bit(vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT);
+   __virtio_clear_bit(vdev, VIRTIO_BALLOON_F_REPORTING);
+   }
 
__virtio_clear_bit(vdev, VIRTIO_F_IOMMU_PLATFORM);
return 0;



[PATCH v2] virtio-balloon: Avoid using the word 'report' when referring to free page hinting

2020-04-15 Thread Alexander Duyck
From: Alexander Duyck 

It can be confusing to have multiple features within the same driver that
are using the same verbiage. As such this patch creates a union of
free_page_report_cmd_id with free_page_hint_cmd_id so that we can clean up
the userspace code a bit in terms of readability while maintaining the
functionality of legacy code.

Signed-off-by: Alexander Duyck 
---
 drivers/virtio/virtio_balloon.c |2 +-
 include/uapi/linux/virtio_balloon.h |   13 ++---
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 0ef16566c3f3..95d9c2f0a7be 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -580,7 +580,7 @@ static u32 virtio_balloon_cmd_id_received(struct 
virtio_balloon *vb)
if (test_and_clear_bit(VIRTIO_BALLOON_CONFIG_READ_CMD_ID,
   &vb->config_read_bitmap))
virtio_cread(vb->vdev, struct virtio_balloon_config,
-free_page_report_cmd_id,
+free_page_hint_cmd_id,
 &vb->cmd_id_received_cache);
 
return vb->cmd_id_received_cache;
diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index 19974392d324..3ce64f72bd09 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -34,7 +34,7 @@
 #define VIRTIO_BALLOON_F_MUST_TELL_HOST0 /* Tell before reclaiming 
pages */
 #define VIRTIO_BALLOON_F_STATS_VQ  1 /* Memory Stats virtqueue */
 #define VIRTIO_BALLOON_F_DEFLATE_ON_OOM2 /* Deflate balloon on OOM */
-#define VIRTIO_BALLOON_F_FREE_PAGE_HINT3 /* VQ to report free pages */
+#define VIRTIO_BALLOON_F_FREE_PAGE_HINT3 /* Free page hinting VQ and 
config */
 #define VIRTIO_BALLOON_F_PAGE_POISON   4 /* Guest is using page poisoning */
 #define VIRTIO_BALLOON_F_REPORTING 5 /* Page reporting virtqueue */
 
@@ -48,8 +48,15 @@ struct virtio_balloon_config {
__u32 num_pages;
/* Number of pages we've actually got in balloon. */
__u32 actual;
-   /* Free page report command id, readonly by guest */
-   __u32 free_page_report_cmd_id;
+   /*
+* Free page hint command id, readonly by guest.
+* Was previously named free_page_report_cmd_id so we
+* need to carry that name for legacy support.
+*/
+   union {
+   __u32 free_page_hint_cmd_id;
+   __u32 free_page_report_cmd_id;  /* deprecated */
+   };
/* Stores PAGE_POISON if page poisoning is in use */
__u32 poison_val;
 };



Re: [PATCH] virtio-balloon: Avoid using the word 'report' when referring to free page hinting

2020-04-15 Thread Alexander Duyck
On Wed, Apr 15, 2020 at 10:46 AM David Hildenbrand  wrote:
> Can we also change the comment of
>
> #define VIRTIO_BALLOON_F_FREE_PAGE_HINT  /* VQ to report free pages */
>
> ?
>
> Thanks!

No problem. I will update and submit a v2.

Thanks.

- Alex


[PATCH] virtio-balloon: Avoid using the word 'report' when referring to free page hinting

2020-04-15 Thread Alexander Duyck
From: Alexander Duyck 

It can be confusing to have multiple features within the same driver that
are using the same verbiage. As such this patch creates a union of
free_page_report_cmd_id with free_page_hint_cmd_id so that we can clean up
the userspace code a bit in terms of readability while maintaining the
functionality of legacy code.

Signed-off-by: Alexander Duyck 
---
 drivers/virtio/virtio_balloon.c |2 +-
 include/uapi/linux/virtio_balloon.h |   11 +--
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 0ef16566c3f3..95d9c2f0a7be 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -580,7 +580,7 @@ static u32 virtio_balloon_cmd_id_received(struct 
virtio_balloon *vb)
if (test_and_clear_bit(VIRTIO_BALLOON_CONFIG_READ_CMD_ID,
   &vb->config_read_bitmap))
virtio_cread(vb->vdev, struct virtio_balloon_config,
-free_page_report_cmd_id,
+free_page_hint_cmd_id,
 &vb->cmd_id_received_cache);
 
return vb->cmd_id_received_cache;
diff --git a/include/uapi/linux/virtio_balloon.h 
b/include/uapi/linux/virtio_balloon.h
index 19974392d324..dc3e656470dd 100644
--- a/include/uapi/linux/virtio_balloon.h
+++ b/include/uapi/linux/virtio_balloon.h
@@ -48,8 +48,15 @@ struct virtio_balloon_config {
__u32 num_pages;
/* Number of pages we've actually got in balloon. */
__u32 actual;
-   /* Free page report command id, readonly by guest */
-   __u32 free_page_report_cmd_id;
+   /*
+* Free page hint command id, readonly by guest.
+* Was previously named free_page_report_cmd_id so we
+* need to carry that name for legacy support.
+*/
+   union {
+   __u32 free_page_hint_cmd_id;
+   __u32 free_page_report_cmd_id;  /* deprecated */
+   };
/* Stores PAGE_POISON if page poisoning is in use */
__u32 poison_val;
 };



Re: [PATCH] virtio_pci: support enabling VFs

2018-05-31 Thread Alexander Duyck
On Wed, May 30, 2018 at 8:20 PM, Tiwei Bie  wrote:
> On Thu, May 31, 2018 at 01:11:37AM +0800, Rustad, Mark D wrote:
>> On May 30, 2018, at 9:54 AM, Duyck, Alexander H
>>  wrote:
>>
>> > On Wed, 2018-05-30 at 09:44 -0700, Rustad, Mark D wrote:
>> > > On May 30, 2018, at 9:22 AM, Michael S. Tsirkin  wrote:
>> > >
>> > > > > +static int virtio_pci_sriov_configure(struct pci_dev *pci_dev, int
>> > > > > num_vfs)
>> > > > > +{
>> > > > > + struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
>> > > > > + struct virtio_device *vdev = &vp_dev->vdev;
>> > > > > + int (*sriov_configure)(struct pci_dev *pci_dev, int num_vfs);
>> > > > > +
>> > > > > + if (!(vdev->config->get_status(vdev) & 
>> > > > > VIRTIO_CONFIG_S_DRIVER_OK))
>> > > > > + return -EBUSY;
>> > > > > +
>> > > > > + if (!__virtio_test_bit(vdev, VIRTIO_F_SR_IOV))
>> > > > > + return -EINVAL;
>> > > > > +
>> > > > > + sriov_configure = pci_sriov_configure_simple;
>> > > > > + if (sriov_configure == NULL)
>> > > > > + return -ENOENT;
>> > > >
>> > > > BTW what is all this trickery in aid of?
>> > >
>> > > When SR-IOV support is not compiled into the kernel,
>> > > pci_sriov_configure_simple is #defined as NULL. This allows it to compile
>> > > in that case, even though there is utterly no way for it to be called in
>> > > that case. It is an alternative to #ifs in the code.
>> >
>> > Why even have the call though? I would wrap all of this in an #ifdef
>> > and strip it out since you couldn't support SR-IOV if it isn't present
>> > in the kernel anyway.
>>
>> I am inclined to agree. In this case, the presence of #ifdefs I think would
>> be clearer. As written, someone will want to get rid of the pointer only to
>> create a build problem when SR-IOV is not configured.
>
> In my opinion, maybe it would be better to make
> pci_sriov_configure_simple() always available
> just like other sriov functions.
>
> Based on the comments in the original patch:
>
> https://patchwork.kernel.org/patch/10353197/
> """
> +/* this is expected to be used as a function pointer, just define as NULL */
> +#define pci_sriov_configure_simple NULL
> """
>
> This function could be defined as NULL just because
> it was expected to be used as a function pointer.
> But actually it could be called directly as a
> function, just like this case.
>
> So I prefer to make this function always available
> just like other sriov functions.
>
> Best regards,
> Tiwei Bie

The fact that you are having to add additional code kind of implies
that maybe this doesn't fall into the pci_sriov_configure_simple case
anymore. The PF itself is defining what the VF can and can't do via
the feature flags you are testing for.

For example how is the bit of code below valid if the kernel itself
doesn't support SR-IOV:
+static void vp_transport_features(struct virtio_device *vdev, u64 features)
+{
+   struct virtio_pci_device *vp_dev = to_vp_device(vdev);
+   struct pci_dev *pci_dev = vp_dev->pci_dev;
+
+   if ((features & BIT_ULL(VIRTIO_F_SR_IOV)) &&
+   pci_find_ext_capability(pci_dev, PCI_EXT_CAP_ID_SRIOV))
+   __virtio_set_bit(vdev, VIRTIO_F_SR_IOV);
+}
+

It really seems like we should be wrapping these functions at the very
minimum so that they don't imply you have SR-IOV support when it isn't
supported in the kernel.

Also it seems like we should be disabling the VFs if the driver is
unbound from this interface. We need to add logic to disable SR-IOV if
the driver is removed. What we don't want to do is leave VFs allocated
and then have the potential for us to unbind/rebind the driver as the
new driver may change the negotiated features.
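A minimal sketch of what that removal-path cleanup might look like. The pci_num_vf() and pci_disable_sriov() APIs are real kernel interfaces, but the function body below is hypothetical, not the actual driver code:

```c
/* Hypothetical addition to virtio_pci_remove(): tear down any VFs
 * before the PF driver goes away, so an unbind/rebind cannot leave
 * VFs running against a different set of negotiated features. */
static void virtio_pci_remove_sketch(struct pci_dev *pci_dev)
{
	/* ... existing virtio device teardown ... */

#ifdef CONFIG_PCI_IOV
	if (pci_num_vf(pci_dev))
		pci_disable_sriov(pci_dev);
#endif
}
```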

- Alex


Re: [virtio-dev] [PATCH] virtio_pci: support enabling VFs

2018-05-30 Thread Alexander Duyck
On Wed, May 30, 2018 at 1:55 AM, Tiwei Bie  wrote:
> There is a new feature bit allocated in virtio spec to
> support SR-IOV (Single Root I/O Virtualization):
>
> https://github.com/oasis-tcs/virtio-spec/issues/11
>
> This patch enables the support for this feature bit in
> virtio driver.
>
> Signed-off-by: Tiwei Bie 

So from a quick glance it looks like we are leaving SR-IOV enabled if
the driver is removed. Do we want to have that behavior or should we
be adding the code to disable SR-IOV and free the VFs on driver
removal?

> ---
> This patch depends on below proposal:
> https://lists.oasis-open.org/archives/virtio-dev/201805/msg00154.html
>
> This patch depends on below patch:
> https://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci.git/commit/?h=pci/virtualization=8effc395c2097e258fcedfc02ed4a66d45fb4238
>
>  drivers/virtio/virtio_pci_common.c | 20 
>  drivers/virtio/virtio_pci_modern.c | 14 ++
>  include/uapi/linux/virtio_config.h |  7 ++-
>  3 files changed, 40 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/virtio/virtio_pci_common.c b/drivers/virtio/virtio_pci_common.c
> index 48d4d1cf1cb6..023da80a7a3e 100644
> --- a/drivers/virtio/virtio_pci_common.c
> +++ b/drivers/virtio/virtio_pci_common.c
> @@ -588,6 +588,25 @@ static void virtio_pci_remove(struct pci_dev *pci_dev)
> put_device(dev);
>  }
>
> +static int virtio_pci_sriov_configure(struct pci_dev *pci_dev, int num_vfs)
> +{
> +   struct virtio_pci_device *vp_dev = pci_get_drvdata(pci_dev);
> +   struct virtio_device *vdev = &vp_dev->vdev;
> +   int (*sriov_configure)(struct pci_dev *pci_dev, int num_vfs);
> +
> +   if (!(vdev->config->get_status(vdev) & VIRTIO_CONFIG_S_DRIVER_OK))
> +   return -EBUSY;
> +
> +   if (!__virtio_test_bit(vdev, VIRTIO_F_SR_IOV))
> +   return -EINVAL;
> +
> +   sriov_configure = pci_sriov_configure_simple;
> +   if (sriov_configure == NULL)
> +   return -ENOENT;
> +
> +   return sriov_configure(pci_dev, num_vfs);
> +}
> +
>  static struct pci_driver virtio_pci_driver = {
> .name   = "virtio-pci",
> .id_table   = virtio_pci_id_table,
> @@ -596,6 +615,7 @@ static struct pci_driver virtio_pci_driver = {
>  #ifdef CONFIG_PM_SLEEP
> .driver.pm  = &virtio_pci_pm_ops,
>  #endif
> +   .sriov_configure = virtio_pci_sriov_configure,
>  };
>
>  module_pci_driver(virtio_pci_driver);
> diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
> index 2555d80f6eec..07571daccfec 100644
> --- a/drivers/virtio/virtio_pci_modern.c
> +++ b/drivers/virtio/virtio_pci_modern.c
> @@ -153,14 +153,28 @@ static u64 vp_get_features(struct virtio_device *vdev)
> return features;
>  }
>
> +static void vp_transport_features(struct virtio_device *vdev, u64 features)
> +{
> +   struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> +   struct pci_dev *pci_dev = vp_dev->pci_dev;
> +
> +   if ((features & BIT_ULL(VIRTIO_F_SR_IOV)) &&
> +   pci_find_ext_capability(pci_dev, PCI_EXT_CAP_ID_SRIOV))
> +   __virtio_set_bit(vdev, VIRTIO_F_SR_IOV);
> +}
> +
>  /* virtio config->finalize_features() implementation */
>  static int vp_finalize_features(struct virtio_device *vdev)
>  {
> struct virtio_pci_device *vp_dev = to_vp_device(vdev);
> +   u64 features = vdev->features;
>
> /* Give virtio_ring a chance to accept features. */
> vring_transport_features(vdev);
>
> +   /* Give virtio_pci a chance to accept features. */
> +   vp_transport_features(vdev, features);
> +
> if (!__virtio_test_bit(vdev, VIRTIO_F_VERSION_1)) {
> dev_err(&vdev->dev, "virtio: device uses modern interface "
> "but does not have VIRTIO_F_VERSION_1\n");
> diff --git a/include/uapi/linux/virtio_config.h b/include/uapi/linux/virtio_config.h
> index 308e2096291f..b7c1f4e7d59e 100644
> --- a/include/uapi/linux/virtio_config.h
> +++ b/include/uapi/linux/virtio_config.h
> @@ -49,7 +49,7 @@
>   * transport being used (eg. virtio_ring), the rest are per-device feature
>   * bits. */
>  #define VIRTIO_TRANSPORT_F_START   28
> -#define VIRTIO_TRANSPORT_F_END 34
> +#define VIRTIO_TRANSPORT_F_END 38
>
>  #ifndef VIRTIO_CONFIG_NO_LEGACY
>  /* Do we get callbacks when the ring is completely used, even if we've
> @@ -71,4 +71,9 @@
>   * this is for compatibility with legacy systems.
>   */
>  #define VIRTIO_F_IOMMU_PLATFORM33
> +
> +/*
> + * Does the device support Single Root I/O Virtualization?
> + */
> +#define VIRTIO_F_SR_IOV 37
>  #endif /* _UAPI_LINUX_VIRTIO_CONFIG_H */
> --
> 2.17.0
>
>
> -
> To unsubscribe, e-mail: virtio-dev-unsubscr...@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-h...@lists.oasis-open.org
>

Re: [virtio-dev] Re: [PATCH v7 net-next 2/4] net: Introduce generic failover module

2018-04-20 Thread Alexander Duyck
On Fri, Apr 20, 2018 at 8:34 AM, Michael S. Tsirkin  wrote:
> On Fri, Apr 20, 2018 at 08:21:00AM -0700, Samudrala, Sridhar wrote:
>> > > + finfo = netdev_priv(failover_dev);
>> > > +
>> > > + primary_dev = rtnl_dereference(finfo->primary_dev);
>> > > + standby_dev = rtnl_dereference(finfo->standby_dev);
>> > > +
>> > > + if (slave_dev != primary_dev && slave_dev != standby_dev)
>> > > + goto done;
>> > > +
>> > > + if ((primary_dev && failover_xmit_ready(primary_dev)) ||
>> > > + (standby_dev && failover_xmit_ready(standby_dev))) {
>> > > + netif_carrier_on(failover_dev);
>> > > + netif_tx_wake_all_queues(failover_dev);
>> > > + } else {
>> > > + netif_carrier_off(failover_dev);
>> > > + netif_tx_stop_all_queues(failover_dev);
>> > And I think it's a good idea to get stats from device here too.
>>
>> Not sure why we need to get stats from lower devs here?
>
> link down is often indication of a hardware problem.
> lower dev might stop responding down the road.
>
>> > > +static const struct net_device_ops failover_dev_ops = {
>> > > + .ndo_open   = failover_open,
>> > > + .ndo_stop   = failover_close,
>> > > + .ndo_start_xmit = failover_start_xmit,
>> > > + .ndo_select_queue   = failover_select_queue,
>> > > + .ndo_get_stats64= failover_get_stats,
>> > > + .ndo_change_mtu = failover_change_mtu,
>> > > + .ndo_set_rx_mode= failover_set_rx_mode,
>> > > + .ndo_validate_addr  = eth_validate_addr,
>> > > + .ndo_features_check = passthru_features_check,
>> > xdp support?
>>
>> I think it should be possible to add it by calling the lower dev ndo_xdp 
>> routines
>> with proper checks. can we add this later?
>
> I'd be concerned that if you don't xdp userspace will keep poking
> at lower devs. Then it will stop working if you add this later.

The failover device is better off not providing in-driver XDP since
there are already skbs allocated by the time that we see the packet
here anyway. As such generic XDP is the preferred way to handle this
since it will work regardless of what lower devices are present.

The only advantage of having XDP down at the virtio or VF level would
be that it performs better, but at the cost of complexity since we
would need to rebind the eBPF program any time a device is hotplugged
out and then back in. For now the current approach is in keeping with
how bonding and other similar drivers are currently handling this.

Thanks.

- Alex


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-27 Thread Alexander Duyck
On Tue, Feb 27, 2018 at 12:49 AM, Jiri Pirko  wrote:
> Tue, Feb 20, 2018 at 05:04:29PM CET, alexander.du...@gmail.com wrote:
>>On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko  wrote:
>>> Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudr...@intel.com wrote:
Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
used by hypervisor to indicate that virtio_net interface should act as
a backup for another device with the same MAC address.

Patch 2 is in response to the community request for a 3 netdev
solution.  However, it creates some issues we'll get into in a moment.
It extends virtio_net to use alternate datapath when available and
registered. When BACKUP feature is enabled, virtio_net driver creates
an additional 'bypass' netdev that acts as a master device and controls
2 slave devices.  The original virtio_net netdev is registered as
'backup' netdev and a passthru/vf device with the same MAC gets
registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
associated with the same 'pci' device.  The user accesses the network
interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
as default for transmits when it is available with link up and running.
>>>
>>> Sorry, but this is ridiculous. You are apparently re-implementing part
>>> of bonding driver as a part of NIC driver. Bond and team drivers
>>> are mature solutions, well tested, broadly used, with lots of issues
>>> resolved in the past. What you try to introduce is a weird shortcut
>>> that already has a couple of issues as you mentioned and will certainly
>>> have many more. Also, I'm pretty sure that in future, someone comes up
>>> with ideas like multiple VFs, LACP and similar bonding things.
>>
>>The problem with the bond and team drivers is they are too large and
>>have too many interfaces available for configuration so as a result
>>they can really screw this interface up.
>>
>>Essentially this is meant to be a bond that is more-or-less managed by
>>the host, not the guest. We want the host to be able to configure it
>>and have it automatically kick in on the guest. For now we want to
>>avoid adding too much complexity as this is meant to be just the first
>>step. Trying to go in and implement the whole solution right from the
>>start based on existing drivers is going to be a massive time sink and
>>will likely never get completed due to the fact that there is always
>>going to be some other thing that will interfere.
>>
>>My personal hope is that we can look at doing a virtio-bond sort of
>>device that will handle all this as well as providing a communication
>>channel, but that is much further down the road. For now we only have
>>a single bit so the goal for now is trying to keep this as simple as
>>possible.
>
> I have another usecase that would require the solution to be different
> then what you suggest. Consider following scenario:
> - baremetal has 2 sr-iov nics
> - there is a vm, has 1 VF from each nics: vf0, vf1. No virtio_net
> - baremetal would like to somehow tell the VM to bond vf0 and vf1
>   together and how this bonding should be configured, according to how
>   the VF representors are configured on the baremetal (LACP for example)
>
> The baremetal could decide to remove any VF during the VM runtime, it
> can add another VF there. For migration, it can add virtio_net. The VM
> should be inctructed to bond all interfaces together according to how
> baremetal decided - as it knows better.
>
> For this we need a separate communication channel from baremetal to VM
> (perhaps something re-usable already exists), we need something to
> listen to the events coming from this channel (kernel/userspace) and to
> react accordingly (create bond/team, enslave, etc).
>
> Now the question is: is it possible to merge the demands you have and
> the generic needs I described into a single solution? From what I see,
> that would be quite hard/impossible. So at the end, I think that we have
> to end-up with 2 solutions:
> 1) virtio_net, netvsc in-driver bonding - very limited, stupid, 0config
>solution that works for all (no matter what OS you use in VM)
> 2) team/bond solution with assistance of preferably userspace daemon
>getting info from baremetal. This is not 0config, but minimal config
>- user just have to define this "magic bonding" should be on.
>This covers all possible usecases, including multiple VFs, RDMA, etc.
>
> Thoughts?

So that is about what I had in mind. We end up having to do something
completely different to support this more complex solution. I think we
might have referred to it as v2/v3 in a different thread, and
virt-bond in this thread.

Basically we need some sort of PCI or PCIe topology mapping for the
devices that can be translated into something we can communicate over
the communication channel. After that we also have the added
complexity of how do we figure out 

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-25 Thread Alexander Duyck
On Fri, Feb 23, 2018 at 3:59 PM, Stephen Hemminger
<step...@networkplumber.org> wrote:
> On Thu, 22 Feb 2018 13:30:12 -0800
> Alexander Duyck <alexander.du...@gmail.com> wrote:
>
>> > Again, I undertand your motivation. Yet I don't like your solution.
>> > But if the decision is made to do this in-driver bonding. I would like
>> > to see it baing done some generic way:
>> > 1) share the same "in-driver bonding core" code with netvsc
>> >put to net/core.
>> > 2) the "in-driver bonding core" will strictly limit the functionality,
>> >like active-backup mode only, one vf, one backup, vf netdev type
>> >check (so noone could enslave a tap or anything else)
>> > If user would need something more, he should employ team/bond.
>
> Sharing would be good, but netvsc world would really like to only have
> one visible network device.

Other than the netdev count, are there any other issues we need to be
thinking about?

If I am not mistaken, netvsc doesn't put any broadcast/multicast
filters on the VF. If we ended up doing that in order to support the
virtio based solution would that cause any issues? I just realized we
had overlooked dealing with multicast in our current solution so we
will probably be looking at syncing the multicast list like what
occurs in netvsc, however we will need to do it for both the VF and
the virtio interfaces.

Thanks.

- Alex


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-25 Thread Alexander Duyck
On Fri, Feb 23, 2018 at 4:03 PM, Stephen Hemminger
<step...@networkplumber.org> wrote:
> (pruned to reduce thread)
>
> On Wed, 21 Feb 2018 16:17:19 -0800
> Alexander Duyck <alexander.du...@gmail.com> wrote:
>
>> >>> FWIW two solutions that immediately come to mind is to export "backup"
>> >>> as phys_port_name of the backup virtio link and/or assign a name to the
>> >>> master like you are doing already.  I think team uses team%d and bond
>> >>> uses bond%d, soft naming of master devices seems quite natural in this
>> >>> case.
>> >>
>> >> I figured I had overlooked something like that.. Thanks for pointing
>> >> this out. Okay so I think the phys_port_name approach might resolve
>> >> the original issue. If I am reading things correctly what we end up
>> >> with is the master showing up as "ens1" for example and the backup
>> >> showing up as "ens1nbackup". Am I understanding that right?
>> >>
>> >> The problem with the team/bond%d approach is that it creates a new
>> >> netdevice and so it would require guest configuration changes.
>> >>
>> >>> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio
>> >>> link is quite neat.
>> >>
>> >> I agree. For non-"backup" virio_net devices would it be okay for us to
>> >> just return -EOPNOTSUPP? I assume it would be and that way the legacy
>> >> behavior could be maintained although the function still exists.
>> >>
>> >>>> - When the 'active' netdev is unplugged OR not present on a destination
>> >>>>   system after live migration, the user will see 2 virtio_net netdevs.
>> >>>
>> >>> That's necessary and expected, all configuration applies to the master
>> >>> so master must exist.
>> >>
>> >> With the naming issue resolved this is the only item left outstanding.
>> >> This becomes a matter of form vs function.
>> >>
>> The main complaint about the "3 netdev" solution is that it is confusing to
>> >> have the 2 netdevs present if the VF isn't there. The idea is that
>> >> having the extra "master" netdev there if there isn't really a bond is
>> >> a bit ugly.
>> >
>> > Is it this uglier in terms of user experience rather than
>> > functionality? I don't want it dynamically changed between 2-netdev
>> > and 3-netdev depending on the presence of VF. That gets back to my
>> > original question and suggestion earlier: why not just hide the lower
>> > netdevs from udev renaming and such? Which important observability
>> > benefits users may get if exposing the lower netdevs?
>> >
>> > Thanks,
>> > -Siwei
>>
>> The only real advantage to a 2 netdev solution is that it looks like
>> the netvsc solution, however it doesn't behave like it since there are
>> some features like XDP that may not function correctly if they are
>> left enabled in the virtio_net interface.
>>
>> As far as functionality the advantage of not hiding the lower devices
>> is that they are free to be managed. The problem with pushing all of
>> the configuration into the upper device is that you are limited to the
>> intersection of the features of the lower devices. This can be
>> limiting for some setups as some VFs support things like more queues,
>> or better interrupt moderation options than others so trying to make
>> everything work with one config would be ugly.
>>
>
>
> Let's not make XDP the blocker for doing the best solution
> from the end user point of view. XDP is just yet another offload
> thing which needs to be handled.  The current backup device solution
> used in netvsc doesn't handle the full range of offload options
> (things like flow direction, DCB, etc); no one but the HW vendors
> seems to care.

XDP isn't the blocker here. As far as I am concerned we can go either
way, with a 2 netdev or a 3 netdev solution. We just need to make sure
we are aware of all the trade-offs, and make a decision one way or the
other. This is quickly turning into a bikeshed and I would prefer us
to all agree, or at least disagree and commit, on which way to go
before we burn more cycles on a patch set that seems to be getting
tied up in debate.

With the 2 netdev solution we have to limit the functionality so that
we don't break things when we bypass the guts of the driver to hand
traffic off to the VF. That ends up meaning that we are stuck wi

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-22 Thread Alexander Duyck
On Thu, Feb 22, 2018 at 12:11 AM, Jiri Pirko  wrote:
> Wed, Feb 21, 2018 at 09:57:09PM CET, alexander.du...@gmail.com wrote:
>>On Wed, Feb 21, 2018 at 11:38 AM, Jiri Pirko  wrote:
>>> Wed, Feb 21, 2018 at 06:56:35PM CET, alexander.du...@gmail.com wrote:
On Wed, Feb 21, 2018 at 8:58 AM, Jiri Pirko  wrote:
> Wed, Feb 21, 2018 at 05:49:49PM CET, alexander.du...@gmail.com wrote:
>>On Wed, Feb 21, 2018 at 8:11 AM, Jiri Pirko  wrote:
>>> Wed, Feb 21, 2018 at 04:56:48PM CET, alexander.du...@gmail.com wrote:
On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko  wrote:
> Tue, Feb 20, 2018 at 11:33:56PM CET, kubak...@wp.pl wrote:
>>On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
>>> Yeah, I can see it now :( I guess that the ship has sailed and we are
>>> stuck with this ugly thing forever...
>>>
>>> Could you at least make some common code that is shared in between
>>> netvsc and virtio_net so this is handled in exactly the same way in both?
>>
>>IMHO netvsc is a vendor specific driver which made a mistake on what
>>behaviour it provides (or tried to align itself with Windows SR-IOV).
>>Let's not make a far, far more commonly deployed and important driver
>>(virtio) bug-compatible with netvsc.
>
> Yeah. netvsc solution is a dangerous precedent here and in my opinion
> it was a huge mistake to merge it. I personally would vote to unmerge it
> and make the solution based on team/bond.
>
>
>>
>>To Jiri's initial comments, I feel the same way, in fact I've talked 
>>to
>>the NetworkManager guys to get auto-bonding based on MACs handled in
>>user space.  I think it may very well get done in next versions of NM,
>>but isn't done yet.  Stephen also raised the point that not everybody 
>>is
>>using NM.
>
> Can be done in NM, networkd or other network management tools.
> Even easier to do this in teamd and let them all benefit.
>
> Actually, I took a stab to implement this in teamd. Took me like an 
> hour
> and half.
>
> You can just run teamd with config option "kidnap" like this:
> # teamd/teamd -c '{"kidnap": true }'
>
> Whenever teamd sees another netdev to appear with the same mac as his,
> or whenever teamd sees another netdev to change mac to his,
> it enslaves it.
>
> Here's the patch (quick and dirty):
>
> Subject: [patch teamd] teamd: introduce kidnap feature
>
> Signed-off-by: Jiri Pirko 

So this doesn't really address the original problem we were trying to
solve. You asked earlier why the netdev name mattered and it mostly
has to do with configuration. Specifically what our patch is
attempting to resolve is the issue of how to allow a cloud provider to
upgrade their customer to SR-IOV support and live migration without
requiring them to reconfigure their guest. So the general idea with
our patch is to take a VM that is running with virtio_net only and
allow it to instead spawn a virtio_bypass master using the same netdev
name as the original virtio, and then have the virtio_net and VF come
up and be enslaved by the bypass interface. Doing it this way we can
allow for multi-vendor SR-IOV live migration support using a guest
that was originally configured for virtio only.

The problem with your solution is we already have teaming and bonding
as you said. There is already a write-up from Red Hat on how to do it
(https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts).
That is all well and good as long as you are willing to keep around
two VM images, one for virtio, and one for SR-IOV with live migration.
>>>
>>> You don't need 2 images. You need only one. The one with the team setup.
>>> That's it. If another netdev with the same mac appears, teamd will
>>> enslave it and run traffic on it. If not, ok, you'll go only through
>>> virtio_net.
>>
>>Isn't that going to cause the routing table to get messed up when we
>>rearrange the netdevs? We don't want to have an significant disruption
>> in traffic when we are adding/removing the VF. It seems like we would
>>need to invalidate any entries that were configured for the virtio_net
>>and reestablish them on the new team interface. Part of the criteria
>>we have been working with is that we should be able to transition from

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-22 Thread Alexander Duyck
On Thu, Feb 22, 2018 at 5:07 AM, Jiri Pirko  wrote:
> Thu, Feb 22, 2018 at 12:54:45PM CET, gerlitz...@gmail.com wrote:
>>On Thu, Feb 22, 2018 at 10:11 AM, Jiri Pirko  wrote:
>>> Wed, Feb 21, 2018 at 09:57:09PM CET, alexander.du...@gmail.com wrote:
>>
The signaling isn't too much of an issue since we can just tweak the
link state of the VF or virtio manually to report the link up or down
prior to the hot-plug. Now that we are on the same page with the team0
>>
>>> Oh, so you just do "ip link set vfrepresentor down" in the host.
>>> That makes sense. I'm pretty sure that this is not implemented for all
>>> drivers now.
>>
>>mlx5 supports that, on the representor close ndo we take the VF link
>>operational v-link down
>>
>>We should probably also put into the picture some/more aspects
>>from the host side of things. The provisioning of the v-switch now
>>have to deal with two channels going into the VM, the PV (virtio)
>>one and the PT (VF) one.
>>
>>This should probably boil down to apply teaming/bonding between
>>the VF representor and a PV backend device, e.g TAP.
>
> Yes, that is correct.

That was my thought on it. If you wanted to you could probably even
look at making the PV the active one in the pair from the host side if
you wanted to avoid the PCIe overhead for things like
broadcast/multicast. The only limitation is that you might need to
have the bond take care of the appropriate switchdev bits so that you
still programmed rules into the hardware even if you are transmitting
down the PV side of the device.

For legacy setups I still need to work on putting together a source
mode macvlan based setup to handle acting like port representors for
the VFs and uplink.

- Alex


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-21 Thread Alexander Duyck
On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu <losewe...@gmail.com> wrote:
> I haven't checked emails for days and did not realize the new revision
> had already came out. And thank you for the effort, this revision
> really looks to be a step forward towards our use case and is close to
> what we wanted to do. A few questions in line.
>
> On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck
> <alexander.du...@gmail.com> wrote:
>> On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski <kubak...@wp.pl> wrote:
>>> On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:
>>>> Patch 2 is in response to the community request for a 3 netdev
>>>> solution.  However, it creates some issues we'll get into in a moment.
>>>> It extends virtio_net to use alternate datapath when available and
>>>> registered. When BACKUP feature is enabled, virtio_net driver creates
>>>> an additional 'bypass' netdev that acts as a master device and controls
>>>> 2 slave devices.  The original virtio_net netdev is registered as
>>>> 'backup' netdev and a passthru/vf device with the same MAC gets
>>>> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>>> associated with the same 'pci' device.  The user accesses the network
>>>> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>>>> as default for transmits when it is available with link up and running.
>>>
>>> Thank you for doing this.
>>>
>>>> We noticed a couple of issues with this approach during testing.
>>>> - As both 'bypass' and 'backup' netdevs are associated with the same
>>>>   virtio pci device, udev tries to rename both of them with the same name
>>>>   and the 2nd rename will fail. This would be OK as long as the first 
>>>> netdev
>>>>   to be renamed is the 'bypass' netdev, but the order in which udev gets
>>>>   to rename the 2 netdevs is not reliable.
>>>
>>> Out of curiosity - why do you link the master netdev to the virtio
>>> struct device?
>>
>> The basic idea of all this is that we wanted this to work with an
>> existing VM image that was using virtio. As such we were trying to
>> make it so that the bypass interface takes the place of the original
>> virtio and get udev to rename the bypass to what the original
>> virtio_net was.
>
> Could it made it also possible to take over the config from VF instead
> of virtio on an existing VM image? And get udev rename the bypass
> netdev to what the original VF was. I don't say tightly binding the
> bypass master to only virtio or VF, but I think we should provide both
> options to support different upgrade paths. Possibly we could tweak
> the device tree layout to reuse the same PCI slot for the master
> bypass netdev, such that udev would not get confused when renaming the
> device. The VF needs to use a different function slot afterwards.
> Perhaps we might need to a special multiseat like QEMU device for that
> purpose?
>
> Our case we'll upgrade the config from VF to virtio-bypass directly.

So if I am understanding what you are saying you are wanting to flip
the backup interface from the virtio to a VF. The problem is that
becomes a bit of a vendor lock-in solution since it would rely on a
specific VF driver. I would agree with Jiri that we don't want to go
down that path. We don't want every VF out there firing up its own
separate bond. Ideally you want the hypervisor to be able to manage
all of this which is why it makes sense to have virtio manage this and
why this is associated with the virtio_net interface.

The other bits get into more complexity then we are ready to handle
for now. I think I might have talked about something similar that I
was referring to as a "virtio-bond" where you would have a PCI/PCIe
tree topology that makes this easier to sort out, and the "virtio-bond
would be used to handle coordination/configuration of a much more
complex interface.

>>
>>> FWIW two solutions that immediately come to mind is to export "backup"
>>> as phys_port_name of the backup virtio link and/or assign a name to the
>>> master like you are doing already.  I think team uses team%d and bond
>>> uses bond%d, soft naming of master devices seems quite natural in this
>>> case.
>>
>> I figured I had overlooked something like that.. Thanks for pointing
>> this out. Okay so I think the phys_port_name approach might resolve
>> the original issue. If I am reading things correctly what we end up
>> with is the master showing up as "ens1" for example and the backup
>> showing up

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-21 Thread Alexander Duyck
On Wed, Feb 21, 2018 at 11:38 AM, Jiri Pirko  wrote:
> Wed, Feb 21, 2018 at 06:56:35PM CET, alexander.du...@gmail.com wrote:
>>On Wed, Feb 21, 2018 at 8:58 AM, Jiri Pirko  wrote:
>>> Wed, Feb 21, 2018 at 05:49:49PM CET, alexander.du...@gmail.com wrote:
On Wed, Feb 21, 2018 at 8:11 AM, Jiri Pirko  wrote:
> Wed, Feb 21, 2018 at 04:56:48PM CET, alexander.du...@gmail.com wrote:
>>On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko  wrote:
>>> Tue, Feb 20, 2018 at 11:33:56PM CET, kubak...@wp.pl wrote:
On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
> Yeah, I can see it now :( I guess that the ship has sailed and we are
> stuck with this ugly thing forever...
>
> Could you at least make some common code that is shared in between
> netvsc and virtio_net so this is handled in exactly the same way in both?

IMHO netvsc is a vendor specific driver which made a mistake on what
behaviour it provides (or tried to align itself with Windows SR-IOV).
Let's not make a far, far more commonly deployed and important driver
(virtio) bug-compatible with netvsc.
>>>
>>> Yeah. netvsc solution is a dangerous precedent here and in my opinion
>>> it was a huge mistake to merge it. I personally would vote to unmerge it
>>> and make the solution based on team/bond.
>>>
>>>

To Jiri's initial comments, I feel the same way, in fact I've talked to
the NetworkManager guys to get auto-bonding based on MACs handled in
user space.  I think it may very well get done in next versions of NM,
but isn't done yet.  Stephen also raised the point that not everybody is
using NM.
>>>
>>> Can be done in NM, networkd or other network management tools.
>>> Even easier to do this in teamd and let them all benefit.
>>>
>>> Actually, I took a stab to implement this in teamd. Took me like an hour
>>> and half.
>>>
>>> You can just run teamd with config option "kidnap" like this:
>>> # teamd/teamd -c '{"kidnap": true }'
>>>
>>> Whenever teamd sees another netdev to appear with the same mac as his,
>>> or whenever teamd sees another netdev to change mac to his,
>>> it enslaves it.
>>>
>>> Here's the patch (quick and dirty):
>>>
>>> Subject: [patch teamd] teamd: introduce kidnap feature
>>>
>>> Signed-off-by: Jiri Pirko 
>>
>>So this doesn't really address the original problem we were trying to
>>solve. You asked earlier why the netdev name mattered and it mostly
>>has to do with configuration. Specifically what our patch is
>>attempting to resolve is the issue of how to allow a cloud provider to
>>upgrade their customer to SR-IOV support and live migration without
>>requiring them to reconfigure their guest. So the general idea with
>>our patch is to take a VM that is running with virtio_net only and
>>allow it to instead spawn a virtio_bypass master using the same netdev
>>name as the original virtio, and then have the virtio_net and VF come
>>up and be enslaved by the bypass interface. Doing it this way we can
>>allow for multi-vendor SR-IOV live migration support using a guest
>>that was originally configured for virtio only.
>>
>>The problem with your solution is we already have teaming and bonding
>>as you said. There is already a write-up from Red Hat on how to do it
>>(https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts).
>>That is all well and good as long as you are willing to keep around
>>two VM images, one for virtio, and one for SR-IOV with live migration.
>
> You don't need 2 images. You need only one. The one with the team setup.
> That's it. If another netdev with the same mac appears, teamd will
> enslave it and run traffic on it. If not, ok, you'll go only through
> virtio_net.

Isn't that going to cause the routing table to get messed up when we
rearrange the netdevs? We don't want to have a significant disruption
in traffic when we are adding/removing the VF. It seems like we would
need to invalidate any entries that were configured for the virtio_net
and reestablish them on the new team interface. Part of the criteria
we have been working with is that we should be able to transition from
having a VF to not or vice versa without seeing any significant
disruption in the traffic.
>>>
>>> What? You have routes on the team netdev. virtio_net and VF are only
>>> slaves. What are you talking about? I don't get it :/
>>
>>So let's walk through this by example. The general idea of the base case
>>for all this is somebody starting with virtio_net, we will call the

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-21 Thread Alexander Duyck
On Wed, Feb 21, 2018 at 8:58 AM, Jiri Pirko  wrote:
> Wed, Feb 21, 2018 at 05:49:49PM CET, alexander.du...@gmail.com wrote:
>>On Wed, Feb 21, 2018 at 8:11 AM, Jiri Pirko  wrote:
>>> Wed, Feb 21, 2018 at 04:56:48PM CET, alexander.du...@gmail.com wrote:
On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko  wrote:
> Tue, Feb 20, 2018 at 11:33:56PM CET, kubak...@wp.pl wrote:
>>On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
>>> Yeah, I can see it now :( I guess that the ship has sailed and we are
>>> stuck with this ugly thing forever...
>>>
>>> Could you at least make some common code that is shared in between
>>> netvsc and virtio_net so this is handled in exactly the same way in both?
>>
>>IMHO netvsc is a vendor specific driver which made a mistake on what
>>behaviour it provides (or tried to align itself with Windows SR-IOV).
>>Let's not make a far, far more commonly deployed and important driver
>>(virtio) bug-compatible with netvsc.
>
> Yeah. netvsc solution is a dangerous precedent here and in my opinion
> it was a huge mistake to merge it. I personally would vote to unmerge it
> and make the solution based on team/bond.
>
>
>>
>>To Jiri's initial comments, I feel the same way, in fact I've talked to
>>the NetworkManager guys to get auto-bonding based on MACs handled in
>>user space.  I think it may very well get done in next versions of NM,
>>but isn't done yet.  Stephen also raised the point that not everybody is
>>using NM.
>
> Can be done in NM, networkd or other network management tools.
> Even easier to do this in teamd and let them all benefit.
>
> Actually, I took a stab to implement this in teamd. Took me like an hour
> and half.
>
> You can just run teamd with config option "kidnap" like this:
> # teamd/teamd -c '{"kidnap": true }'
>
> Whenever teamd sees another netdev to appear with the same mac as his,
> or whenever teamd sees another netdev to change mac to his,
> it enslaves it.
>
> Here's the patch (quick and dirty):
>
> Subject: [patch teamd] teamd: introduce kidnap feature
>
> Signed-off-by: Jiri Pirko 

So this doesn't really address the original problem we were trying to
solve. You asked earlier why the netdev name mattered and it mostly
has to do with configuration. Specifically what our patch is
attempting to resolve is the issue of how to allow a cloud provider to
upgrade their customer to SR-IOV support and live migration without
requiring them to reconfigure their guest. So the general idea with
our patch is to take a VM that is running with virtio_net only and
allow it to instead spawn a virtio_bypass master using the same netdev
name as the original virtio, and then have the virtio_net and VF come
up and be enslaved by the bypass interface. Doing it this way we can
allow for multi-vendor SR-IOV live migration support using a guest
that was originally configured for virtio only.
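For illustration, here is roughly what the guest might see under this
3-netdev scheme (interface names, MAC, and addresses here are hypothetical,
not taken from the patch):

```shell
# Hypothetical guest view of the 3-netdev model: "ens1" is the bypass
# master the user actually configures, "ens1nbackup" the original
# virtio_net slave, and "ens2" the VF slave; all share one MAC.
ip -br link show
#   ens1         UP   52:54:00:12:34:56   (bypass master)
#   ens1nbackup  UP   52:54:00:12:34:56   (virtio_net slave of ens1)
#   ens2         UP   52:54:00:12:34:56   (VF slave of ens1)

# All configuration (DHCP address, routes) stays on the bypass master,
# so hot-plugging ens2 in or out does not disturb the guest's setup:
ip addr show dev ens1
ip route show dev ens1
```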

The problem with your solution is we already have teaming and bonding
as you said. There is already a write-up from Red Hat on how to do it
(https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts).
That is all well and good as long as you are willing to keep around
two VM images, one for virtio, and one for SR-IOV with live migration.
>>>
>>> You don't need 2 images. You need only one. The one with the team setup.
>>> That's it. If another netdev with the same mac appears, teamd will
>>> enslave it and run traffic on it. If not, ok, you'll go only through
>>> virtio_net.
>>
>>Isn't that going to cause the routing table to get messed up when we
>>rearrange the netdevs? We don't want to have a significant disruption
>>in traffic when we are adding/removing the VF. It seems like we would
>>need to invalidate any entries that were configured for the virtio_net
>>and reestablish them on the new team interface. Part of the criteria
>>we have been working with is that we should be able to transition from
>>having a VF to not or vice versa without seeing any significant
>>disruption in the traffic.
>
> What? You have routes on the team netdev. virtio_net and VF are only
> slaves. What are you talking about? I don't get it :/

So let's walk through this by example. The general idea of the base case
for all this is somebody starting with virtio_net, we will call the
interface "ens1" for now. It comes up and is assigned a dhcp address
and everything works as expected. Now in order to get better
performance we want to add a VF "ens2", but we don't want a new IP
address. Now if I understand correctly what will happen is that when
"ens2" appears on the system teamd will then create a new team

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-21 Thread Alexander Duyck
On Wed, Feb 21, 2018 at 8:11 AM, Jiri Pirko  wrote:
> Wed, Feb 21, 2018 at 04:56:48PM CET, alexander.du...@gmail.com wrote:
>>On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko  wrote:
>>> Tue, Feb 20, 2018 at 11:33:56PM CET, kubak...@wp.pl wrote:
On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
> Yeah, I can see it now :( I guess that the ship has sailed and we are
> stuck with this ugly thing forever...
>
> Could you at least make some common code that is shared in between
> netvsc and virtio_net so this is handled in exactly the same way in both?

IMHO netvsc is a vendor specific driver which made a mistake on what
behaviour it provides (or tried to align itself with Windows SR-IOV).
Let's not make a far, far more commonly deployed and important driver
(virtio) bug-compatible with netvsc.
>>>
>>> Yeah. netvsc solution is a dangerous precedent here and in my opinion
>>> it was a huge mistake to merge it. I personally would vote to unmerge it
>>> and make the solution based on team/bond.
>>>
>>>

To Jiri's initial comments, I feel the same way, in fact I've talked to
the NetworkManager guys to get auto-bonding based on MACs handled in
user space.  I think it may very well get done in next versions of NM,
but isn't done yet.  Stephen also raised the point that not everybody is
using NM.
>>>
>>> Can be done in NM, networkd or other network management tools.
>>> Even easier to do this in teamd and let them all benefit.
>>>
>>> Actually, I took a stab to implement this in teamd. Took me like an hour
>>> and half.
>>>
>>> You can just run teamd with config option "kidnap" like this:
>>> # teamd/teamd -c '{"kidnap": true }'
>>>
>>> Whenever teamd sees another netdev to appear with the same mac as his,
>>> or whenever teamd sees another netdev to change mac to his,
>>> it enslaves it.
>>>
>>> Here's the patch (quick and dirty):
>>>
>>> Subject: [patch teamd] teamd: introduce kidnap feature
>>>
>>> Signed-off-by: Jiri Pirko 
>>
>>So this doesn't really address the original problem we were trying to
>>solve. You asked earlier why the netdev name mattered and it mostly
>>has to do with configuration. Specifically what our patch is
>>attempting to resolve is the issue of how to allow a cloud provider to
>>upgrade their customer to SR-IOV support and live migration without
>>requiring them to reconfigure their guest. So the general idea with
>>our patch is to take a VM that is running with virtio_net only and
>>allow it to instead spawn a virtio_bypass master using the same netdev
>>name as the original virtio, and then have the virtio_net and VF come
>>up and be enslaved by the bypass interface. Doing it this way we can
>>allow for multi-vendor SR-IOV live migration support using a guest
>>that was originally configured for virtio only.
>>
>>The problem with your solution is we already have teaming and bonding
>>as you said. There is already a write-up from Red Hat on how to do it
>>(https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts).
>>That is all well and good as long as you are willing to keep around
>>two VM images, one for virtio, and one for SR-IOV with live migration.
>
> You don't need 2 images. You need only one. The one with the team setup.
> That's it. If another netdev with the same mac appears, teamd will
> enslave it and run traffic on it. If not, ok, you'll go only through
> virtio_net.

Isn't that going to cause the routing table to get messed up when we
rearrange the netdevs? We don't want to have a significant disruption
in traffic when we are adding/removing the VF. It seems like we would
need to invalidate any entries that were configured for the virtio_net
and reestablish them on the new team interface. Part of the criteria
we have been working with is that we should be able to transition from
having a VF to not or vice versa without seeing any significant
disruption in the traffic.
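To make the concern concrete, moving an existing virtio_net
configuration under a team master would look something like the
following (addresses hypothetical), and each of these steps disturbs
live traffic:

```shell
# Hypothetical re-plumbing when a team master is introduced after the
# guest was already configured on the bare virtio_net "ens1".
ip addr flush dev ens1                   # drops the configured address
ip link set ens1 down
ip link set ens1 master team0            # enslave virtio_net to team0
ip link set ens1 up
ip addr add 192.168.1.10/24 dev team0    # re-add the address on the master
ip route add default via 192.168.1.1 dev team0   # re-create the routes
```

Every entry that referenced ens1 (routes, neighbor cache, any socket
bound to the device) has to be invalidated and rebuilt on team0, which
is exactly the disruption described above.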

Also how does this handle any static configuration? I am assuming
everything here expects the team to be brought up as soon as it is
seen and assigned a DHCP address.

The solution as you have proposed seems problematic at best. I don't
see how the team solution works without introducing some sort of
traffic disruption to either add/remove the VF and bring up/tear down
the team interface. At that point we might as well just give up on
this piece of live migration support entirely since the disruption was
what we were trying to avoid. We might as well just hotplug out the VF
and hotplug in a virtio at the same bus device and function number and
just let udev take care of renaming it for us. The idea was supposed
to be a seamless transition between the two interfaces.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-21 Thread Alexander Duyck
On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko  wrote:
> Tue, Feb 20, 2018 at 11:33:56PM CET, kubak...@wp.pl wrote:
>>On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
>>> Yeah, I can see it now :( I guess that the ship has sailed and we are
>>> stuck with this ugly thing forever...
>>>
>>> Could you at least make some common code that is shared in between
>>> netvsc and virtio_net so this is handled in exactly the same way in both?
>>
>>IMHO netvsc is a vendor specific driver which made a mistake on what
>>behaviour it provides (or tried to align itself with Windows SR-IOV).
>>Let's not make a far, far more commonly deployed and important driver
>>(virtio) bug-compatible with netvsc.
>
> Yeah. netvsc solution is a dangerous precedent here and in my opinion
> it was a huge mistake to merge it. I personally would vote to unmerge it
> and make the solution based on team/bond.
>
>
>>
>>To Jiri's initial comments, I feel the same way, in fact I've talked to
>>the NetworkManager guys to get auto-bonding based on MACs handled in
>>user space.  I think it may very well get done in next versions of NM,
>>but isn't done yet.  Stephen also raised the point that not everybody is
>>using NM.
>
> Can be done in NM, networkd or other network management tools.
> Even easier to do this in teamd and let them all benefit.
>
> Actually, I took a stab to implement this in teamd. Took me like an hour
> and half.
>
> You can just run teamd with config option "kidnap" like this:
> # teamd/teamd -c '{"kidnap": true }'
>
> Whenever teamd sees another netdev to appear with the same mac as his,
> or whenever teamd sees another netdev to change mac to his,
> it enslaves it.
>
> Here's the patch (quick and dirty):
>
> Subject: [patch teamd] teamd: introduce kidnap feature
>
> Signed-off-by: Jiri Pirko 

So this doesn't really address the original problem we were trying to
solve. You asked earlier why the netdev name mattered and it mostly
has to do with configuration. Specifically what our patch is
attempting to resolve is the issue of how to allow a cloud provider to
upgrade their customer to SR-IOV support and live migration without
requiring them to reconfigure their guest. So the general idea with
our patch is to take a VM that is running with virtio_net only and
allow it to instead spawn a virtio_bypass master using the same netdev
name as the original virtio, and then have the virtio_net and VF come
up and be enslaved by the bypass interface. Doing it this way we can
allow for multi-vendor SR-IOV live migration support using a guest
that was originally configured for virtio only.

The problem with your solution is we already have teaming and bonding
as you said. There is already a write-up from Red Hat on how to do it
(https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts).
That is all well and good as long as you are willing to keep around
two VM images, one for virtio, and one for SR-IOV with live migration.
The problem is nobody wants to do that. What they want is to maintain
one guest image and if they decide to upgrade to SR-IOV they still
want their live migration and they don't want to have to reconfigure
the guest.

That said, it does seem to make the existing Red Hat solution easier to
manage since you wouldn't be guessing at the ifname, so I have provided
some feedback below.

> ---
>  include/team.h |  7 +++
>  libteam/ifinfo.c   | 20 
>  teamd/teamd.c  | 17 +
>  teamd/teamd.h  |  5 +
>  teamd/teamd_events.c   | 17 +
>  teamd/teamd_ifinfo_watch.c |  9 +
>  teamd/teamd_per_port.c |  7 ++-
>  7 files changed, 81 insertions(+), 1 deletion(-)
>
> diff --git a/include/team.h b/include/team.h
> index 9ae517d..b0c19c8 100644
> --- a/include/team.h
> +++ b/include/team.h
> @@ -137,6 +137,13 @@ struct team_ifinfo *team_get_next_ifinfo(struct team_handle *th,
>  #define team_for_each_ifinfo(ifinfo, th)   \
> for (ifinfo = team_get_next_ifinfo(th, NULL); ifinfo;   \
>  ifinfo = team_get_next_ifinfo(th, ifinfo))
> +
> +struct team_ifinfo *team_get_next_unlinked_ifinfo(struct team_handle *th,
> + struct team_ifinfo *ifinfo);
> +#define team_for_each_unlinked_ifinfo(ifinfo, th)  \
> +   for (ifinfo = team_get_next_unlinked_ifinfo(th, NULL); ifinfo;  \
> +ifinfo = team_get_next_unlinked_ifinfo(th, ifinfo))
> +
>  /* ifinfo getters */
>  bool team_is_ifinfo_removed(struct team_ifinfo *ifinfo);
>  uint32_t team_get_ifinfo_ifindex(struct team_ifinfo *ifinfo);
> diff --git a/libteam/ifinfo.c b/libteam/ifinfo.c
> index 5c32a9c..8f9548e 100644
> --- a/libteam/ifinfo.c
> +++ b/libteam/ifinfo.c
> @@ -494,6 +494,26 @@ struct team_ifinfo 

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-20 Thread Alexander Duyck
On Tue, Feb 20, 2018 at 12:14 PM, Jiri Pirko  wrote:
> Tue, Feb 20, 2018 at 06:14:32PM CET, sridhar.samudr...@intel.com wrote:
>>On 2/20/2018 8:29 AM, Jiri Pirko wrote:
>>> Tue, Feb 20, 2018 at 05:04:29PM CET, alexander.du...@gmail.com wrote:
>>> > On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko  wrote:
>>> > > Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudr...@intel.com wrote:
>>> > > > Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
>>> > > > used by hypervisor to indicate that virtio_net interface should act as
>>> > > > a backup for another device with the same MAC address.
>>> > > >
>>> > > > Patch 2 is in response to the community request for a 3 netdev
>>> > > > solution.  However, it creates some issues we'll get into in a moment.
>>> > > > It extends virtio_net to use alternate datapath when available and
>>> > > > registered. When BACKUP feature is enabled, virtio_net driver creates
>>> > > > an additional 'bypass' netdev that acts as a master device and 
>>> > > > controls
>>> > > > 2 slave devices.  The original virtio_net netdev is registered as
>>> > > > 'backup' netdev and a passthru/vf device with the same MAC gets
>>> > > > registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>> > > > associated with the same 'pci' device.  The user accesses the network
>>> > > > interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' 
>>> > > > netdev
>>> > > > as default for transmits when it is available with link up and 
>>> > > > running.
>>> > > Sorry, but this is ridiculous. You are apparently re-implementing part
>>> > > of bonding driver as a part of NIC driver. Bond and team drivers
>>> > > are mature solutions, well tested, broadly used, with lots of issues
>>> > > resolved in the past. What you try to introduce is a weird shortcut
>>> > > that already has couple of issues as you mentioned and will certainly
>>> > > have many more. Also, I'm pretty sure that in future, someone comes up
>>> > > with ideas like multiple VFs, LACP and similar bonding things.
>>> > The problem with the bond and team drivers is they are too large and
>>> > have too many interfaces available for configuration so as a result
>>> > they can really screw this interface up.
>>> What? Too large in which sense? Why "too many interfaces" is a problem?
>>> Also, team has only one interface to userspace team-generic-netlink.
>>>
>>>
>>> > Essentially this is meant to be a bond that is more-or-less managed by
>>> > the host, not the guest. We want the host to be able to configure it
>>> How is it managed by the host? In your usecase the guest has 2 netdevs:
>>> virtio_net, pci vf.
>>> I don't see how host can do any managing of that, other than the
>>> obvious. But still, the active/backup decision is done in guest. This is
>>> a simple bond/team usecase. As I said, there is something needed to be
>>> implemented in userspace in order to handle re-appear of vf netdev.
>>> But that should be fairly easy to do in teamd.
>>
>>The host manages the active/backup decision by
>>- assigning the same MAC address to both VF and virtio interfaces
>>- setting a BACKUP feature bit on virtio that enables virtio to transparently
>>take
>>  over the VFs datapath.
>>- only enable one datapath at anytime so that packets don't get looped back
>>- during live migration enable virtio datapath, unplug vf on the source and
>>replug
>>  vf on the destination.
>>
>>The VM is not expected and doesn't have any control of setting the MAC
>>address
>>or bringing up/down the links.
>>
>>This is the model that is currently supported with netvsc driver on Azure.
>
> Yeah, I can see it now :( I guess that the ship has sailed and we are
> stuck with this ugly thing forever...
>
> Could you at least make some common code that is shared in between
> netvsc and virtio_net so this is handled in exactly the same way in both?
>
> The fact that the netvsc/virtio_net kidnaps a netdev only because it
> has the same mac is going to give me some serious nightmares...
> I think we need to introduce some more strict checks.

In order for that to work we need to settle on a model for these. The
issue is that netvsc is using what we refer to as the "2 netdev" model
where they don't expose the paravirtual interface as its own netdev.
The opinion of Jakub and others has been that we should do a "3
netdev" model in the case of virtio_net since otherwise we will lose
functionality such as in-driver XDP and have to deal with an extra set
of qdiscs and Tx queue locks on transmit path.

Really at this point I am good either way, but we need to probably
have Stephen, Jakub, and whoever else had an opinion on the matter
sort out the 2 vs 3 argument before we could proceed on that. Most of
patch 2 in the set can easily be broken out into a separate file later
if we decide to go that route.

Thanks.

- Alex

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-20 Thread Alexander Duyck
On Tue, Feb 20, 2018 at 8:29 AM, Jiri Pirko  wrote:
> Tue, Feb 20, 2018 at 05:04:29PM CET, alexander.du...@gmail.com wrote:
>>On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko  wrote:
>>> Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudr...@intel.com wrote:
Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
used by hypervisor to indicate that virtio_net interface should act as
a backup for another device with the same MAC address.

Patch 2 is in response to the community request for a 3 netdev
solution.  However, it creates some issues we'll get into in a moment.
It extends virtio_net to use alternate datapath when available and
registered. When BACKUP feature is enabled, virtio_net driver creates
an additional 'bypass' netdev that acts as a master device and controls
2 slave devices.  The original virtio_net netdev is registered as
'backup' netdev and a passthru/vf device with the same MAC gets
registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
associated with the same 'pci' device.  The user accesses the network
interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
as default for transmits when it is available with link up and running.
>>>
>>> Sorry, but this is ridiculous. You are apparently re-implementing part
>>> of bonding driver as a part of NIC driver. Bond and team drivers
>>> are mature solutions, well tested, broadly used, with lots of issues
>>> resolved in the past. What you try to introduce is a weird shortcut
>>> that already has couple of issues as you mentioned and will certainly
>>> have many more. Also, I'm pretty sure that in future, someone comes up
>>> with ideas like multiple VFs, LACP and similar bonding things.
>>
>>The problem with the bond and team drivers is they are too large and
>>have too many interfaces available for configuration so as a result
>>they can really screw this interface up.
>
> What? Too large in which sense? Why "too many interfaces" is a problem?
> Also, team has only one interface to userspace team-generic-netlink.

Specifically I was working with bond. I had overlooked team for the
most part since it required an additional userspace daemon which
basically broke our requirement of no user-space intervention.

I was trying to focus on just doing an active/backup setup. The
problem is there are debugfs, sysfs, and procfs interfaces exposed
that we don't need and/or want. Adding any sort of interface to
exclude these would just bloat up the bonding driver, and leaving them
in would just be confusing since they would all need to be ignored. In
addition the steps needed to get the name to come out the same as the
original virtio interface would just bloat up bonding.

>>
>>Essentially this is meant to be a bond that is more-or-less managed by
>>the host, not the guest. We want the host to be able to configure it
>
> How is it managed by the host? In your usecase the guest has 2 netdevs:
> virtio_net, pci vf.
> I don't see how host can do any managing of that, other than the
> obvious. But still, the active/backup decision is done in guest. This is
> a simple bond/team usecase. As I said, there is something needed to be
> implemented in userspace in order to handle re-appear of vf netdev.
> But that should be fairly easy to do in teamd.
>
>
>>and have it automatically kick in on the guest. For now we want to
>>avoid adding too much complexity as this is meant to be just the first
>
> That's what I fear, "for now"..

I used the expression "for now" as I see this being the first stage of
a multi-stage process.

Step 1 is to get a basic virtio-bypass driver added to virtio so that
it is at least comparable to netvsc in terms of feature set and
enables basic network live migration.

Step 2 is adding some sort of dirty page tracking, preferably via
something like a paravirtual iommu interface. Once we have that we can
defer the eviction of the VF until the very last moment of the live
migration. For now I need to work on testing a modification to allow
mapping the entire guest as being pass-through for DMA to the device,
and requiring dynamic for any DMA that is bidirectional or from the
device.

Step 3 will be to start looking at advanced configuration. That is
where we drop the implementation in step 1 and instead look at
spawning something that looks more like the team type interface,
however instead of working with a user-space daemon we would likely
need to work with some sort of mailbox or message queue coming up from
the hypervisor. Then we can start looking at doing things like passing
up blocks of eBPF code to handle Tx port selection or whatever we
need.

>
>>step. Trying to go in and implement the whole solution right from the
>>start based on existing drivers is going to be a massive time sink and
>>will likely never get completed due to the fact that there is always
>>going to be some other thing that will 

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-20 Thread Alexander Duyck
On Tue, Feb 20, 2018 at 2:42 AM, Jiri Pirko  wrote:
> Fri, Feb 16, 2018 at 07:11:19PM CET, sridhar.samudr...@intel.com wrote:
>>Patch 1 introduces a new feature bit VIRTIO_NET_F_BACKUP that can be
>>used by hypervisor to indicate that virtio_net interface should act as
>>a backup for another device with the same MAC address.
>>
>>Patch 2 is in response to the community request for a 3 netdev
>>solution.  However, it creates some issues we'll get into in a moment.
>>It extends virtio_net to use alternate datapath when available and
>>registered. When BACKUP feature is enabled, virtio_net driver creates
>>an additional 'bypass' netdev that acts as a master device and controls
>>2 slave devices.  The original virtio_net netdev is registered as
>>'backup' netdev and a passthru/vf device with the same MAC gets
>>registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>associated with the same 'pci' device.  The user accesses the network
>>interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>>as default for transmits when it is available with link up and running.
>
> Sorry, but this is ridiculous. You are apparently re-implementing part
> of bonding driver as a part of NIC driver. Bond and team drivers
> are mature solutions, well tested, broadly used, with lots of issues
> resolved in the past. What you try to introduce is a weird shortcut
> that already has couple of issues as you mentioned and will certainly
> have many more. Also, I'm pretty sure that in future, someone comes up
> with ideas like multiple VFs, LACP and similar bonding things.

The problem with the bond and team drivers is they are too large and
have too many interfaces available for configuration so as a result
they can really screw this interface up.

Essentially this is meant to be a bond that is more-or-less managed by
the host, not the guest. We want the host to be able to configure it
and have it automatically kick in on the guest. For now we want to
avoid adding too much complexity as this is meant to be just the first
step. Trying to go in and implement the whole solution right from the
start based on existing drivers is going to be a massive time sink and
will likely never get completed due to the fact that there is always
going to be some other thing that will interfere.

My personal hope is that we can look at doing a virtio-bond sort of
device that will handle all this as well as providing a communication
channel, but that is much further down the road. For now we only have
a single bit, so the goal is to keep this as simple as
possible.

> What is the reason for this abomination? According to:
> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
> The reason is quite weak.
> User in the vm sees 2 (or more) netdevices, he puts them in bond/team
> and that's it. This works now! If the vm lacks some userspace features,
> let's fix it there! For example the MAC changes is something that could
> be easily handled in teamd userspace deamon.

I think you might have missed the point of this. This is meant to be a
simple interface so the guest should not be able to change the MAC
address, and it shouldn't require any userspace daemon to set up or
tear down. Ideally with this solution the virtio bypass will come up
and be assigned the name of the original virtio, and the "backup"
interface will come up and be assigned the name of the original virtio
with an additional "nbackup" tacked on via the phys_port_name, and
then whenever a VF is added it will automatically be enslaved by the
bypass interface, and it will be removed when the VF is hotplugged
out.

In my mind the difference between this and bond or team is where the
configuration interface lies. In the case of bond it is in the kernel.
If my understanding is correct team is mostly in user space. With this
the configuration interface is really down in the hypervisor and
requests are communicated up to the guest. I would prefer not to make
virtio_net dependent on the bonding or team drivers, or worse yet a
userspace daemon in the guest. For now I would argue we should keep
this as simple as possible just to support basic live migration. There
has already been discussions of refactoring this after it is in so
that we can start to combine the functionality here with what is there
in bonding/team, but the differences in configuration interface and
the size of the code bases will make it challenging to outright merge
this into something like that.
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC PATCH v3 2/3] virtio_net: Extend virtio to use VF datapath when available

2018-02-17 Thread Alexander Duyck
On Fri, Feb 16, 2018 at 7:04 PM, Jakub Kicinski <kubak...@wp.pl> wrote:
> On Fri, 16 Feb 2018 10:11:21 -0800, Sridhar Samudrala wrote:
>> This patch enables virtio_net to switch over to a VF datapath when a VF
>> netdev is present with the same MAC address. It allows live migration
>> of a VM with a direct attached VF without the need to setup a bond/team
>> between a VF and virtio net device in the guest.
>>
>> The hypervisor needs to enable only one datapath at any time so that
>> packets don't get looped back to the VM over the other datapath. When a VF
>> is plugged, the virtio datapath link state can be marked as down. The
>> hypervisor needs to unplug the VF device from the guest on the source host
>> and reset the MAC filter of the VF to initiate failover of datapath to
>> virtio before starting the migration. After the migration is completed,
>> the destination hypervisor sets the MAC filter on the VF and plugs it back
>> to the guest to switch over to VF datapath.
>>
>> When BACKUP feature is enabled, an additional netdev(bypass netdev) is
>> created that acts as a master device and tracks the state of the 2 lower
>> netdevs. The original virtio_net netdev is marked as 'backup' netdev and a
>> passthru device with the same MAC is registered as 'active' netdev.
>>
>> This patch is based on the discussion initiated by Jesse on this thread.
>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>
>> Signed-off-by: Sridhar Samudrala <sridhar.samudr...@intel.com>
>> Signed-off-by: Alexander Duyck <alexander.h.du...@intel.com>
>
>> +static void
>> +virtnet_bypass_get_stats(struct net_device *dev,
>> +  struct rtnl_link_stats64 *stats)
>> +{
>> + struct virtnet_bypass_info *vbi = netdev_priv(dev);
>> + const struct rtnl_link_stats64 *new;
>> + struct rtnl_link_stats64 temp;
>> + struct net_device *child_netdev;
>> +
>> + spin_lock(&vbi->stats_lock);
>> + memcpy(stats, &vbi->bypass_stats, sizeof(*stats));
>> +
>> + rcu_read_lock();
>> +
>> + child_netdev = rcu_dereference(vbi->active_netdev);
>> + if (child_netdev) {
>> + new = dev_get_stats(child_netdev, &temp);
>> + virtnet_bypass_fold_stats(stats, new, &vbi->active_stats);
>> + memcpy(&vbi->active_stats, new, sizeof(*new));
>> + }
>> +
>> + child_netdev = rcu_dereference(vbi->backup_netdev);
>> + if (child_netdev) {
>> + new = dev_get_stats(child_netdev, &temp);
>> + virtnet_bypass_fold_stats(stats, new, &vbi->backup_stats);
>> + memcpy(&vbi->backup_stats, new, sizeof(*new));
>> + }
>> +
>> + rcu_read_unlock();
>> +
>> + memcpy(&vbi->bypass_stats, stats, sizeof(*stats));
>> + spin_unlock(&vbi->stats_lock);
>> +}
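The body of virtnet_bypass_fold_stats() isn't shown in this hunk, but the pattern implied by the snapshot copies above is the usual bonding-style delta fold: the master keeps the last per-slave snapshot and only adds the delta since that snapshot, so a slave being unplugged and replugged doesn't make the aggregate counters jump. A minimal userspace sketch of that idea (field names are illustrative, not the real rtnl_link_stats64):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for the two counters we fold; the real code
 * walks every u64 field of struct rtnl_link_stats64. */
struct vstats {
	uint64_t rx_packets;
	uint64_t tx_packets;
};

/* Add only the growth since the last snapshot into the running total. */
static void fold_stats(struct vstats *res, const struct vstats *now,
		       const struct vstats *last)
{
	res->rx_packets += now->rx_packets - last->rx_packets;
	res->tx_packets += now->tx_packets - last->tx_packets;
}
```

After the fold, the caller copies `now` over `last` (the `memcpy(&vbi->active_stats, new, ...)` lines above) so the next call only counts new traffic.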
>> +
>> +static int virtnet_bypass_change_mtu(struct net_device *dev, int new_mtu)
>> +{
>> + struct virtnet_bypass_info *vbi = netdev_priv(dev);
>> + struct net_device *child_netdev;
>> + int ret = 0;
>> +
>> + child_netdev = rcu_dereference(vbi->active_netdev);
>> + if (child_netdev) {
>> + ret = dev_set_mtu(child_netdev, new_mtu);
>> + if (ret)
>> + return ret;
>> + }
>> +
>> + child_netdev = rcu_dereference(vbi->backup_netdev);
>> + if (child_netdev) {
>> + ret = dev_set_mtu(child_netdev, new_mtu);
>> + if (ret)
>> + netdev_err(child_netdev,
>> +"Unexpected failure to set mtu to %d\n",
>> +new_mtu);
>
> You should probably unwind if set fails on one of the legs.

Actually if we know that the backup is always going to be a virtio I
don't know if we even need to worry about the call failing. Last I
knew virtio_net doesn't implement ndo_change_mtu so I don't think it
is an issue. Unless a notifier blows up about it somewhere I don't
think there is anything that should prevent us from updating the MTU.

One interesting thing we may want to take a look at would be to tweak
the ordering of things based on if we are increasing or decreasing the
MTU. In the case of a increase we need to work from the bottom up, but
in the case of a decrease I wonder if we shouldn't be working from the
top down in order to guarantee we don't create an MTU mismatch
somewhere in the path.
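The ordering idea above can be sketched as follows. This is a hypothetical userspace model, not the patch's code: when shrinking, the master takes the new MTU first so nothing upstream can hand it an oversized frame; when growing, the master only advertises the larger MTU after every slave has accepted it.

```c
#include <assert.h>

/* Toy device: just an MTU field. A real slave could reject the value. */
struct vdev {
	int mtu;
};

static int vdev_set_mtu(struct vdev *d, int mtu)
{
	d->mtu = mtu;
	return 0;
}

static int bypass_change_mtu(struct vdev *master, struct vdev **slaves,
			     int nslaves, int new_mtu)
{
	int i, err;

	/* Decrease: top down, shrink the master first. */
	if (new_mtu < master->mtu)
		master->mtu = new_mtu;

	for (i = 0; i < nslaves; i++) {
		err = vdev_set_mtu(slaves[i], new_mtu);
		if (err)
			return err;
	}

	/* Increase: bottom up, the master grows only after the slaves. */
	master->mtu = new_mtu;
	return 0;
}
```

With this ordering no device on the path ever reports an MTU larger than what the devices below it can actually carry.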

>> + }
>> +
>> + dev->mtu = new_mtu;
>> + return 0;
>> +}
>
> nit: stats, mtu, all those mundane thin

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-17 Thread Alexander Duyck
On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski  wrote:
> On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:
>> Patch 2 is in response to the community request for a 3 netdev
>> solution.  However, it creates some issues we'll get into in a moment.
>> It extends virtio_net to use alternate datapath when available and
>> registered. When BACKUP feature is enabled, virtio_net driver creates
>> an additional 'bypass' netdev that acts as a master device and controls
>> 2 slave devices.  The original virtio_net netdev is registered as
>> 'backup' netdev and a passthru/vf device with the same MAC gets
>> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>> associated with the same 'pci' device.  The user accesses the network
>> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>> as default for transmits when it is available with link up and running.
>
> Thank you for doing this.
>
>> We noticed a couple of issues with this approach during testing.
>> - As both 'bypass' and 'backup' netdevs are associated with the same
>>   virtio pci device, udev tries to rename both of them with the same name
>>   and the 2nd rename will fail. This would be OK as long as the first netdev
>>   to be renamed is the 'bypass' netdev, but the order in which udev gets
>>   to rename the 2 netdevs is not reliable.
>
> Out of curiosity - why do you link the master netdev to the virtio
> struct device?

The basic idea of all this is that we wanted this to work with an
existing VM image that was using virtio. As such we were trying to
make it so that the bypass interface takes the place of the original
virtio and get udev to rename the bypass to what the original
virtio_net was.

> FWIW two solutions that immediately come to mind is to export "backup"
> as phys_port_name of the backup virtio link and/or assign a name to the
> master like you are doing already.  I think team uses team%d and bond
> uses bond%d, soft naming of master devices seems quite natural in this
> case.

I figured I had overlooked something like that. Thanks for pointing
this out. Okay so I think the phys_port_name approach might resolve
the original issue. If I am reading things correctly what we end up
with is the master showing up as "ens1" for example and the backup
showing up as "ens1nbackup". Am I understanding that right?

The problem with the team/bond%d approach is that it creates a new
netdevice and so it would require guest configuration changes.

> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio
> link is quite neat.

I agree. For non-"backup" virtio_net devices would it be okay for us to
just return -EOPNOTSUPP? I assume it would be and that way the legacy
behavior could be maintained although the function still exists.
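As a sketch of that behavior (a userspace model of the proposed ndo_get_phys_port_name semantics, not merged code): only a virtio link with the BACKUP feature bit reports a port name, so udev derives "<master>nbackup" for it, while legacy virtio devices return -EOPNOTSUPP and keep their existing naming.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Model of the proposal: report "backup" as the phys_port_name only
 * when the BACKUP feature bit is set; otherwise behave as before. */
static int virtnet_get_phys_port_name(bool backup, char *name, size_t len)
{
	if (!backup)
		return -EOPNOTSUPP;	/* legacy virtio: no port name */
	if (snprintf(name, len, "backup") >= (int)len)
		return -ENOBUFS;
	return 0;
}
```

udev's predictable-name logic appends the port name to the parent device's name, which is what turns "ens1" plus "backup" into "ens1nbackup".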

>> - When the 'active' netdev is unplugged OR not present on a destination
>>   system after live migration, the user will see 2 virtio_net netdevs.
>
> That's necessary and expected, all configuration applies to the master
> so master must exist.

With the naming issue resolved this is the only item left outstanding.
This becomes a matter of form vs function.

The main complaint about the "3 netdev" solution is that it is a bit
confusing to have the 2 netdevs present if the VF isn't there. The idea
is that having the extra "master" netdev there when there isn't really
a bond is a bit ugly.

The downside of the "2 netdev" solution is that you have to deal with
an extra layer of locking/queueing to get to the VF and you lose some
functionality since things like in-driver XDP have to be disabled in
order to maintain the same functionality when the VF is present or
not. However it looks more like classic virtio_net when the VF is not
present.


Re: [virtio-dev] Re: [RFC PATCH net-next v2 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-01-29 Thread Alexander Duyck
On Mon, Jan 29, 2018 at 10:24 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Sun, Jan 28, 2018 at 08:26:53PM -0800, Alexander Duyck wrote:
>> >> > For live migration with advanced usecases that Siwei is suggesting, i
>> >> > think we need a new driver with a new device type that can track the
>> >> > VF specific feature settings even when the VF driver is unloaded.
>> >
>> > I see no added value of the 3 netdev model, there is no need for a bond
>> > device.
>>
>> I agree a full-blown bond isn't what is needed. However, just forking
>> traffic out from virtio to a VF doesn't really solve things either.
>>
>> One of the issues as I see it is the fact that the qdisc model with
>> the merged device gets to be pretty ugly. There is the fact that every
>> packet that goes to the VF has to pass through the qdisc code twice.
>> There is the dual nature of the 2 netdev solution that also introduces
>> the potential for head-of-line blocking since the virtio could put
>> back pressure on the upper qdisc layer which could stall the VF
>> traffic when switching over. I hope we could avoid issues like that by
>> maintaining qdiscs per device queue instead of on an upper device that
>> is half software interface and half not. Ideally the virtio-bond
>> device could operate without a qdisc and without needing any
>> additional locks so there shouldn't be head of line blocking occurring
>> between the two interfaces and overhead could be kept minimal.
>>
>> Also in the case of virtio there is support for in-driver XDP. As
>> Sridhar stated, when using the 2 netdev model "we cannot support XDP
>> in this model and it needs to be disabled". That sounds like a step
>> backwards instead of forwards. I would much rather leave the XDP
>> enabled at the lower dev level, and then if we want we can use the
>> generic XDP at the virtio-bond level to capture traffic on both
>> interfaces at the same time.
>
> I agree dropping XDP makes everything very iffy.
>
>> In the case of netvsc you have control of both sides of a given link
>> so you can match up the RSS tables, queue configuration and everything
>> is somewhat symmetric since you are running the PF and all the HyperV
>> subchannels. Most of the complexity is pushed down into the host and
>> your subchannel management is synchronized there if I am not mistaken.
>> We don't have this in the case of this virtio-bond setup. Instead a
>> single bit is set indicating "backup" without indicating what that
>> means to topology other than the fact that this virtio interface is
>> the backup for some other interface. We are essentially blind other
>> than having the link status for the VF and virtio and knowing that the
>> virtio is intended to be the backup.
>
> Would you be interested in posting at least a proof of concept
> patch for this approach? It's OK if there are some TODO stubs.
> It would be much easier to compare code to code rather than
> a high level description to code.

That is the general idea. I was hoping to go the bonding route last
week but I got too snarled up trying to untangle the features we
didn't need. I have some code I am working on but don't have an ETA
for when it will be done.

At this point I am hoping we can build something based on Sridhar's
original patch that can addresses the items I brought up and shifts to
more of a 3 netdev model. If I am not mistaken we might be able to do
it as a separate driver that has a pair of calls that allow for
adding/removing a virt-bond that is provided with a MAC address and a
device pointer so that we can do the bits necessary to get ourselves
swapped with the original virtio device and identify the virtio as the
backup channel.
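That add/remove pairing could look something like the following. This is purely a hypothetical sketch of the API floated above; none of these names exist in the kernel, and the `void *` device handles stand in for real netdev pointers. The key point is that the MAC address is the pairing key: a VF is only enslaved as the active datapath if its MAC matches the one the virt-bond was created with.

```c
#include <string.h>

/* Hypothetical virt-bond state: one backup (virtio) slot, one
 * active (VF) slot, keyed on a shared MAC address. */
struct virt_bond {
	unsigned char mac[6];
	void *backup_dev;	/* the paravirt/virtio netdev */
	void *active_dev;	/* the VF, filled in on hotplug */
};

/* Called by the virtio driver when the BACKUP bit is seen. */
static void virt_bond_create(struct virt_bond *vb,
			     const unsigned char *mac, void *backup)
{
	memcpy(vb->mac, mac, 6);
	vb->backup_dev = backup;
	vb->active_dev = NULL;
}

/* Called on VF hotplug: enslave only if the MAC matches. */
static int virt_bond_join(struct virt_bond *vb,
			  const unsigned char *mac, void *vf)
{
	if (memcmp(vb->mac, mac, 6) != 0)
		return -1;
	vb->active_dev = vf;
	return 0;
}
```

A matching `virt_bond_leave()` on VF unplug would clear `active_dev` and fail traffic back over to the virtio backup.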

Thanks.

- Alex


Re: [virtio-dev] Re: [RFC PATCH net-next v2 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-01-29 Thread Alexander Duyck
On Sun, Jan 28, 2018 at 3:02 PM, Stephen Hemminger
 wrote:
> On Fri, 26 Jan 2018 18:30:03 -0800
> Jakub Kicinski  wrote:
>
>> On Fri, 26 Jan 2018 15:30:35 -0800, Samudrala, Sridhar wrote:
>> > On 1/26/2018 2:47 PM, Jakub Kicinski wrote:
>> > > On Sat, 27 Jan 2018 00:14:20 +0200, Michael S. Tsirkin wrote:
>> > >> On Fri, Jan 26, 2018 at 01:46:42PM -0800, Siwei Liu wrote:
>> >  and the VM is not expected to do any tuning/optimizations on the VF
>> >  driver directly, i think the current patch that follows the netvsc
>> >  model of 2 netdevs (virtio and vf) should work fine.
>> > >>> OK. For your use case that's fine. But that's too specific scenario
>> > >>> with lots of restrictions IMHO, perhaps very few users will benefit
>> > >>> from it, I'm not sure. If you're unwilling to move towards it, we'd
>> > >>> take this one and come back with a generic solution that is able to
>> > >>> address general use cases for VF/PT live migration .
>> > >> I think that's a fine approach. Scratch your own itch!  I imagine a very
>> > >> generic virtio-switchdev providing host routing info to guests could
>> > >> address lots of usecases. A driver could bind to that one and enslave
>> > >> arbitrary other devices.  Sounds reasonable.
>> > >>
>> > >> But given the fundamental idea of a failover was floated at least as
>> > >> early as 2013, and made 0 progress since precisely because it kept
>> > >> trying to address more and more features, and given netvsc is already
>> > >> using the basic solution with some success, I'm not inclined to block
>> > >> this specific effort waiting for the generic one.
>> > > I think there is an agreement that the extra netdev will be useful for
>> > > more advanced use cases, and is generally preferable.  What is the
>> > > argument for not doing that from the start?  If it was made I must have
>> > > missed it.  Is it just unwillingness to write the extra 300 lines of
>> > > code?  Sounds like a pretty weak argument when adding kernel ABI is at
>> > > stake...
>> >
>> > I am still not clear on the need for the extra netdev created by
>> > virtio_net. The only advantage i can see is that the stats can be
>> > broken out between VF and virtio datapaths compared to the aggregated
>> > stats on the virtio netdev as seen with the 2 netdev approach.
>>
>> Maybe you're not convinced but multiple arguments were made.
>>
>> > With 2 netdev model, any VM image that has a working network
>> > configuration will transparently get VF based acceleration without
>> > any changes.
>>
>> Nothing happens transparently.  Things may happen automatically.  The
>> VF netdev doesn't disappear with netvsc.  The PV netdev transforms into
>> something it did not use to be.  And configures and reports some
>> information from the PV (e.g. speed) but PV doesn't pass traffic any
>> longer.
>>
>> > 3 netdev model breaks this configuration starting with the creation
>> > and naming of the 2 devices to udev needing to be aware of master and
>> > slave virtio-net devices.
>>
>> I don't understand this comment.  There is one virtio-net device and
>> one "virtio-bond" netdev.  And user space has to be aware of the special
>> automatic arrangement anyway, because it can't touch the VF.  It
>> doesn't make any difference whether it ignores the VF or PV and VF.
>> It simply can't touch the slaves, no matter how many there are.
>>
>> > Also, from a user experience point of view, loading a virtio-net with
>> > BACKUP feature enabled will now show 2 virtio-net netdevs.
>>
>> One virtio-net and one virtio-bond, which represents what's happening.
>>
>> > For live migration with advanced usecases that Siwei is suggesting, i
>> > think we need a new driver with a new device type that can track the
>> > VF specific feature settings even when the VF driver is unloaded.
>
> I see no added value of the 3 netdev model, there is no need for a bond
> device.

I agree a full-blown bond isn't what is needed. However, just forking
traffic out from virtio to a VF doesn't really solve things either.

One of the issues as I see it is the fact that the qdisc model with
the merged device gets to be pretty ugly. There is the fact that every
packet that goes to the VF has to pass through the qdisc code twice.
There is the dual nature of the 2 netdev solution that also introduces
the potential for head-of-line blocking since the virtio could put
back pressure on the upper qdisc layer which could stall the VF
traffic when switching over. I hope we could avoid issues like that by
maintaining qdiscs per device queue instead of on an upper device that
is half software interface and half not. Ideally the virtio-bond
device could operate without a qdisc and without needing any
additional locks so there shouldn't be head of line blocking occurring
between the two interfaces and overhead could be kept minimal.

Also in the case of virtio there is support for in-driver XDP.

Re: [virtio-dev] Re: [RFC PATCH net-next v2 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-01-28 Thread Alexander Duyck
On Sun, Jan 28, 2018 at 1:01 PM, Samudrala, Sridhar
<sridhar.samudr...@intel.com> wrote:
>
>
> On 1/28/2018 12:18 PM, Alexander Duyck wrote:
>>
>> On Sun, Jan 28, 2018 at 11:18 AM, Samudrala, Sridhar
>> <sridhar.samudr...@intel.com> wrote:
>>>
>>> On 1/28/2018 9:35 AM, Alexander Duyck wrote:
>>>>
>>>> On Fri, Jan 26, 2018 at 9:58 PM, Jakub Kicinski <kubak...@wp.pl> wrote:
>>>>>
>>>>> On Fri, 26 Jan 2018 21:33:01 -0800, Samudrala, Sridhar wrote:
>>>>>>>>
>>>>>>>> 3 netdev model breaks this configuration starting with the creation
>>>>>>>> and naming of the 2 devices to udev needing to be aware of master
>>>>>>>> and
>>>>>>>> slave virtio-net devices.
>>>>>>>
>>>>>>> I don't understand this comment.  There is one virtio-net device and
>>>>>>> one "virtio-bond" netdev.  And user space has to be aware of the
>>>>>>> special
>>>>>>> automatic arrangement anyway, because it can't touch the VF.  It
>>>>>>> doesn't make any difference whether it ignores the VF or PV and VF.
>>>>>>> It simply can't touch the slaves, no matter how many there are.
>>>>>>
>>>>>> If the userspace is not expected to touch the slaves, then why do we
>>>>>> need to
>>>>>> take extra effort to expose a netdev that is just not really useful.
>>>>>
>>>>> You said:
>>>>> "[user space] needs to be aware of master and slave virtio-net
>>>>> devices."
>>>>>
>>>>> I'm saying:
>>>>> It has to be aware of the special arrangement whether there is an
>>>>> explicit bond netdev or not.
>>>>
>>>> To clarify here the kernel should be aware that there is a device that
>>>> is an aggregate of 2 other devices. It isn't as if we need to insert
>>>> the new device "above" the virtio.
>>>>
>>>> I have been doing a bit of messing around with a few ideas and it
>>>> seems like it would be better if we could replace the virtio interface
>>>> with the virtio-bond, renaming my virt-bond concept to this since it
>>>> is now supposedly going to live in the virtio driver, interface, and
>>>> then push the original virtio down one layer and call it a
>>>> virtio-backup. If I am not mistaken we could assign the same dev
>>>> pointer used by the virtio netdev to the virtio-bond, and if we
>>>> register it first with the "eth%d" name then udev will assume that the
>>>> virtio-bond device is the original virtio and all existing scripts
>>>> should just work with that. We then would want to change the name of
>>>> the virtio interface with the backup feature bit set, maybe call it
>>>> something like bkup-00:00:00 where the 00:00:00 would be the last 3
>>>> octets of the MAC address. It should solve the issue of inserting an
>>>> interface "above" the virtio by making the virtio-bond become the
>>>> virtio. The only limitation is that we will probably need to remove
>>>> the back-up if the virtio device is removed, however that was already
>>>> a limitation of this solution and others like the netvsc solution
>>>> anyway.
>>>
>>>
>>> With the 3 netdev model, if we make the master virtio-net associated
>>> with the real virtio pci device, i think the userspace scripts would
>>> not break. If we go this route, i am still not clear on the purpose
>>> of exposing the bkup netdev.
>>> Can we start with the 2 netdev model and move to the 3 netdev model
>>> later if we find out that there are limitations with the 2 netdev
>>> model? I don't think this will break any user API as the userspace is
>>> not expected to use the bkup netdev.
>>
>> The 2 netdev model breaks a large number of expectations of user
>> space. Among them is XDP since we cannot guarantee a symmetric setup
>> between any entity and the virtio. How does it make sense that
>> enabling XDP on virtio shows zero Rx packets, and in the meantime you
>> are getting all of the packets coming in off of the VF?
>
> Sure we cannot support XDP in this model and it needs to be disabled.
>>
>>
>> In addition we would need to rewrite the VLAN and MAC address
>> filtering ndo operations.

Re: [virtio-dev] Re: [RFC PATCH net-next v2 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-01-28 Thread Alexander Duyck
On Sun, Jan 28, 2018 at 11:18 AM, Samudrala, Sridhar
<sridhar.samudr...@intel.com> wrote:
> On 1/28/2018 9:35 AM, Alexander Duyck wrote:
>>
>> On Fri, Jan 26, 2018 at 9:58 PM, Jakub Kicinski <kubak...@wp.pl> wrote:
>>>
>>> On Fri, 26 Jan 2018 21:33:01 -0800, Samudrala, Sridhar wrote:
>>>>>>
>>>>>> 3 netdev model breaks this configuration starting with the creation
>>>>>> and naming of the 2 devices to udev needing to be aware of master and
>>>>>> slave virtio-net devices.
>>>>>
>>>>> I don't understand this comment.  There is one virtio-net device and
>>>>> one "virtio-bond" netdev.  And user space has to be aware of the
>>>>> special
>>>>> automatic arrangement anyway, because it can't touch the VF.  It
>>>>> doesn't make any difference whether it ignores the VF or PV and VF.
>>>>> It simply can't touch the slaves, no matter how many there are.
>>>>
>>>> If the userspace is not expected to touch the slaves, then why do we
>>>> need to
>>>> take extra effort to expose a netdev that is just not really useful.
>>>
>>> You said:
>>> "[user space] needs to be aware of master and slave virtio-net devices."
>>>
>>> I'm saying:
>>> It has to be aware of the special arrangement whether there is an
>>> explicit bond netdev or not.
>>
>> To clarify here the kernel should be aware that there is a device that
>> is an aggregate of 2 other devices. It isn't as if we need to insert
>> the new device "above" the virtio.
>>
>> I have been doing a bit of messing around with a few ideas and it
>> seems like it would be better if we could replace the virtio interface
>> with the virtio-bond, renaming my virt-bond concept to this since it
>> is now supposedly going to live in the virtio driver, interface, and
>> then push the original virtio down one layer and call it a
>> virtio-backup. If I am not mistaken we could assign the same dev
>> pointer used by the virtio netdev to the virtio-bond, and if we
>> register it first with the "eth%d" name then udev will assume that the
>> virtio-bond device is the original virtio and all existing scripts
>> should just work with that. We then would want to change the name of
>> the virtio interface with the backup feature bit set, maybe call it
>> something like bkup-00:00:00 where the 00:00:00 would be the last 3
>> octets of the MAC address. It should solve the issue of inserting an
>> interface "above" the virtio by making the virtio-bond become the
>> virtio. The only limitation is that we will probably need to remove
>> the back-up if the virtio device is removed, however that was already
>> a limitation of this solution and others like the netvsc solution
>> anyway.
>
>
> With the 3 netdev model, if we make the master virtio-net associated
> with the real virtio pci device, i think the userspace scripts would
> not break. If we go this route, i am still not clear on the purpose of
> exposing the bkup netdev.
> Can we start with the 2 netdev model and move to the 3 netdev model
> later if we find out that there are limitations with the 2 netdev
> model? I don't think this will break any user API as the userspace is
> not expected to use the bkup netdev.

The 2 netdev model breaks a large number of expectations of user
space. Among them is XDP since we cannot guarantee a symmetric setup
between any entity and the virtio. How does it make sense that
enabling XDP on virtio shows zero Rx packets, and in the meantime you
are getting all of the packets coming in off of the VF?

In addition we would need to rewrite the VLAN and MAC address
filtering ndo operations since we likely cannot add any VLANs since in
most cases VFs are VLAN locked due to things like port VLAN and we
cannot change the MAC address since the whole bonding concept is built
around it.

The last bit is the netpoll packet routing which the current code
assumes is using the virtio only, but I don't know if that is a valid
assumption since the VF is expected to be the default route for
everything else when it is available.

The point is by the time you are done you will have rewritten pretty
much all the network device ops. With that being the case why add all
the code to virtio itself instead of just coming up with a brand new
set of ndo_ops that belong to this new device, and you could leave the
existing virtio code in place and save yourself a bunch of time by
just accessing it as an existing call as a separate netdev.
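The shape of that separate set of ndo_ops is simple to model. This is a userspace sketch of the delegation idea, not kernel code: the master's ops don't reimplement any datapath logic, they just select the active slave (the VF) when one is enslaved and fall back to the virtio backup otherwise.

```c
#include <stddef.h>

/* Toy slave: count transmits so we can see which path was taken. */
struct slave_dev {
	int tx_count;
};

struct bond_dev {
	struct slave_dev *active;	/* VF, when present */
	struct slave_dev *backup;	/* the virtio device */
};

/* Core of the failover policy: prefer the VF, else the backup. */
static struct slave_dev *pick_tx_slave(struct bond_dev *m)
{
	return m->active ? m->active : m->backup;
}

/* Stand-in for the master's ndo_start_xmit. */
static int bond_start_xmit(struct bond_dev *m)
{
	struct slave_dev *s = pick_tx_slave(m);

	if (!s)
		return -1;	/* no path up: drop */
	s->tx_count++;
	return 0;
}
```

Every other op (set_rx_mode, vlan add/kill, get_stats, ...) would follow the same pattern of propagating to one or both slaves, which is why it amounts to a full ndo_ops set rather than a few hooks inside virtio_net.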

>>>>>> Also, from a user experie

Re: [virtio-dev] Re: [RFC PATCH net-next v2 2/2] virtio_net: Extend virtio to use VF datapath when available

2018-01-28 Thread Alexander Duyck
On Fri, Jan 26, 2018 at 9:58 PM, Jakub Kicinski  wrote:
> On Fri, 26 Jan 2018 21:33:01 -0800, Samudrala, Sridhar wrote:
>> >> 3 netdev model breaks this configuration starting with the creation
>> >> and naming of the 2 devices to udev needing to be aware of master and
>> >> slave virtio-net devices.
>> > I don't understand this comment.  There is one virtio-net device and
>> > one "virtio-bond" netdev.  And user space has to be aware of the special
>> > automatic arrangement anyway, because it can't touch the VF.  It
>> > doesn't make any difference whether it ignores the VF or PV and VF.
>> > It simply can't touch the slaves, no matter how many there are.
>>
>> If the userspace is not expected to touch the slaves, then why do we need to
>> take extra effort to expose a netdev that is just not really useful.
>
> You said:
> "[user space] needs to be aware of master and slave virtio-net devices."
>
> I'm saying:
> It has to be aware of the special arrangement whether there is an
> explicit bond netdev or not.

To clarify here the kernel should be aware that there is a device that
is an aggregate of 2 other devices. It isn't as if we need to insert
the new device "above" the virtio.

I have been doing a bit of messing around with a few ideas and it
seems like it would be better if we could replace the virtio interface
with the virtio-bond, renaming my virt-bond concept to this since it
is now supposedly going to live in the virtio driver, interface, and
then push the original virtio down one layer and call it a
virtio-backup. If I am not mistaken we could assign the same dev
pointer used by the virtio netdev to the virtio-bond, and if we
register it first with the "eth%d" name then udev will assume that the
virtio-bond device is the original virtio and all existing scripts
should just work with that. We then would want to change the name of
the virtio interface with the backup feature bit set, maybe call it
something like bkup-00:00:00 where the 00:00:00 would be the last 3
octets of the MAC address. It should solve the issue of inserting an
interface "above" the virtio by making the virtio-bond become the
virtio. The only limitation is that we will probably need to remove
the back-up if the virtio device is removed, however that was already
a limitation of this solution and others like the netvsc solution
anyway.
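The renaming scheme mentioned above is easy to sketch. Assuming the "bkup-" prefix and last-three-octets convention proposed in this thread (which is only a proposal, and note a literal ':' would not be a legal netdev name character in practice):

```c
#include <stdio.h>
#include <string.h>

/* Derive the proposed backup-interface name from the last three
 * octets of the MAC address, e.g. "bkup-a1:b2:c3". */
static int format_backup_name(char *buf, size_t len,
			      const unsigned char *mac)
{
	return snprintf(buf, len, "bkup-%02x:%02x:%02x",
			mac[3], mac[4], mac[5]);
}
```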

>> >> Also, from a user experience point of view, loading a virtio-net with
>> >> BACKUP feature enabled will now show 2 virtio-net netdevs.
>> > One virtio-net and one virtio-bond, which represents what's happening.
>> This again assumes that we want to represent a bond setup. Can't we
>> treat this
>> as virtio-net providing an alternate low-latency datapath by taking over
>> VF datapath?
>
> Bond is just a familiar name, we can call it something else if you
> prefer.  The point is there are two data paths which can have
> independent low-level settings and a higher level entity with
> global settings which represents any path to the outside world.
>
> Hiding low-level netdevs from a lay user requires a generic solution.

The last thing I think we should be doing is hiding the low level
netdevs. It doesn't solve anything. We are already trusting the owner
of the VM enough to let them have root access of the VM. That means
they can load/unload any driver they want. They don't have to use the
kernel provided virtio driver, they could load their own. They could
even do something like run DPDK on top of it, or they could run DPDK
on top of the VF. In either case there is no way the bond would ever
be created, and all hiding devices does is make it harder to fix
problems when something gets broken. Unless I am mistaken and
"security through obscurity" has somehow become a valid security
model.

As I mentioned to Sridhar on an off-list thread I think the goal
should be to make it so that the user wants to use the virtio-bond,
not make it so that they have no choice but to use it. The idea is we
should be making things easier for the administrator of the VM, not
harder.

- Alex


Re: [virtio-dev] [RFC PATCH net-next v2 1/2] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit

2018-01-22 Thread Alexander Duyck
On Mon, Jan 22, 2018 at 6:04 PM, Michael S. Tsirkin  wrote:
> On Mon, Jan 22, 2018 at 05:34:37PM -0800, Samudrala, Sridhar wrote:
>> On 1/22/2018 4:05 PM, Michael S. Tsirkin wrote:
>> > On Mon, Jan 22, 2018 at 03:27:40PM -0800, Samudrala, Sridhar wrote:
>> > > > > You could probably
>> > > > > even handle the Tx queue selection via a simple eBPF program and map
>> > > > > since the input for whatever is used to select Tx should be pretty
>> > > > > simple, destination MAC, source NUMA node, etc, and the data-set
>> > > > > shouldn't be too large.
>> > > > That sounds interesting. A separate device might make this kind of 
>> > > > setup
>> > > > a bit easier.  Sridhar, did you look into creating a separate device 
>> > > > for
>> > > > the virtual bond device at all?  It does not have to be in a separate
>> > > > module, that kind of refactoring can come later, but once we commit to
>> > > > using the same single device as virtio, we can't change that.
>> > > No. I haven't looked into creating a separate device. If we are going to
>> > > create a new
>> > > device, i guess it has to be of a new device type with its own driver.
>> > Well not necessarily - just a separate netdev ops.
>> > Kind of like e.g. vlans share a driver with the main driver.
>>
>> Not sure what you meant by vlans sharing a driver with the main driver.
>> IIUC, vlans are supported via the 802.1q driver, which creates a netdev of type
>> 'vlan'
>> with vlan_netdev_ops
>
> But nothing prevents a single module from registering
> multiple ops.

Just to clarify, it seems like what you are suggesting is just adding
the "master" as a separate set of netdev ops or netdevice and to have
virtio spawn two network devices, one slave and one master, if the
BACKUP bit is set. Do I have that right?

I am good with the code still living in the virtio driver and
consolidation with other similar implementations and further
improvement could probably happen later as part of some refactor.

Thanks.

- Alex


Re: [virtio-dev] [RFC PATCH net-next v2 1/2] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit

2018-01-17 Thread Alexander Duyck
On Wed, Jan 17, 2018 at 11:57 AM, Michael S. Tsirkin <m...@redhat.com> wrote:
> On Wed, Jan 17, 2018 at 11:25:41AM -0800, Samudrala, Sridhar wrote:
>>
>>
>> On 1/17/2018 11:02 AM, Michael S. Tsirkin wrote:
>> > On Wed, Jan 17, 2018 at 10:15:52AM -0800, Alexander Duyck wrote:
>> > > On Thu, Jan 11, 2018 at 9:58 PM, Sridhar Samudrala
>> > > <sridhar.samudr...@intel.com> wrote:
>> > > > This feature bit can be used by hypervisor to indicate virtio_net 
>> > > > device to
>> > > > act as a backup for another device with the same MAC address.
>> > > >
>> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudr...@intel.com>
>> > > > ---
>> > > >   drivers/net/virtio_net.c| 2 +-
>> > > >   include/uapi/linux/virtio_net.h | 3 +++
>> > > >   2 files changed, 4 insertions(+), 1 deletion(-)
>> > > >
>> > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> > > > index 12dfc5fee58e..f149a160a8c5 100644
>> > > > --- a/drivers/net/virtio_net.c
>> > > > +++ b/drivers/net/virtio_net.c
>> > > > @@ -2829,7 +2829,7 @@ static struct virtio_device_id id_table[] = {
>> > > >  VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
>> > > >  VIRTIO_NET_F_CTRL_MAC_ADDR, \
>> > > >  VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
>> > > > -   VIRTIO_NET_F_SPEED_DUPLEX
>> > > > +   VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_BACKUP
>> > > >
>> > > >   static unsigned int features[] = {
>> > > >  VIRTNET_FEATURES,
>> > > > diff --git a/include/uapi/linux/virtio_net.h 
>> > > > b/include/uapi/linux/virtio_net.h
>> > > > index 5de6ed37695b..c7c35fd1a5ed 100644
>> > > > --- a/include/uapi/linux/virtio_net.h
>> > > > +++ b/include/uapi/linux/virtio_net.h
>> > > > @@ -57,6 +57,9 @@
>> > > >   * Steering */
>> > > >   #define VIRTIO_NET_F_CTRL_MAC_ADDR 23  /* Set MAC address */
>> > > >
>> > > > +#define VIRTIO_NET_F_BACKUP  62/* Act as backup for another 
>> > > > device
>> > > > +* with the same MAC.
>> > > > +*/
>> > > >   #define VIRTIO_NET_F_SPEED_DUPLEX 63   /* Device set linkspeed and 
>> > > > duplex */
>> > > >
>> > > >   #ifndef VIRTIO_NET_NO_LEGACY
>> > > I'm not a huge fan of the name "backup" since that implies that the
>> > > Virtio interface is only used if the VF is not present, and there are
>> > > multiple instances such as dealing with east/west or
>> > > broadcast/multicast traffic where it may be desirable to use the
>> > > para-virtual interface rather than deal with PCI overhead/bottleneck
>> > > to send the packet.
>> > Right now hypervisors mostly expect that yes, only one at a time is
>> > used.  E.g. if you try to do multicast sending packets on both VF and
>> > virtio then you will end up with two copies of each packet.
>>
>> I think we want to use only 1 interface to  send out any packet. In case of
>> broadcast/multicasts it would be an optimization to send them via virtio and
>> this patch series adds that optimization.
>
> Right that's what I think we should rather avoid for now.
>
> It's *not* an optimization if there's a single VM on this host,
> or if a specific multicast group does not have any VMs on same
> host.

Agreed. In my mind this is something that is controlled by the
pass-thru interface once it is enslaved.

> I'd rather we just sent everything out on the PT if that's
> there. The reason we have virtio in the picture is just so
> we can migrate without downtime.

I wasn't saying we do that in all cases. That would be something that
would have to be decided by the pass-thru interface. Ideally the
virtio would provide just enough information to get itself into the
bond and I see this being the mechanism for it to do so. From there
the complexity mostly lies in the pass-thru interface to configure the
correct transmit modes if for example you have multiple pass-thru
interfaces or a more complex traffic setup due to things like
SwitchDev.

In my mind we go the bonding route and there are few use cases for all
of this. First is the backup case that is being addressed here. That
becomes your basic 

Re: [virtio-dev] [RFC PATCH net-next v2 1/2] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit

2018-01-17 Thread Alexander Duyck
On Thu, Jan 11, 2018 at 9:58 PM, Sridhar Samudrala
 wrote:
> This feature bit can be used by hypervisor to indicate virtio_net device to
> act as a backup for another device with the same MAC address.
>
> Signed-off-by: Sridhar Samudrala 
> ---
>  drivers/net/virtio_net.c| 2 +-
>  include/uapi/linux/virtio_net.h | 3 +++
>  2 files changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 12dfc5fee58e..f149a160a8c5 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -2829,7 +2829,7 @@ static struct virtio_device_id id_table[] = {
> VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
> VIRTIO_NET_F_CTRL_MAC_ADDR, \
> VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
> -   VIRTIO_NET_F_SPEED_DUPLEX
> +   VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_BACKUP
>
>  static unsigned int features[] = {
> VIRTNET_FEATURES,
> diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
> index 5de6ed37695b..c7c35fd1a5ed 100644
> --- a/include/uapi/linux/virtio_net.h
> +++ b/include/uapi/linux/virtio_net.h
> @@ -57,6 +57,9 @@
>  * Steering */
>  #define VIRTIO_NET_F_CTRL_MAC_ADDR 23  /* Set MAC address */
>
> +#define VIRTIO_NET_F_BACKUP  62/* Act as backup for another device
> +* with the same MAC.
> +*/
>  #define VIRTIO_NET_F_SPEED_DUPLEX 63   /* Device set linkspeed and duplex */
>
>  #ifndef VIRTIO_NET_NO_LEGACY

I'm not a huge fan of the name "backup" since that implies that the
Virtio interface is only used if the VF is not present, and there are
multiple instances such as dealing with east/west or
broadcast/multicast traffic where it may be desirable to use the
para-virtual interface rather than deal with PCI overhead/bottleneck
to send the packet.

What if instead of BACKUP we used the name SIDE_CHANNEL? Basically it
is a bit of double entendre as we are using the physical MAC address
to provide configuration information, and then in addition this
interface acts as a secondary channel for passing frames to and from
the guest rather than just using the VF.

Just a thought.

Thanks.

- Alex


Re: [PATCH net-next 0/2] Enable virtio to act as a master for a passthru device

2018-01-03 Thread Alexander Duyck
On Wed, Jan 3, 2018 at 10:14 AM, Samudrala, Sridhar
<sridhar.samudr...@intel.com> wrote:
>
>
> On 1/3/2018 8:59 AM, Alexander Duyck wrote:
>>
>> On Tue, Jan 2, 2018 at 6:16 PM, Jakub Kicinski <kubak...@wp.pl> wrote:
>>>
>>> On Tue,  2 Jan 2018 16:35:36 -0800, Sridhar Samudrala wrote:
>>>>
>>>> This patch series enables virtio to switch over to a VF datapath when a
>>>> VF
>>>> netdev is present with the same MAC address. It allows live migration of
>>>> a VM
>>>> with a direct attached VF without the need to setup a bond/team between
>>>> a
>>>> VF and virtio net device in the guest.
>>>>
>>>> The hypervisor needs to unplug the VF device from the guest on the
>>>> source
>>>> host and reset the MAC filter of the VF to initiate failover of datapath
>>>> to
>>>> virtio before starting the migration. After the migration is completed,
>>>> the
>>>> destination hypervisor sets the MAC filter on the VF and plugs it back
>>>> to
>>>> the guest to switch over to VF datapath.
>>>>
>>>> It is based on netvsc implementation and it may be possible to make this
>>>> code
>>>> generic and move it to a common location that can be shared by netvsc
>>>> and virtio.
>>>>
>>>> This patch series is based on the discussion initiated by Jesse on this
>>>> thread.
>> >> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>>
>>> How does the notion of a device which is both a bond and a leg of a
>>> bond fit with Alex's recent discussions about feature propagation?
>>> Which propagation rules will apply to VirtIO master?  Meaning of the
>>> flags on a software upper device may be different.  Why muddy the
>>> architecture like this and not introduce a synthetic bond device?
>>
>> It doesn't really fit with the notion I had. I think there may have
>> been a bit of a disconnect as I have been out for the last week or so
>> for the holidays.
>>
>> My thought on this was that the feature bit should be spawning a new
>> para-virtual bond device and that bond should have the virtio and the
>> VF as slaves. Also I thought there was some discussion about trying to
>> reuse as much of the netvsc code as possible for this so that we could
>> avoid duplication of effort and have the two drivers use the same
>> approach. It seems like it should be pretty straight forward since you
>> would have the feature bit in the case of virtio, and netvsc just does
>> this sort of thing by default if I am not mistaken.
>
> This patch is mostly based on netvsc implementation. The only change is
> avoiding the
> explicit dev_open() call of the VF netdev after a delay. I am assuming that
> the guest userspace
> will bring up the VF netdev and the hypervisor will update the MAC filters
> to switch to
> the right data path.
> We could commonize the code and make it shared between netvsc and virtio. Do
> we want
> to do this right away or later? If so, what would be a good location for
> these shared functions?
> Is it net/core/dev.c?

No, I would think about starting a new driver file in "/drivers/net/".
The idea is this driver would be utilized to create a bond
automatically and set the appropriate registration hooks. If nothing
else you could probably just call it something generic like virt-bond
or vbond or whatever.

> Also, if we want to go with a solution that creates a bond device, do we
> want virtio_net/netvsc
> drivers to create a upper device?  Such a solution is already possible via
> config scripts that can
> create a bond with virtio and a VF net device as slaves.  netvsc and this
> patch series is trying to
> make it as simple as possible for the VM to use directly attached devices
> and support live migration
> by switching to virtio datapath as a backup during the migration process
> when the VF device
> is unplugged.

We all understand that. But you are making the solution very virtio
specific. We want to see this be usable for other interfaces such as
netvsc and whatever other virtual interfaces are floating around out
there.

Also I haven't seen us address what happens as far as how we will
handle this on the host. My thought was we should have a paired
interface. Something like veth, but made up of a bond on each end. So
in the host we should have one bond that has a tap/vhost interface and
a VF port representor, and on the other we would be looking at the
virtio interface and the VF. Attaching the tap/vhost t

Re: [PATCH net-next 0/2] Enable virtio to act as a master for a passthru device

2018-01-03 Thread Alexander Duyck
On Tue, Jan 2, 2018 at 6:16 PM, Jakub Kicinski  wrote:
> On Tue,  2 Jan 2018 16:35:36 -0800, Sridhar Samudrala wrote:
>> This patch series enables virtio to switch over to a VF datapath when a VF
>> netdev is present with the same MAC address. It allows live migration of a VM
>> with a direct attached VF without the need to setup a bond/team between a
>> VF and virtio net device in the guest.
>>
>> The hypervisor needs to unplug the VF device from the guest on the source
>> host and reset the MAC filter of the VF to initiate failover of datapath to
>> virtio before starting the migration. After the migration is completed, the
>> destination hypervisor sets the MAC filter on the VF and plugs it back to
>> the guest to switch over to VF datapath.
>>
>> It is based on netvsc implementation and it may be possible to make this code
>> generic and move it to a common location that can be shared by netvsc and 
>> virtio.
>>
>> This patch series is based on the discussion initiated by Jesse on this 
>> thread.
>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>
> How does the notion of a device which is both a bond and a leg of a
> bond fit with Alex's recent discussions about feature propagation?
> Which propagation rules will apply to VirtIO master?  Meaning of the
> flags on a software upper device may be different.  Why muddy the
> architecture like this and not introduce a synthetic bond device?

It doesn't really fit with the notion I had. I think there may have
been a bit of a disconnect as I have been out for the last week or so
for the holidays.

My thought on this was that the feature bit should be spawning a new
para-virtual bond device and that bond should have the virtio and the
VF as slaves. Also I thought there was some discussion about trying to
reuse as much of the netvsc code as possible for this so that we could
avoid duplication of effort and have the two drivers use the same
approach. It seems like it should be pretty straight forward since you
would have the feature bit in the case of virtio, and netvsc just does
this sort of thing by default if I am not mistaken.

- Alex


Re: [RFC] virtio-net: help live migrate SR-IOV devices

2017-12-13 Thread Alexander Duyck
On Wed, Dec 6, 2017 at 11:28 PM, achiad shochat
 wrote:
> On 5 December 2017 at 21:20, Michael S. Tsirkin  wrote:
>> On Tue, Dec 05, 2017 at 11:59:17AM +0200, achiad shochat wrote:
>>> Then we'll have a single solution for both netvsc and virtio (and any
>>> other PV device).
>>> And we could handle the VF DMA dirt issue agnostically.
>>
>> For the record, I won't block patches adding this kist to virtio
>> on the basis that they must be generic. It's not a lot
>> of code, implementation can come first, prettify later.
>
> It's not a lot of code either way.
> So I fail to understand why not to do it right from the beginning.
> For the record...

What isn't a lot of code? If you are talking about the DMA dirtying
then I would have to disagree. The big problem with the DMA is that we
have to mark a page as dirty and non-migratable as soon as it is
mapped for Rx DMA. It isn't until the driver has either unmapped the
page or the device has been disabled that we can then allow the page
to be migrated for being dirty. That ends up being the way we have to
support this if we don't have the bonding solution.

With the bonding solution we could look at doing a lightweight DMA
dirtying which would just require flagging pages as dirty after an
unmap or sync call is performed. However it requires that we shut down
the driver/device before we can complete the migration which means we
have to have the paravirtualized fail-over approach.

As far as indicating that the interfaces are meant to be enslaved I
wonder if we couldn't look at tweaking the PCI layout of the guest and
use that to indicate that a given set of interfaces are meant to be
bonded. For example the VFs are all meant to work as a part of a
multi-function device. What if we were to make virtio-net function 0
of a PCI/PCIe device, and then place any direct assigned VFs that are
meant to be a part of the bond in functions 1-7 of the device? Then it
isn't too far off from the model we have on the host where if the VF
goes away we would expect to see the traffic on the PF that is usually
occupying function 0 of a given device.


Re: [RFC] virtio-net: help live migrate SR-IOV devices

2017-12-13 Thread Alexander Duyck
On Mon, Dec 4, 2017 at 1:51 AM, achiad shochat
 wrote:
> On 3 December 2017 at 19:35, Stephen Hemminger
>  wrote:
>> On Sun, 3 Dec 2017 11:14:37 +0200
>> achiad shochat  wrote:
>>
>>> On 3 December 2017 at 07:05, Michael S. Tsirkin  wrote:
>>> > On Fri, Dec 01, 2017 at 12:08:59PM -0800, Shannon Nelson wrote:
>>> >> On 11/30/2017 6:11 AM, Michael S. Tsirkin wrote:
>>> >> > On Thu, Nov 30, 2017 at 10:08:45AM +0200, achiad shochat wrote:
>>> >> > > Re. problem #2:
>>> >> > > Indeed the best way to address it seems to be to enslave the VF 
>>> >> > > driver
>>> >> > > netdev under a persistent anchor netdev.
>>> >> > > And it's indeed desired to allow (but not enforce) PV netdev and VF
>>> >> > > netdev to work in conjunction.
>>> >> > > And it's indeed desired that this enslavement logic work out-of-the 
>>> >> > > box.
>>> >> > > But in case of PV+VF some configurable policies must be in place (and
>>> >> > > they'd better be generic rather than differ per PV technology).
>>> >> > > For example - based on which characteristics should the PV+VF 
>>> >> > > coupling
>>> >> > > be done? netvsc uses MAC address, but that might not always be the
>>> >> > > desire.
>>> >> >
>>> >> > It's a policy but not guest userspace policy.
>>> >> >
>>> >> > The hypervisor certainly knows.
>>> >> >
>>> >> > Are you concerned that someone might want to create two devices with 
>>> >> > the
>>> >> > same MAC for an unrelated reason?  If so, hypervisor could easily set a
>>> >> > flag in the virtio device to say "this is a backup, use MAC to find
>>> >> > another device".
>>> >>
>>> >> This is something I was going to suggest: a flag or other configuration 
>>> >> on
>>> >> the virtio device to help control how this new feature is used.  I can
>>> >> imagine this might be useful to control from either the hypervisor side 
>>> >> or
>>> >> the VM side.
>>> >>
>>> >> The hypervisor might want to (1) disable it (force it off), (2) enable it
>>> >> for VM choice, or (3) force it on for the VM.  In case (2), the VM might 
>>> >> be
>>> >> able to chose whether it wants to make use of the feature, or stick with 
>>> >> the
>>> >> bonding solution.
>>> >>
>>> >> Either way, the kernel is making a feature available, and the user (VM or
>>> >> hypervisor) is able to control it by selecting the feature based on the
>>> >> policy desired.
>>> >>
>>> >> sln
>>> >
>>> > I'm not sure what's the feature that is available here.
>>> >
>>> > I saw this as a flag that says "this device shares backend with another
>>> > network device which can be found using MAC, and that backend should be
>>> > preferred".  kernel then forces configuration which uses that other
>>> > backend - as long as it exists.
>>> >
>>> > However, please Cc virtio-dev mailing list if we are doing this since
>>> > this is a spec extension.
>>> >
>>> > --
>>> > MST
>>>
>>>
>>> Can someone please explain why assume a virtio device is there at all??
>>> I specified a case where there isn't any.

Migrating without any virtual device is going to be extremely
challenging, especially in any kind of virtualization setup where the
hosts are not homogeneous. By providing a virtio interface you can
guarantee that at least 1 network interface is available on any given
host, and then fail over to that as the least common denominator for
any migration.

>>> I second Jacob - having a netdev of one device driver enslave a netdev
>>> of another device driver is an awkward a-symmetric model.
>>> Regardless of whether they share the same backend device.
>>> Only I am not sure the Linux Bond is the right choice.
>>> e.g one may well want to use the virtio device also when the
>>> pass-through device is available, e.g for multicasts, east-west
>>> traffic, etc.
>>> I'm not sure the Linux Bond fits that functionality.
>>> And, as I hear in this thread, it is hard to make it work out of the box.
>>> So I think the right thing would be to write a new dedicated module
>>> for this purpose.

This part I can sort of agree with. What if we were to look at
providing a way to somehow advertise that the two devices were meant
to be bonded for virtualization purposes? For now let's call it a
"virt-bond". Basically we could look at providing a means for virtio
and VF drivers to advertise that they want this sort of bond. Then it
would just be a matter of providing some sort of side channel to
indicate where you want things like multicast/broadcast/east-west
traffic to go.

>>> Re policy -
>>> Indeed the HV can request a policy from the guest but that's not a
>>> claim for the virtio device enslaving the pass-through device.
>>> Any policy can be queried by the upper enslaving device.
>>>
>>> Bottom line - I do not see a single reason to have the virtio netdev
>>> (nor netvsc or any other PV netdev) enslave another netdev by itself.
>>> If we'd do it right with netvsc from the beginning we wouldn't need
>>> this 

Re: [PATCH] virtio_ring: Update weak barriers to use dma_wmb/rmb

2015-04-08 Thread Alexander Duyck


On 04/08/2015 01:42 AM, Michael S. Tsirkin wrote:

On Tue, Apr 07, 2015 at 05:47:42PM -0700, Alexander Duyck wrote:

This change makes it so that instead of using smp_wmb/rmb which varies
depending on the kernel configuration we can use dma_wmb/rmb which for
most architectures should be equal to or slightly more strict than
smp_wmb/rmb.

The advantage to this is that these barriers are available to uniprocessor
builds as well so the performance should improve under such a
configuration.

Signed-off-by: Alexander Duyck alexander.h.du...@redhat.com

Well the generic implementation has:
#ifndef dma_rmb
#define dma_rmb()   rmb()
#endif

#ifndef dma_wmb
#define dma_wmb()   wmb()
#endif

So for these arches you are slightly speeding up UP but slightly hurting SMP -
I think we did benchmark the difference as measurable in the past.


The generic implementation for the smp_ barriers does the same thing 
when CONFIG_SMP is defined.  The only spot where there should be an 
appreciable difference between the two is on ARM where we define the 
dma_ barriers as being in the outer shareable domain, and for the smp_ 
barriers they are inner shareable domain.



Additionally, isn't this relying on undocumented behaviour?
The documentation says:
These are for use with consistent memory
and virtio does not bother to request consistent memory
allocations.


Consistent in this case represents memory that exists within one 
coherency domain.  So in the case of x86 for instance this represents 
writes only to system memory.  If you mix writes to system memory and 
device memory (PIO) then you should be using the full wmb/rmb to 
guarantee ordering between the two memories.



One wonders whether these will always be strong enough.


For the purposes of weak barriers they should be, and they are only 
slightly stronger than SMP in one case so odds are strength will not be 
the issue.  As far as speed I would suspect that the difference between 
inner and outer shareable domain should be negligible compared to the 
difference between a dsb() and a dmb().


- Alex


Re: [PATCH] virtio_ring: Update weak barriers to use dma_wmb/rmb

2015-04-08 Thread Alexander Duyck

On 04/08/2015 11:37 AM, Michael S. Tsirkin wrote:

On Wed, Apr 08, 2015 at 07:41:49AM -0700, Alexander Duyck wrote:

On 04/08/2015 01:42 AM, Michael S. Tsirkin wrote:

On Tue, Apr 07, 2015 at 05:47:42PM -0700, Alexander Duyck wrote:

This change makes it so that instead of using smp_wmb/rmb which varies
depending on the kernel configuration we can use dma_wmb/rmb which for
most architectures should be equal to or slightly more strict than
smp_wmb/rmb.

The advantage to this is that these barriers are available to uniprocessor
builds as well so the performance should improve under such a
configuration.

Signed-off-by: Alexander Duyck alexander.h.du...@redhat.com

Well the generic implementation has:
#ifndef dma_rmb
#define dma_rmb()   rmb()
#endif

#ifndef dma_wmb
#define dma_wmb()   wmb()
#endif

So for these arches you are slightly speeding up UP but slightly hurting SMP -
I think we did benchmark the difference as measurable in the past.

The generic implementation for the smp_ barriers does the same thing when
CONFIG_SMP is defined.  The only spot where there should be an appreciable
difference between the two is on ARM where we define the dma_ barriers as
being in the outer shareable domain, and for the smp_ barriers they are
inner shareable domain.


Additionally, isn't this relying on undocumented behaviour?
The documentation says:
These are for use with consistent memory
and virtio does not bother to request consistent memory
allocations.

Consistent in this case represents memory that exists within one coherency
domain.  So in the case of x86 for instance this represents writes only to
system memory.  If you mix writes to system memory and device memory (PIO)
then you should be using the full wmb/rmb to guarantee ordering between the
two memories.


One wonders whether these will always be strong enough.

For the purposes of weak barriers they should be, and they are only slightly
stronger than SMP in one case so odds are strength will not be the issue.
As far as speed I would suspect that the difference between inner and outer
shareable domain should be negligible compared to the difference between a
dsb() and a dmb().

- Alex

Maybe it's safe, and maybe there's no performance impact.  But what's
the purpose of the patch?  From the commit log, It sounds like it's an
optimization, but it's not an obvious win, and it's not accompanied by
any numbers.


The win would be that non-SMP should get the same performance from the 
barriers as SMP.  Based on the numbers for commit 7b21e34fd1c2 (virtio: 
harsher barriers for rpmsg.) it sounds like the gains could be pretty 
significant (TCP_RR test improved by 35% CPU, 14% throughput).  The idea 
is to get the same benefits in a uniprocessor environment.  If needed I 
can gather the data for x86 for SMP and non-SMP, however I had 
considered the patch to be low hanging fruit on that architecture since 
the smp_ and dma_ barriers are the same.


The performance numbers that I would like to collect but can't would be 
on ARM 7 or later as that is the only spot where the smp_ and dma_ 
barriers differ in any significant way, however I don't have an ARM 
platform that I could test this patch on to generate such data.


- Alex


[PATCH] virtio_ring: Update weak barriers to use dma_wmb/rmb

2015-04-07 Thread Alexander Duyck
This change makes it so that instead of using smp_wmb/rmb which varies
depending on the kernel configuration we can use dma_wmb/rmb which for
most architectures should be equal to or slightly more strict than
smp_wmb/rmb.

The advantage to this is that these barriers are available to uniprocessor
builds as well so the performance should improve under such a
configuration.

Signed-off-by: Alexander Duyck alexander.h.du...@redhat.com
---
 include/linux/virtio_ring.h |   23 ++++-------------------
 1 file changed, 4 insertions(+), 19 deletions(-)

diff --git a/include/linux/virtio_ring.h b/include/linux/virtio_ring.h
index 67e06fe18c03..8e50888a6d59 100644
--- a/include/linux/virtio_ring.h
+++ b/include/linux/virtio_ring.h
@@ -21,19 +21,20 @@
  * actually quite cheap.
  */
 
-#ifdef CONFIG_SMP
 static inline void virtio_mb(bool weak_barriers)
 {
+#ifdef CONFIG_SMP
if (weak_barriers)
smp_mb();
else
+#endif
mb();
 }
 
 static inline void virtio_rmb(bool weak_barriers)
 {
if (weak_barriers)
-   smp_rmb();
+   dma_rmb();
else
rmb();
 }
@@ -41,26 +42,10 @@ static inline void virtio_rmb(bool weak_barriers)
 static inline void virtio_wmb(bool weak_barriers)
 {
if (weak_barriers)
-   smp_wmb();
+   dma_wmb();
else
wmb();
 }
-#else
-static inline void virtio_mb(bool weak_barriers)
-{
-   mb();
-}
-
-static inline void virtio_rmb(bool weak_barriers)
-{
-   rmb();
-}
-
-static inline void virtio_wmb(bool weak_barriers)
-{
-   wmb();
-}
-#endif
 
 struct virtio_device;
 struct virtqueue;



[PATCH v4] x86/xen: Use __pa_symbol instead of __pa on C visible symbols

2012-11-19 Thread Alexander Duyck
This change updates a few of the functions to use __pa_symbol when
translating C visible symbols instead of __pa.  By using __pa_symbol we are
able to drop a few extra lines of code as we don't have to test to see if the
virtual pointer is a part of the kernel text or just standard virtual memory.

Cc: Konrad Rzeszutek Wilk konrad.w...@oracle.com
Signed-off-by: Alexander Duyck alexander.h.du...@intel.com
---

v4:  I have spun this patch off as a separate patch for v4 due to the fact that
 this patch doesn't apply cleanly to Linus's tree.  As such I am
 submitting it based off of the linux-next tree to be accepted in the Xen
 tree since this patch can actually exist on its own without the need
 for the other patches in the original __phys_addr performance series.

 arch/x86/xen/mmu.c |   21 +++++++++++----------
 1 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 4a05b39..a63e5f9 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1486,7 +1486,8 @@ static int xen_pgd_alloc(struct mm_struct *mm)
 
if (user_pgd != NULL) {
user_pgd[pgd_index(VSYSCALL_START)] =
-   __pgd(__pa(level3_user_vsyscall) | _PAGE_TABLE);
+   __pgd(__pa_symbol(level3_user_vsyscall) |
+ _PAGE_TABLE);
ret = 0;
}
 
@@ -1958,10 +1959,10 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 * pgd.
 */
if (xen_feature(XENFEAT_writable_page_tables)) {
-   native_write_cr3(__pa(init_level4_pgt));
+   native_write_cr3(__pa_symbol(init_level4_pgt));
} else {
xen_mc_batch();
-   __xen_write_cr3(true, __pa(init_level4_pgt));
+   __xen_write_cr3(true, __pa_symbol(init_level4_pgt));
xen_mc_issue(PARAVIRT_LAZY_CPU);
}
/* We can't that easily rip out L3 and L2, as the Xen pagetables are
@@ -1984,10 +1985,10 @@ static RESERVE_BRK_ARRAY(pmd_t, swapper_kernel_pmd, PTRS_PER_PMD);
 
 static void __init xen_write_cr3_init(unsigned long cr3)
 {
-   unsigned long pfn = PFN_DOWN(__pa(swapper_pg_dir));
+   unsigned long pfn = PFN_DOWN(__pa_symbol(swapper_pg_dir));
 
-   BUG_ON(read_cr3() != __pa(initial_page_table));
-   BUG_ON(cr3 != __pa(swapper_pg_dir));
+   BUG_ON(read_cr3() != __pa_symbol(initial_page_table));
+   BUG_ON(cr3 != __pa_symbol(swapper_pg_dir));
 
/*
 * We are switching to swapper_pg_dir for the first time (from
@@ -2011,7 +2012,7 @@ static void __init xen_write_cr3_init(unsigned long cr3)
pin_pagetable_pfn(MMUEXT_PIN_L3_TABLE, pfn);
 
pin_pagetable_pfn(MMUEXT_UNPIN_TABLE,
- PFN_DOWN(__pa(initial_page_table)));
+ PFN_DOWN(__pa_symbol(initial_page_table)));
set_page_prot(initial_page_table, PAGE_KERNEL);
set_page_prot(initial_kernel_pmd, PAGE_KERNEL);
 
@@ -2036,7 +2037,7 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
 
copy_page(initial_page_table, pgd);
initial_page_table[KERNEL_PGD_BOUNDARY] =
-   __pgd(__pa(initial_kernel_pmd) | _PAGE_PRESENT);
+   __pgd(__pa_symbol(initial_kernel_pmd) | _PAGE_PRESENT);
 
set_page_prot(initial_kernel_pmd, PAGE_KERNEL_RO);
set_page_prot(initial_page_table, PAGE_KERNEL_RO);
@@ -2045,8 +2046,8 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
pin_pagetable_pfn(MMUEXT_UNPIN_TABLE, PFN_DOWN(__pa(pgd)));
 
pin_pagetable_pfn(MMUEXT_PIN_L3_TABLE,
- PFN_DOWN(__pa(initial_page_table)));
-   xen_write_cr3(__pa(initial_page_table));
+ PFN_DOWN(__pa_symbol(initial_page_table)));
+   xen_write_cr3(__pa_symbol(initial_page_table));
 
memblock_reserve(__pa(xen_start_info->pt_base),
 xen_start_info->nr_pt_frames * PAGE_SIZE);
