Re: [PATCH net-next v2] net: vhost: improve performance when enable busyloop

2018-06-27 Thread Jason Wang



On 2018-06-27 23:58, Michael S. Tsirkin wrote:

On Wed, Jun 27, 2018 at 10:24:43PM +0800, Jason Wang wrote:


On 2018-06-26 13:17, xiangxia.m@gmail.com wrote:

From: Tonghao Zhang 

This patch improves the guest receive performance from the
host. On the handle_tx side, we poll the sock receive
queue at the same time; handle_rx does the same.

To avoid deadlock, change the code to lock the vqs one
by one and use VHOST_NET_VQ_XX as a subclass for
mutex_lock_nested. With the patch, qemu can set the
busyloop_timeout separately for the rx and tx queues.

We set poll-us=100us and use iperf3 to test
the throughput. The iperf3 command is shown below.

on the guest:
iperf3  -s -D

on the host:
iperf3  -c 192.168.1.100 -i 1 -P 10 -t 10 -M 1400

* With the patch: 23.1 Gbits/sec
* Without the patch:  12.7 Gbits/sec

Signed-off-by: Tonghao Zhang 

Thanks a lot for the patch. Looks good generally, but please split this big
patch into separate ones like:

patch 1: lock vqs one by one
patch 2: replace magic number of lock annotation
patch 3: factor out generic busy polling logic to vhost_net_busy_poll()
patch 4: add rx busy polling in tx path.
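
As a rough illustration of patches 1 and 2 (a sketch only, not the actual
patch; it assumes the VHOST_NET_VQ_TX/VHOST_NET_VQ_RX indices and the vqs[]
layout from drivers/vhost/net.c, and example_lock_vq() is a made-up name):

	/* Lock the vqs one at a time, using the queue index as the lockdep
	 * subclass instead of a magic number, so the tx/rx nesting is
	 * annotated for mutex_lock_nested(). */
	static void example_lock_vq(struct vhost_net *net, int qidx)
	{
		struct vhost_virtqueue *vq = &net->vqs[qidx].vq;

		mutex_lock_nested(&vq->mutex, qidx); /* VHOST_NET_VQ_TX or _RX */
		/* ... busy poll the peer queue's socket here ... */
		mutex_unlock(&vq->mutex);
	}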

And please cc Michael in v3.

Thanks

Pls include host CPU utilization numbers. You can get them e.g. using
vmstat. I suspect we also want the polling controllable e.g. through
an ioctl.



I believe we already have an ioctl for setting the timeout? Or do you want
another kind of control?


Thanks


Re: [PATCH v2 0/5] Add virtio-iommu driver

2018-06-27 Thread Michael S. Tsirkin
On Wed, Jun 27, 2018 at 07:04:46PM +0100, Jean-Philippe Brucker wrote:
> On 26/06/18 19:07, Michael S. Tsirkin wrote:
> > So as I pointed out new virtio 0 device isn't really welcome ;)
> 
> Agreed, virtio-iommu is expected to be implemented on virtio 1 and
> later. I'll remove the two legacy-related paragraphs from the spec and
> add a check in the driver as you suggested, to avoid giving the wrong idea.
> 
> > No one bothered implementing virtio 1 in MMIO for all the work
> > that was put in defining it.
> 
> That is curious, given that the virtio_mmio driver supports virtio 1 and
> from my own experience, porting the MMIO transport to virtio 1 only
> requires updating a few registers, when ANY_LAYOUT, modern virtqueue and
> status are already implemented.
> 
> > The fact that the MMIO part of the
> > spec doesn't seem to allow for transitional devices might
> > or might not have something to do with this.
> 
> Sorry, which part do you have in mind? The spec does provide both a
> legacy and modern register layout, with version numbers to differentiate
> them.

Yes but there's no way to implement a transitional virtio mmio
device. The version is either "legacy" or "modern".

So if you implement a modern device old guests won't work with it.
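
For concreteness, a minimal sketch of why that is, assuming the usual
virtio-mmio register definitions (the helper name is made up): the guest
driver reads a single version register and then commits to either the legacy
or the modern programming model, so one device cannot offer both.

	/* VIRTIO_MMIO_VERSION (offset 0x004) reads 1 on a legacy device and
	 * 2 on a virtio 1 device; no value advertises both layouts. */
	static bool example_vm_is_modern(void __iomem *base)
	{
		return readl(base + VIRTIO_MMIO_VERSION) == 2;
	}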

> > So why make it an MMIO device at all? A system with an IOMMU
> > will have a PCI bus, won't it? And it's pretty common to
> > have the IOMMU be a PCI device on the root bus.
> Having the IOMMU outside PCI seems more common though. On Arm and Intel
> the IOMMU doesn't have a PCI config space, BARs or capabilities, just a
> plain MMIO region and a static number of interrupts. However the AMD
> IOMMU appears as a PCI device with additional MMIO registers. I would be
> interested in implementing virtio-iommu as a PCI dev at some point, at
> least so that we can use MSI-X.
> 
> The problem is that it requires invasive changes in the firmware
> interfaces and drivers. They need to describe relationship between IOMMU
> and endpoint, and DT or ACPI IORT don't expect the programming interface
> to be inside the PCI bus that the IOMMU manages.

They don't particularly care IMHO.

> Describing it as a
> peripheral is more natural. For AMD it is implemented in their driver
> using IVHD tables that can't be reused. Right now I don't expect any
> success in proposing changes to firmware interfaces or drivers, because
> the device is new and paravirtualized, and works out of the box with
> MMIO. Hopefully that will change in the future, perhaps when someone
> supports DT for AMD IOMMU (they do describe bindings at the end of the
> spec, but I don't think it can work in Linux with the existing
> infrastructure)

That's a bit abstract, I don't really understand the issues involved.
ACPI is formatted by QEMU, so firmware does not need to care.
And is there even a DT for intel?

> Another reason to keep the MMIO transport option is that one
> virtio-iommu can manage DMA from endpoints on multiple PCI domains at
> the same time, as well as platform devices. Some VMMs might want that,
> in which case the IOMMU would be a separate platform device.

Which buses are managed by the IOMMU is separate from the bus
on which its programming interface resides.

> > Will remove need to bother with dt bindings etc.
> That's handled by the firmware drivers and IOMMU layer, there shouldn't
> be any special treatment at the virtio layer. In general removal of an
> IOMMU needs to be done after removal of all endpoints connected to it,
> which should be described using device_link from the driver core. DT or
> ACPI is only used to tell drivers where to find resources, and to
> describe the DMA topology.
> 
> Thanks,
> Jean

My point was virtio mmio isn't widely used outside of ARM.
Rather than try to make everyone use it, IMHO it's better
to start with PCI.

-- 
MST


Re: [PATCH v33 1/4] mm: add a function to get free page blocks

2018-06-27 Thread Michael S. Tsirkin
On Wed, Jun 27, 2018 at 09:05:39AM -0700, Linus Torvalds wrote:
> [ Sorry for slow reply, my travels have made a mess of my inbox ]
> 
> On Mon, Jun 25, 2018 at 6:55 PM Michael S. Tsirkin  wrote:
> >
> > Linus, do you think it would be ok to have get_from_free_page_list
> > actually pop entries from the free list and use them as the buffer
> > to store PAs?
> 
> Honestly, what I think the best option would be is to get rid of this
> interface *entirely*, and just have the balloon code do
> 
> #define GFP_MINFLAGS (__GFP_NORETRY | __GFP_NOWARN |
> __GFP_THISNODE | __GFP_NOMEMALLOC)
> 
> struct page *page =  alloc_pages(GFP_MINFLAGS, MAX_ORDER-1);
> 
>  which is not a new interface, and simply removes the max-order page
> from the list if at all possible.
> 
> The above has the advantage of "just working", and not having any races.
> 
> Now, because you don't want to necessarily *entirely* deplete the max
> order, I'd suggest that the *one* new interface you add is just a "how
> many max-order pages are there" interface. So then you can query
> (either before or after getting the max-order page) just how many of
> them there were and whether you want to give that page back.
> 
> Notice? No need for any page lists or physical addresses. No races. No
> complex new functions.
> 
> The physical address you can just get from the "struct page" you got.
> 
> And if you run out of memory because of getting a page, you get all
> the usual "hey, we ran out of memory" responses..
> 
> Wouldn't the above be sufficient?
> 
> Linus

I think so, thanks!

Wei, to put it in balloon terms, I think there's one thing we missed: if
you do manage to allocate a page, and you don't have a use for it, then
hey, you can just give it to the host because you know it's free - you
are going to return it to the free list.
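
A hedged sketch of that loop, for illustration only: report_free_page_hint()
is a made-up placeholder for however the balloon driver would hand the
address to the host, and the "how many max-order pages are there" query that
Linus suggests is left out for brevity.

	#define GFP_MINFLAGS (__GFP_NORETRY | __GFP_NOWARN | \
			      __GFP_THISNODE | __GFP_NOMEMALLOC)

	static void example_hint_free_pages(void)
	{
		LIST_HEAD(pages);
		struct page *page, *next;

		/* Pull MAX_ORDER-1 pages off the free list; with GFP_NORETRY
		 * the allocation simply fails once no such pages are left. */
		while ((page = alloc_pages(GFP_MINFLAGS, MAX_ORDER - 1))) {
			/* The page is known to be free, so its physical address
			 * can be reported straight from the struct page. */
			report_free_page_hint(page_to_phys(page), MAX_ORDER - 1);
			list_add(&page->lru, &pages);
		}

		/* The pages were only borrowed as hints; give them all back. */
		list_for_each_entry_safe(page, next, &pages, lru) {
			list_del(&page->lru);
			__free_pages(page, MAX_ORDER - 1);
		}
	}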

-- 
MST


Re: [PATCH v2 0/5] Add virtio-iommu driver

2018-06-27 Thread Jean-Philippe Brucker
On 26/06/18 19:07, Michael S. Tsirkin wrote:
> So as I pointed out new virtio 0 device isn't really welcome ;)

Agreed, virtio-iommu is expected to be implemented on virtio 1 and
later. I'll remove the two legacy-related paragraphs from the spec and
add a check in the driver as you suggested, to avoid giving the wrong idea.

> No one bothered implementing virtio 1 in MMIO for all the work
> that was put in defining it.

That is curious, given that the virtio_mmio driver supports virtio 1 and
from my own experience, porting the MMIO transport to virtio 1 only
requires updating a few registers, when ANY_LAYOUT, modern virtqueue and
status are already implemented.

> The fact that the MMIO part of the
> spec doesn't seem to allow for transitional devices might
> or might not have something to do with this.

Sorry, which part do you have in mind? The spec does provide both a
legacy and modern register layout, with version numbers to differentiate
them.

> So why make it an MMIO device at all? A system with an IOMMU
> will have a PCI bus, won't it? And it's pretty common to
> have the IOMMU be a PCI device on the root bus.
Having the IOMMU outside PCI seems more common though. On Arm and Intel
the IOMMU doesn't have a PCI config space, BARs or capabilities, just a
plain MMIO region and a static number of interrupts. However the AMD
IOMMU appears as a PCI device with additional MMIO registers. I would be
interested in implementing virtio-iommu as a PCI dev at some point, at
least so that we can use MSI-X.

The problem is that it requires invasive changes in the firmware
interfaces and drivers. They need to describe relationship between IOMMU
and endpoint, and DT or ACPI IORT don't expect the programming interface
to be inside the PCI bus that the IOMMU manages. Describing it as a
peripheral is more natural. For AMD it is implemented in their driver
using IVHD tables that can't be reused. Right now I don't expect any
success in proposing changes to firmware interfaces or drivers, because
the device is new and paravirtualized, and works out of the box with
MMIO. Hopefully that will change in the future, perhaps when someone
supports DT for AMD IOMMU (they do describe bindings at the end of the
spec, but I don't think it can work in Linux with the existing
infrastructure)

Another reason to keep the MMIO transport option is that one
virtio-iommu can manage DMA from endpoints on multiple PCI domains at
the same time, as well as platform devices. Some VMMs might want that,
in which case the IOMMU would be a separate platform device.

> Will remove need to bother with dt bindings etc.
That's handled by the firmware drivers and IOMMU layer, there shouldn't
be any special treatment at the virtio layer. In general removal of an
IOMMU needs to be done after removal of all endpoints connected to it,
which should be described using device_link from the driver core. DT or
ACPI is only used to tell drivers where to find resources, and to
describe the DMA topology.

Thanks,
Jean


Re: [PATCH v2 1/5] dt-bindings: virtio: Specify #iommu-cells value for a virtio-iommu

2018-06-27 Thread Rob Herring
On Tue, Jun 26, 2018 at 11:59 AM Jean-Philippe Brucker
 wrote:
>
> On 25/06/18 20:27, Rob Herring wrote:
> > On Thu, Jun 21, 2018 at 08:06:51PM +0100, Jean-Philippe Brucker wrote:
> >> A virtio-mmio node may represent a virtio-iommu device. This is discovered
> >> by the virtio driver at probe time, but the DMA topology isn't
> >> discoverable and must be described by firmware. For DT the standard IOMMU
> >> description is used, as specified in bindings/iommu/iommu.txt and
> >> bindings/pci/pci-iommu.txt. Like many other IOMMUs, virtio-iommu
> >> distinguishes masters by their endpoint IDs, which requires one IOMMU cell
> >> in the "iommus" property.
> >>
> >> Signed-off-by: Jean-Philippe Brucker 
> >> ---
> >>  Documentation/devicetree/bindings/virtio/mmio.txt | 8 
> >>  1 file changed, 8 insertions(+)
> >>
> >> diff --git a/Documentation/devicetree/bindings/virtio/mmio.txt b/Documentation/devicetree/bindings/virtio/mmio.txt
> >> index 5069c1b8e193..337da0e3a87f 100644
> >> --- a/Documentation/devicetree/bindings/virtio/mmio.txt
> >> +++ b/Documentation/devicetree/bindings/virtio/mmio.txt
> >> @@ -8,6 +8,14 @@ Required properties:
> >>  - reg:  control registers base address and size including configuration space
> >>  - interrupts:   interrupt generated by the device
> >>
> >> +Required properties for virtio-iommu:
> >> +
> >> +- #iommu-cells: When the node describes a virtio-iommu device, it is
> >> +linked to DMA masters using the "iommus" property as
> >> +described in devicetree/bindings/iommu/iommu.txt. For
> >> +virtio-iommu #iommu-cells must be 1, each cell describing
> >> +a single endpoint ID.
> >
> > The iommus property should also be documented for the client side.
>
> Isn't section "IOMMU master node" of iommu.txt sufficient? Since the
> iommus property applies to any DMA master, not only virtio-mmio devices,
> the canonical description in iommu.txt seems the best place for it, and
> I'm not sure what to add in this file. Maybe a short example below the
> virtio_block one?

No, because somewhere we have to capture if 'iommus' is valid for
'virtio-mmio' or not. Hopefully soon we'll actually be able to
validate that.

Rob


Re: [virtio-dev] Re: [PATCH v34 2/4] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_HINT

2018-06-27 Thread Michael S. Tsirkin
On Wed, Jun 27, 2018 at 01:27:55PM +0800, Wei Wang wrote:
> On 06/27/2018 11:58 AM, Michael S. Tsirkin wrote:
> > On Wed, Jun 27, 2018 at 11:00:05AM +0800, Wei Wang wrote:
> > > On 06/27/2018 10:41 AM, Michael S. Tsirkin wrote:
> > > > On Wed, Jun 27, 2018 at 09:24:18AM +0800, Wei Wang wrote:
> > > > > On 06/26/2018 09:34 PM, Michael S. Tsirkin wrote:
> > > > > > On Tue, Jun 26, 2018 at 08:27:44PM +0800, Wei Wang wrote:
> > > > > > > On 06/26/2018 11:56 AM, Michael S. Tsirkin wrote:
> > > > > > > > On Tue, Jun 26, 2018 at 11:46:35AM +0800, Wei Wang wrote:
> > > > > > > > 
> > > > > > > > > > > + if (!arrays)
> > > > > > > > > > > + return NULL;
> > > > > > > > > > > +
> > > > > > > > > > > + for (i = 0; i < max_array_num; i++) {
> > > > > > > > > > So we are getting a ton of memory here just to free it up a 
> > > > > > > > > > bit later.
> > > > > > > > > > Why doesn't get_from_free_page_list get the pages from free 
> > > > > > > > > > list for us?
> > > > > > > > > > We could also avoid the 1st allocation then - just build a 
> > > > > > > > > > list
> > > > > > > > > > of these.
> > > > > > > > > That wouldn't be a good choice for us. If we check how the 
> > > > > > > > > regular
> > > > > > > > > allocation works, there are many many things we need to 
> > > > > > > > > consider when pages
> > > > > > > > > are allocated to users.
> > > > > > > > > For example, we need to take care of the nr_free
> > > > > > > > > counter, we need to check the watermark and perform the 
> > > > > > > > > related actions.
> > > > > > > > > Also the folks working on arch_alloc_page to monitor page 
> > > > > > > > > allocation
> > > > > > > > > activities would get a surprise..if page allocation is 
> > > > > > > > > allowed to work in
> > > > > > > > > this way.
> > > > > > > > > 
> > > > > > > > mm/ code is well positioned to handle all this correctly.
> > > > > > > I'm afraid that would be a re-implementation of the alloc 
> > > > > > > functions,
> > > > > > A re-factoring - you can share code. The main difference is locking.
> > > > > > 
> > > > > > > and
> > > > > > > that would be much more complex than what we have. I think your 
> > > > > > > idea of
> > > > > > > passing a list of pages is better.
> > > > > > > 
> > > > > > > Best,
> > > > > > > Wei
> > > > > > How much memory is this allocating anyway?
> > > > > > 
> > > > > For every 2TB memory that the guest has, we allocate 4MB.
> > > > Hmm I guess I'm missing something, I don't see it:
> > > > 
> > > > 
> > > > +   max_entries = max_free_page_blocks(ARRAY_ALLOC_ORDER);
> > > > +   entries_per_page = PAGE_SIZE / sizeof(__le64);
> > > > +   entries_per_array = entries_per_page * (1 << ARRAY_ALLOC_ORDER);
> > > > +   max_array_num = max_entries / entries_per_array +
> > > > +   !!(max_entries % entries_per_array);
> > > > 
> > > > Looks like you always allocate the max number?
> > > Yes. We allocated the max number and then free what's not used.
> > > For example, a 16TB guest, we allocate Four 4MB buffers and pass the 4
> > > buffers to get_from_free_page_list. If it uses 3, then the remaining 1 
> > > "4MB
> > > buffer" will end up being freed.
> > > 
> > > For today's guests, max_array_num is usually 1.
> > > 
> > > Best,
> > > Wei
> > I see, it's based on total ram pages. It's reasonable but might
> > get out of sync if memory is onlined quickly. So you want to
> > detect that there's more free memory than can fit and
> > retry the reporting.
> > 
> 
> 
> - AFAIK, memory hotplug isn't expected to happen during live migration
> today. Hypervisors (e.g. QEMU) explicitly forbid this.

That's a temporary limitation.

> - Allocating buffers based on total ram pages already gives some headroom
> for newly plugged memory if that could happen in any case. Also, we can
> think about why people plug in more memory - usually because the existing
> memory isn't enough, which implies that the free page list is very likely to
> be close to empty.

Or maybe because guest is expected to use more memory.

> - This method could be easily scaled if people really need more headroom for
> hot-plugged memory. For example, calculation based on "X * total_ram_pages",
> X could be a number passed from the hypervisor.

All this in place of a simple retry loop within guest?
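
The retry loop could be as small as the sketch below; every name in it is
hypothetical and it only illustrates the shape of the idea:

	/* Re-size the hint arrays from the current total RAM and redo the
	 * report whenever the previous pass could not fit all free blocks. */
	bool overflow;

	do {
		unsigned long blocks = example_max_free_page_blocks();

		overflow = example_report_free_page_hints(blocks);
	} while (overflow);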

> - This is an optimization feature, and reporting less free memory in that
> rare case doesn't hurt anything.

People working on memory hotplug can't be expected to worry about
balloon. And maintainers have other things to do than debug hard to
trigger failure reports from the field.

> 
> So I think it is good to start from a fundamental implementation, which
> doesn't confuse people, and complexities can be added when there is a real
> need in the future.
> 
> Best,
> Wei

The usefulness of the whole patchset hasn't been proven in the field yet.
The more uncovered corner cases there are, the higher the chance that
it will turn out not to be useful after all.

> 

Re: [PATCH v33 1/4] mm: add a function to get free page blocks

2018-06-27 Thread Linus Torvalds
[ Sorry for slow reply, my travels have made a mess of my inbox ]

On Mon, Jun 25, 2018 at 6:55 PM Michael S. Tsirkin  wrote:
>
> Linus, do you think it would be ok to have get_from_free_page_list
> actually pop entries from the free list and use them as the buffer
> to store PAs?

Honestly, what I think the best option would be is to get rid of this
interface *entirely*, and just have the balloon code do

#define GFP_MINFLAGS (__GFP_NORETRY | __GFP_NOWARN |
__GFP_THISNODE | __GFP_NOMEMALLOC)

struct page *page =  alloc_pages(GFP_MINFLAGS, MAX_ORDER-1);

 which is not a new interface, and simply removes the max-order page
from the list if at all possible.

The above has the advantage of "just working", and not having any races.

Now, because you don't want to necessarily *entirely* deplete the max
order, I'd suggest that the *one* new interface you add is just a "how
many max-order pages are there" interface. So then you can query
(either before or after getting the max-order page) just how many of
them there were and whether you want to give that page back.

Notice? No need for any page lists or physical addresses. No races. No
complex new functions.

The physical address you can just get from the "struct page" you got.

And if you run out of memory because of getting a page, you get all
the usual "hey, we ran out of memory" responses..

Wouldn't the above be sufficient?

Linus


Re: [PATCH net-next v2] net: vhost: improve performance when enable busyloop

2018-06-27 Thread Michael S. Tsirkin
On Wed, Jun 27, 2018 at 10:24:43PM +0800, Jason Wang wrote:
> 
> 
> On 2018-06-26 13:17, xiangxia.m@gmail.com wrote:
> > From: Tonghao Zhang 
> > 
> > This patch improves the guest receive performance from the
> > host. On the handle_tx side, we poll the sock receive
> > queue at the same time; handle_rx does the same.
> > 
> > To avoid deadlock, change the code to lock the vqs one
> > by one and use VHOST_NET_VQ_XX as a subclass for
> > mutex_lock_nested. With the patch, qemu can set the
> > busyloop_timeout separately for the rx and tx queues.
> > 
> > We set poll-us=100us and use iperf3 to test
> > the throughput. The iperf3 command is shown below.
> > 
> > on the guest:
> > iperf3  -s -D
> > 
> > on the host:
> > iperf3  -c 192.168.1.100 -i 1 -P 10 -t 10 -M 1400
> > 
> > * With the patch: 23.1 Gbits/sec
> > * Without the patch:  12.7 Gbits/sec
> > 
> > Signed-off-by: Tonghao Zhang 
> 
> Thanks a lot for the patch. Looks good generally, but please split this big
> patch into separate ones like:
> 
> patch 1: lock vqs one by one
> patch 2: replace magic number of lock annotation
> patch 3: factor out generic busy polling logic to vhost_net_busy_poll()
> patch 4: add rx busy polling in tx path.
> 
> And please cc Michael in v3.
> 
> Thanks

Pls include host CPU utilization numbers. You can get them e.g. using
vmstat. I suspect we also want the polling controllable e.g. through
an ioctl.

-- 
MST

Re: [PATCH net-next v2] net: vhost: improve performance when enable busyloop

2018-06-27 Thread Jason Wang



On 2018-06-26 13:17, xiangxia.m@gmail.com wrote:

From: Tonghao Zhang 

This patch improves the guest receive performance from the
host. On the handle_tx side, we poll the sock receive
queue at the same time; handle_rx does the same.

To avoid deadlock, change the code to lock the vqs one
by one and use VHOST_NET_VQ_XX as a subclass for
mutex_lock_nested. With the patch, qemu can set the
busyloop_timeout separately for the rx and tx queues.

We set poll-us=100us and use iperf3 to test
the throughput. The iperf3 command is shown below.

on the guest:
iperf3  -s -D

on the host:
iperf3  -c 192.168.1.100 -i 1 -P 10 -t 10 -M 1400

* With the patch: 23.1 Gbits/sec
* Without the patch:  12.7 Gbits/sec

Signed-off-by: Tonghao Zhang 


Thanks a lot for the patch. Looks good generally, but please split this 
big patch into separate ones like:


patch 1: lock vqs one by one
patch 2: replace magic number of lock annotation
patch 3: factor out generic busy polling logic to vhost_net_busy_poll()
patch 4: add rx busy polling in tx path.

And please cc Michael in v3.

Thanks

Re: [PATCH v34 0/4] Virtio-balloon: support free page reporting

2018-06-27 Thread David Hildenbrand
On 25.06.2018 14:05, Wei Wang wrote:
> This patch series is separated from the previous "Virtio-balloon
> Enhancement" series. The new feature, VIRTIO_BALLOON_F_FREE_PAGE_HINT,  
> implemented by this series enables the virtio-balloon driver to report
> hints of guest free pages to the host. It can be used to accelerate live
> migration of VMs. Here is an introduction of this usage:
> 
> Live migration needs to transfer the VM's memory from the source machine
> to the destination round by round. For the 1st round, all the VM's memory
> is transferred. From the 2nd round, only the pieces of memory that were
> written by the guest (after the 1st round) are transferred. One method
> that is popularly used by the hypervisor to track which part of memory is
> written is to write-protect all the guest memory.
> 
> This feature enables the optimization by skipping the transfer of guest
> free pages during VM live migration. It is not concerned that the memory
> pages are used after they are given to the hypervisor as a hint of the
> free pages, because they will be tracked by the hypervisor and transferred
> in the subsequent round if they are used and written.
> 
> * Tests
> - Test Environment
> Host: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
> Guest: 8G RAM, 4 vCPU
> Migration setup: migrate_set_speed 100G, migrate_set_downtime 2 second
> 
> - Test Results
> - Idle Guest Live Migration Time (results are averaged over 10 runs):
> - Optimization v.s. Legacy = 284ms vs 1757ms --> ~84% reduction
> - Guest with Linux Compilation Workload (make bzImage -j4):
> - Live Migration Time (average)
>   Optimization v.s. Legacy = 1402ms v.s. 2528ms --> ~44% reduction
> - Linux Compilation Time
>   Optimization v.s. Legacy = 5min6s v.s. 5min12s
>   --> no obvious difference
> 

Being in version 34 already, this whole thing still looks and feels like
a big hack to me. It might just be me, but especially if I read about
assumptions like "QEMU will not hotplug memory during migration". This
does not feel like a clean solution.

I am still not sure if we really need this interface, especially as real
free page hinting might be on its way.

a) we perform free page hinting by setting all free pages
(arch_free_page()) to zero. Migration will detect zero pages and
minimize #pages to migrate. I don't think this is a good idea but Michel
suggested to do a performance evaluation and Nitesh is looking into that
right now.

b) we perform free page hinting using something that Nitesh proposed. We
get in QEMU blocks of free pages that we can MADV_FREE. In addition we
could e.g. clear the dirty bit of these pages in the dirty bitmap, to
hinder them from getting migrated. Right now the hinting mechanism is
synchronous (called from arch_free_page()) but we might be able to
convert it into something asynchronous.
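
As a toy illustration of option (a) above (a sketch under that assumption
only, not a proposal): the existing arch_free_page() hook could zero pages as
they are freed, so that the migration code's zero-page detection skips them.

	/* Requires HAVE_ARCH_FREE_PAGE; clears every page in the freed block. */
	static inline void arch_free_page(struct page *page, int order)
	{
		int i;

		for (i = 0; i < (1 << order); i++)
			clear_highpage(page + i);
	}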

So we might be able to completely get rid of this interface. And looking
at all the discussions and problems that already happened during the
development of this series, I think we should rather look into how clean
free page hinting might solve the same problem.

If it can't be solved using free page hinting, fair enough.


> ChangeLog:
> v33->v34:
> - mm:
> - add a new API max_free_page_blocks, which estimates the max
>   number of free page blocks that a free page list may have
> - get_from_free_page_list: store addresses to multiple arrays,
>   instead of just one array. This removes the limitation of being
>   able to report only 2TB free memory (the largest array memory
>   that can be allocated on x86 is 4MB, which can store 2^19
>   addresses of 4MB free page blocks).
> - virtio-balloon:
> - Allocate multiple arrays to load free page hints;
> - Use the same method in v32 to do guest/host interaction, the
>   differences are
>   - the hints are transferred array by array, instead of
> one by one.
> - send the free page block size of a hint along with the cmd
> id to host, so that host knows each address represents e.g.
> a 4MB memory in our case. 
> v32->v33:
> - mm/get_from_free_page_list: The new implementation to get free page
>   hints based on the suggestions from Linus:
>   https://lkml.org/lkml/2018/6/11/764
>   This avoids the complex call chain, and looks more prudent.
> - virtio-balloon: 
>   - use a fixed-size buffer to get free page hints;
>   - remove the cmd id related interface. Now host can just send a free
> page hint command to the guest (via the host_cmd config register)
> to start the reporting. Currently the guest reports only the max
> order free page hints to host, which has generated similar good
> results as before. But the interface used by virtio-balloon to
> report can support reporting more orders in the future 

Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-27 Thread Cornelia Huck
On Sat, 23 Jun 2018 00:43:24 +0300
"Michael S. Tsirkin"  wrote:

> On Fri, Jun 22, 2018 at 05:09:55PM +0200, Cornelia Huck wrote:
> > Would it be more helpful to focus on generic
> > migration support for vfio instead of going about it device by device?  
> 
> Just to note this approach is actually device by device *type*.  It's
> mostly device agnostic for a given device type so you can migrate
> between hosts with very different hardware.

This enables heterogeneous environments, yes.

But one drawback of that is that you cannot exploit any hardware
specialities - it seems you're limited to what the paravirtual device
supports. This is limiting for more homogeneous environments.

> 
> And support for more PV device types has other advantages
> such as security and forward compatibility to future hosts.

But again the drawback is that we can't exploit new capabilities
easily, can we?

> 
> Finally, it all can happen mostly within QEMU. User is currently
> required to enable it but it's pretty lightweight.
> 
> OTOH vfio migration generally requires actual device-specific work, and
> only works when hosts are mostly identical. When they aren't it's easy
> to blame the user, but tools for checking host compatiblity are
> currently non-existent. Upper layer management will also have to learn
> about host and device compatibility wrt migration. At the moment they
> can't even figure it out wrt software versions of vhost in kernel and
> dpdk so I won't hold my breath for all of this happening quickly.

Yes, that's a real problem.

I think one issue here is that we want to support really different
environments. For the case here, we have a lot of different networking
adapters, but the guests are basically interested in one thing: doing
network traffic. On the other hand, I'm thinking of the mainframe
environment, where we have a very limited set of devices to support,
but at the same time want to exploit their specialities, so the pv
approach is limiting. For that use case, generic migration looks more
useful.


Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-27 Thread Cornelia Huck
On Tue, 26 Jun 2018 20:50:20 +0300
"Michael S. Tsirkin"  wrote:

> On Tue, Jun 26, 2018 at 05:08:13PM +0200, Cornelia Huck wrote:
> > On Fri, 22 Jun 2018 17:05:04 -0700
> > Siwei Liu  wrote:
> >   
> > > On Fri, Jun 22, 2018 at 3:33 PM, Michael S. Tsirkin  
> > > wrote:  
> > > > I suspect the diveregence will be lost on most users though
> > > > simply because they don't even care about vfio. They just
> > > > want things to go fast.
> > > 
> > > Like Jason said, VF isn't faster than virtio-net in all cases. It
> > > depends on the workload and performance metrics: throughput, latency,
> > > or packet per second.  
> > 
> > So, will it be guest/admin-controllable then where the traffic flows
> > through? Just because we do have a vf available after negotiation of
> > the feature bit, it does not necessarily mean we want to use it? Do we
> > (the guest) even want to make it visible in that case?  
> 
> I think these ideas belong to what Alex Duyck wanted to do:
> some kind of advanced device that isn't tied to
> any network interfaces and allows workload and performance
> specific tuning.
> 
> Way out of scope for a simple failover, and more importantly,
> no one is looking at even enumerating the problems involved,
> much less solving them.

So, for simplicity's sake, we need to rely on the host admin
configuring the vm for its guest's intended use case. Sounds fair, but
probably needs a note somewhere.


Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-27 Thread Siwei Liu
On Tue, Jun 26, 2018 at 11:49 PM, Samudrala, Sridhar
 wrote:
> On 6/26/2018 11:21 PM, Siwei Liu wrote:
>>
>> On Tue, Jun 26, 2018 at 5:29 PM, Michael S. Tsirkin 
>> wrote:
>>>
>>> On Tue, Jun 26, 2018 at 04:38:26PM -0700, Siwei Liu wrote:

 On Mon, Jun 25, 2018 at 6:50 PM, Michael S. Tsirkin 
 wrote:
>
> On Mon, Jun 25, 2018 at 10:54:09AM -0700, Samudrala, Sridhar wrote:
>>
>> Might not necessarily be something wrong, but it's very limited to
>> prohibit the MAC of VF from changing when enslaved by failover.
>
> You mean guest changing MAC? I'm not sure why we prohibit that.

 I think Sridhar and Jiri might be better person to answer it. My
 impression was that sync'ing the MAC address change between all 3
 devices is challenging, as the failover driver uses MAC address to
 match net_device internally.
>>
>> Yes. The MAC address is assigned by the hypervisor and it needs to manage
>> the movement of the MAC between the PF and VF.  Allowing the guest to change
>> the MAC will require synchronization between the hypervisor and the PF/VF
>> drivers. Most of the VF drivers don't allow changing guest MAC unless it is
>> a trusted VF.
>
> OK but it's a policy thing. Maybe it's a trusted VF. Who knows?
> For example I can see host just
> failing VIRTIO_NET_CTRL_MAC_ADDR_SET if it wants to block it.
> I'm not sure why VIRTIO_NET_F_STANDBY has to block it in the guest.

 That's why I think pairing using MAC is fragile IMHO. When VF's MAC
 got changed before virtio attempts to match and pair the device, it
 ends up with no pairing found out at all.
>>>
>>> Guest seems to match on the hardware mac and ignore whatever
>>> is set by user. Makes sense to me and should not be fragile.
>>
>> Host can change the hardware mac for VF any time.
>
>
> Live migration is initiated and controlled by the Host, so the Source Host
> will reset the MAC during live migration after unplugging the VF. This is to
> redirect the VM's frames towards the PF so that they can be received via the
> virtio-net standby interface. The destination host will set the VF MAC and
> plug the VF after live migration is completed.
>
> Allowing the guest to change the MAC will require the qemu/libvirt/mgmt
> layers to track the MAC changes and replay that change after live migration.
>
If the failover's MAC is kept in sync with the VF's MAC address changes,
the VF on the destination host can be paired using the permanent address
after plugging in, while the failover interface will resync the MAC to the
current one in use when enslaving the VF. I think something similar is done
for the multicast and unicast address lists on VF registration, right? No
need for QEMU or mgmt software to keep track of MAC changes.

-Siwei

>
>
>>
>> -Siwei
>>>
>>>
 UUID is better.

 -Siwei

> --
> MST
>
>


Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-27 Thread Samudrala, Sridhar

On 6/26/2018 11:21 PM, Siwei Liu wrote:

On Tue, Jun 26, 2018 at 5:29 PM, Michael S. Tsirkin  wrote:

On Tue, Jun 26, 2018 at 04:38:26PM -0700, Siwei Liu wrote:

On Mon, Jun 25, 2018 at 6:50 PM, Michael S. Tsirkin  wrote:

On Mon, Jun 25, 2018 at 10:54:09AM -0700, Samudrala, Sridhar wrote:

Might not necessarily be something wrong, but it's very limited to
prohibit the MAC of VF from changing when enslaved by failover.

You mean guest changing MAC? I'm not sure why we prohibit that.

I think Sridhar and Jiri might be better person to answer it. My
impression was that sync'ing the MAC address change between all 3
devices is challenging, as the failover driver uses MAC address to
match net_device internally.

Yes. The MAC address is assigned by the hypervisor and it needs to manage the
movement of the MAC between the PF and VF.  Allowing the guest to change the
MAC will require synchronization between the hypervisor and the PF/VF drivers.
Most of the VF drivers don't allow changing guest MAC unless it is a trusted VF.

OK but it's a policy thing. Maybe it's a trusted VF. Who knows?
For example I can see host just
failing VIRTIO_NET_CTRL_MAC_ADDR_SET if it wants to block it.
I'm not sure why VIRTIO_NET_F_STANDBY has to block it in the guest.

That's why I think pairing using MAC is fragile IMHO. When VF's MAC
got changed before virtio attempts to match and pair the device, it
ends up with no pairing found out at all.

Guest seems to match on the hardware mac and ignore whatever
is set by user. Makes sense to me and should not be fragile.

Host can change the hardware mac for VF any time.


Live migration is initiated and controlled by the Host, so the Source Host will
reset the MAC during live migration after unplugging the VF. This is to redirect
the VM's frames towards the PF so that they can be received via the virtio-net
standby interface. The destination host will set the VF MAC and plug the VF
after live migration is completed.

Allowing the guest to change the MAC will require the qemu/libvirt/mgmt layers
to track the MAC changes and replay that change after live migration.




-Siwei



UUID is better.

-Siwei


--
MST



Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net

2018-06-27 Thread Siwei Liu
On Tue, Jun 26, 2018 at 5:29 PM, Michael S. Tsirkin  wrote:
> On Tue, Jun 26, 2018 at 04:38:26PM -0700, Siwei Liu wrote:
>> On Mon, Jun 25, 2018 at 6:50 PM, Michael S. Tsirkin  wrote:
>> > On Mon, Jun 25, 2018 at 10:54:09AM -0700, Samudrala, Sridhar wrote:
>> >> > > > > Might not necessarily be something wrong, but it's very limited to
>> >> > > > > prohibit the MAC of VF from changing when enslaved by failover.
>> >> > > > You mean guest changing MAC? I'm not sure why we prohibit that.
>> >> > > I think Sridhar and Jiri might be better person to answer it. My
>> >> > > impression was that sync'ing the MAC address change between all 3
>> >> > > devices is challenging, as the failover driver uses MAC address to
>> >> > > match net_device internally.
>> >>
>> >> Yes. The MAC address is assigned by the hypervisor and it needs to manage
>> >> the movement of the MAC between the PF and VF.  Allowing the guest to
>> >> change the MAC will require synchronization between the hypervisor and the
>> >> PF/VF drivers. Most of the VF drivers don't allow changing guest MAC unless
>> >> it is a trusted VF.
>> >
>> > OK but it's a policy thing. Maybe it's a trusted VF. Who knows?
>> > For example I can see host just
>> > failing VIRTIO_NET_CTRL_MAC_ADDR_SET if it wants to block it.
>> > I'm not sure why VIRTIO_NET_F_STANDBY has to block it in the guest.
>>
>> That's why I think pairing using MAC is fragile IMHO. When VF's MAC
>> got changed before virtio attempts to match and pair the device, it
>> ends up with no pairing found out at all.
>
> Guest seems to match on the hardware mac and ignore whatever
> is set by user. Makes sense to me and should not be fragile.
Host can change the hardware mac for VF any time.

-Siwei
>
>
>> UUID is better.
>>
>> -Siwei
>>
>> >
>> > --
>> > MST