Re: [Qemu-devel] live migration vs device assignment (motivation)
On Dec 30, 2015 00:46, Michael S. Tsirkin wrote:
> Interesting. So you are saying merely ifdown/ifup is 100ms?
> This does not sound reasonable.
> Is there a chance you are e.g. getting IP from dhcp?
>
> If so, that is wrong - clearly we should reconfigure the old IP
> back without playing with dhcp. For testing, just set up
> a static IP.

The MAC and IP are migrated with the VM to the target machine, so there
is no need to reconfigure the IP after migration.

From my test result, ixgbevf_down() consumes 35ms and ixgbevf_up()
consumes 55ms during migration.

--
Best regards
Tianyu Lan
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] live migration vs device assignment (motivation)
On Tue, Dec 29, 2015 at 9:15 AM, Michael S. Tsirkin wrote:
> On Tue, Dec 29, 2015 at 09:04:51AM -0800, Alexander Duyck wrote:
>> On Tue, Dec 29, 2015 at 8:46 AM, Michael S. Tsirkin wrote:
>>> On Tue, Dec 29, 2015 at 01:42:14AM +0800, Lan, Tianyu wrote:
>>>> On 12/25/2015 8:11 PM, Michael S. Tsirkin wrote:
>>>>> As long as you keep up this vague talk about performance during
>>>>> migration, without even bothering with any measurements, this
>>>>> patchset will keep going nowhere.
>>>>
>>>> I measured network service downtime for "keep device alive" (RFC
>>>> patch V1 presented) and "put down and up network interface" (RFC
>>>> patch V2 presented) during migration with some optimizations.
>>>>
>>>> The former is around 140ms and the latter is around 240ms.
>>>>
>>>> My patchset relies on the mailbox IRQ, which doesn't work in the
>>>> suspend state, so I can't get downtime for the suspend/resume cases.
>>>> Will try to get the result later.
>>>
>>> Interesting. So you are saying merely ifdown/ifup is 100ms?
>>> This does not sound reasonable.
>>> Is there a chance you are e.g. getting IP from dhcp?
>>
>> Actually it wouldn't surprise me if that is due to the reset logic in
>> the driver. For starters there is a 10 msec delay in the call to
>> ixgbevf_reset_hw_vf, which I believe is present to allow the PF time
>> to clear registers after the VF has requested a reset. There is also a
>> 10 to 20 msec sleep in ixgbevf_down which occurs after the Rx queues
>> are disabled. That is in addition to the fact that the function that
>> disables the queues does so serially and polls each queue until the
>> hardware acknowledges that the queues are actually disabled. The
>> driver also does the serial enable-with-poll logic when re-enabling
>> the queues, which likely doesn't help things.
>>
>> Really this driver is probably in need of a refactor to clean the
>> cruft out of the reset and initialization logic. I suspect we have far
>> more delays than we really need, and that is the source of much of the
>> slowdown.
>
> For ifdown, why is there any need to reset the device at all?
> Is it so buffers can be reclaimed?

I believe it is mostly historical. All the Intel drivers are derived
from e1000. The e1000 has a 10ms sleep to allow outstanding PCI
transactions to complete before resetting, and it looks like the ixgbevf
driver ended up inheriting that. I suppose it does allow for the buffers
to be reclaimed, which is something we may need, though the VF driver
should have already verified that it disabled the queues when it was
polling on the bits being cleared in the individual queue control
registers. Likely the 10ms sleep is redundant as a result.

- Alex
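The delay arithmetic Alex describes - fixed sleeps inherited from e1000 plus serial per-queue disable-with-poll - can be sketched as a toy model. The constants echo the figures quoted in the thread (10ms reset delay, up to 20ms post-disable sleep); the function name and poll-cost parameter are invented for illustration, not the real ixgbevf code:

```c
#include <assert.h>

/* Toy model of the teardown cost: each queue is disabled serially with
 * a poll loop, then the driver takes the fixed sleeps it inherited from
 * e1000.  Illustrative only - not the actual ixgbevf implementation. */

#define RESET_DELAY_MS     10   /* delay in the VF reset path           */
#define POST_DISABLE_MS    20   /* worst-case sleep after Rx disable    */
#define POLL_INTERVAL_MS    1   /* assumed cost of one queue-reg poll   */

static int serial_teardown_ms(int nqueues, int polls_per_queue)
{
    int ms = 0;

    /* Serial disable: queue q+1 is not touched until queue q acks. */
    for (int q = 0; q < nqueues; q++)
        ms += polls_per_queue * POLL_INTERVAL_MS;

    ms += POST_DISABLE_MS;   /* fixed sleep after the Rx queues go down */
    ms += RESET_DELAY_MS;    /* fixed delay inherited from e1000 reset  */
    return ms;
}
```

Even with zero queues the model pays 30ms of fixed sleeps, which is why removing the redundant sleeps looks more promising than tuning the poll loop.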
Re: [Qemu-devel] live migration vs device assignment (motivation)
On Tue, Dec 29, 2015 at 09:04:51AM -0800, Alexander Duyck wrote:
> On Tue, Dec 29, 2015 at 8:46 AM, Michael S. Tsirkin wrote:
>> On Tue, Dec 29, 2015 at 01:42:14AM +0800, Lan, Tianyu wrote:
>>> On 12/25/2015 8:11 PM, Michael S. Tsirkin wrote:
>>>> As long as you keep up this vague talk about performance during
>>>> migration, without even bothering with any measurements, this
>>>> patchset will keep going nowhere.
>>>
>>> I measured network service downtime for "keep device alive" (RFC
>>> patch V1 presented) and "put down and up network interface" (RFC
>>> patch V2 presented) during migration with some optimizations.
>>>
>>> The former is around 140ms and the latter is around 240ms.
>>>
>>> My patchset relies on the mailbox IRQ, which doesn't work in the
>>> suspend state, so I can't get downtime for the suspend/resume cases.
>>> Will try to get the result later.
>>
>> Interesting. So you are saying merely ifdown/ifup is 100ms?
>> This does not sound reasonable.
>> Is there a chance you are e.g. getting IP from dhcp?
>
> Actually it wouldn't surprise me if that is due to the reset logic in
> the driver. For starters there is a 10 msec delay in the call to
> ixgbevf_reset_hw_vf, which I believe is present to allow the PF time
> to clear registers after the VF has requested a reset. There is also a
> 10 to 20 msec sleep in ixgbevf_down which occurs after the Rx queues
> are disabled. That is in addition to the fact that the function that
> disables the queues does so serially and polls each queue until the
> hardware acknowledges that the queues are actually disabled. The
> driver also does the serial enable-with-poll logic when re-enabling
> the queues, which likely doesn't help things.
>
> Really this driver is probably in need of a refactor to clean the
> cruft out of the reset and initialization logic. I suspect we have far
> more delays than we really need, and that is the source of much of the
> slowdown.
>
> - Alex

For ifdown, why is there any need to reset the device at all?
Is it so buffers can be reclaimed?

--
MST
Re: [Qemu-devel] live migration vs device assignment (motivation)
On Tue, Dec 29, 2015 at 8:46 AM, Michael S. Tsirkin wrote:
> On Tue, Dec 29, 2015 at 01:42:14AM +0800, Lan, Tianyu wrote:
>> On 12/25/2015 8:11 PM, Michael S. Tsirkin wrote:
>>> As long as you keep up this vague talk about performance during
>>> migration, without even bothering with any measurements, this
>>> patchset will keep going nowhere.
>>
>> I measured network service downtime for "keep device alive" (RFC
>> patch V1 presented) and "put down and up network interface" (RFC
>> patch V2 presented) during migration with some optimizations.
>>
>> The former is around 140ms and the latter is around 240ms.
>>
>> My patchset relies on the mailbox IRQ, which doesn't work in the
>> suspend state, so I can't get downtime for the suspend/resume cases.
>> Will try to get the result later.
>
> Interesting. So you are saying merely ifdown/ifup is 100ms?
> This does not sound reasonable.
> Is there a chance you are e.g. getting IP from dhcp?

Actually it wouldn't surprise me if that is due to the reset logic in
the driver. For starters there is a 10 msec delay in the call to
ixgbevf_reset_hw_vf, which I believe is present to allow the PF time to
clear registers after the VF has requested a reset. There is also a
10 to 20 msec sleep in ixgbevf_down which occurs after the Rx queues
are disabled. That is in addition to the fact that the function that
disables the queues does so serially and polls each queue until the
hardware acknowledges that the queues are actually disabled. The driver
also does the serial enable-with-poll logic when re-enabling the queues,
which likely doesn't help things.

Really this driver is probably in need of a refactor to clean the cruft
out of the reset and initialization logic. I suspect we have far more
delays than we really need, and that is the source of much of the
slowdown.

- Alex
Re: [Qemu-devel] live migration vs device assignment (motivation)
On Tue, Dec 29, 2015 at 01:42:14AM +0800, Lan, Tianyu wrote:
> On 12/25/2015 8:11 PM, Michael S. Tsirkin wrote:
>> As long as you keep up this vague talk about performance during
>> migration, without even bothering with any measurements, this patchset
>> will keep going nowhere.
>
> I measured network service downtime for "keep device alive" (RFC patch
> V1 presented) and "put down and up network interface" (RFC patch V2
> presented) during migration with some optimizations.
>
> The former is around 140ms and the latter is around 240ms.
>
> My patchset relies on the mailbox IRQ, which doesn't work in the
> suspend state, so I can't get downtime for the suspend/resume cases.
> Will try to get the result later.

Interesting. So you are saying merely ifdown/ifup is 100ms?
This does not sound reasonable.
Is there a chance you are e.g. getting IP from dhcp?

If so, that is wrong - clearly we should reconfigure the old IP
back without playing with dhcp. For testing, just set up
a static IP.

>> There's Alex's patch that tracks memory changes during migration. It
>> needs some simple enhancements to be useful in production (e.g. add a
>> host/guest handshake to both enable tracking in the guest and to
>> detect the support in the host); then it can allow starting migration
>> with an assigned device, by invoking hot-unplug after most of memory
>> has been migrated.
>>
>> Please implement this in qemu and measure the speed.
>
> Sure. Will do that.

>> I will not be surprised if destroying/creating a netdev in linux
>> turns out to take too long, but before anyone has bothered checking,
>> it does not make sense to discuss further enhancements.
Re: [Qemu-devel] live migration vs device assignment (motivation)
On 12/25/2015 8:11 PM, Michael S. Tsirkin wrote:
> As long as you keep up this vague talk about performance during
> migration, without even bothering with any measurements, this patchset
> will keep going nowhere.

I measured network service downtime for "keep device alive" (RFC patch
V1 presented) and "put down and up network interface" (RFC patch V2
presented) during migration with some optimizations.

The former is around 140ms and the latter is around 240ms.

My patchset relies on the mailbox IRQ, which doesn't work in the suspend
state, so I can't get downtime for the suspend/resume cases. Will try to
get the result later.

> There's Alex's patch that tracks memory changes during migration. It
> needs some simple enhancements to be useful in production (e.g. add a
> host/guest handshake to both enable tracking in the guest and to
> detect the support in the host); then it can allow starting migration
> with an assigned device, by invoking hot-unplug after most of memory
> has been migrated.
>
> Please implement this in qemu and measure the speed.

Sure. Will do that.

> I will not be surprised if destroying/creating a netdev in linux
> turns out to take too long, but before anyone has bothered checking,
> it does not make sense to discuss further enhancements.
Re: [Qemu-devel] live migration vs device assignment (motivation)
On Mon, Dec 28, 2015 at 11:52:43AM +0300, Pavel Fedin wrote:
> Hello!
>
>> A dedicated IRQ per device for something that is a system-wide event
>> sounds like a waste. I don't understand why a spec change is strictly
>> required; we only need to support this with the specific virtual
>> bridge used by QEMU, so I think that a vendor-specific capability
>> will do. Once this works well in the field, a PCI spec ECN might make
>> sense to standardise the capability.
>
> Keeping track of your discussion for some time, I decided to jump in...
> So far, we want to have some kind of mailbox to notify the guest about
> migration. So what about some dedicated "pci device" for this purpose?
> Some kind of "migration controller". This is:
> a) perhaps easier to implement than a capability; we don't need to
> push anything to the PCI spec.
> b) could easily make friends with Windows, because this means that no
> bus code has to be touched at all. It would rely only on drivers'
> ability to communicate with each other (I guess it should be possible
> in Windows, shouldn't it?)
> c) does not need to steal resources (BARs, IRQs, etc.) from the actual
> devices.
>
> Kind regards,
> Pavel Fedin
> Expert Engineer
> Samsung Electronics Research center Russia

Sure, or we can use an ACPI device. It doesn't really matter what we do
for the mailbox. Whoever writes this first will get to select a
mechanism.

--
MST
Re: [Qemu-devel] live migration vs device assignment (motivation)
On Mon, Dec 28, 2015 at 03:20:10AM, Dong, Eddie wrote:
>>> Even if the device driver doesn't support migration, you still want
>>> to migrate the VM? That may be risky and we should add the "bad
>>> path" for the driver at least.
>>
>> At a minimum we should have support for hot-plug if we are expecting
>> to support migration. You would simply have to hot-plug the device
>> before you start migration and then return it after. That is how the
>> current bonding approach for this works if I am not mistaken.
>
> Hotplug is good to eliminate the device-specific state clone, but the
> bonding approach is very network specific; it doesn't work for other
> devices such as FPGA devices, QAT devices & GPU devices, which we plan
> to support gradually :)

Alexander didn't say do bonding. He just said bonding uses hot-unplug.

Gradual and generic is the correct approach. So focus on splitting the
work into manageable pieces which are also useful by themselves, and
generally reusable by different devices.

So leave the pausing aside for a moment. Start from Alexander's patchset
for tracking dirty memory, add a way to control and detect it from
userspace (and maybe from the host), and a way to start migration while
the device is attached, removing it at the last possible moment. That
will be a nice first step.

>> The advantage we are looking to gain is to avoid removing/disabling
>> the device for as long as possible. Ideally we want to keep the
>> device active through the warm-up period, but if the guest doesn't do
>> that we should still be able to fall back on the older approaches if
>> needed.
RE: [Qemu-devel] live migration vs device assignment (motivation)
Hello!

> A dedicated IRQ per device for something that is a system-wide event
> sounds like a waste. I don't understand why a spec change is strictly
> required; we only need to support this with the specific virtual
> bridge used by QEMU, so I think that a vendor-specific capability will
> do. Once this works well in the field, a PCI spec ECN might make sense
> to standardise the capability.

Keeping track of your discussion for some time, I decided to jump in...
So far, we want to have some kind of mailbox to notify the guest about
migration. So what about some dedicated "pci device" for this purpose?
Some kind of "migration controller". This is:
a) perhaps easier to implement than a capability; we don't need to push
anything to the PCI spec.
b) could easily make friends with Windows, because this means that no
bus code has to be touched at all. It would rely only on drivers'
ability to communicate with each other (I guess it should be possible
in Windows, shouldn't it?)
c) does not need to steal resources (BARs, IRQs, etc.) from the actual
devices.

Kind regards,
Pavel Fedin
Expert Engineer
Samsung Electronics Research center Russia
Re: [Qemu-devel] live migration vs device assignment (motivation)
On Sun, Dec 27, 2015 at 01:45:15PM -0800, Alexander Duyck wrote:
> On Sun, Dec 27, 2015 at 1:21 AM, Michael S. Tsirkin wrote:
>> On Fri, Dec 25, 2015 at 02:31:14PM -0800, Alexander Duyck wrote:
>>> The PCI hot-plug specification calls out that the OS can optionally
>>> implement a "pause" mechanism which is meant to be used for
>>> high-availability type environments. What I am proposing is
>>> basically extending the standard SHPC-capable PCI bridge so that we
>>> can support the DMA page dirtying for everything hosted on it, add a
>>> vendor-specific block to the config space so that the guest can
>>> notify the host that it will do page dirtying, and add a mechanism
>>> to indicate that all hot-plug events during the warm-up phase of the
>>> migration are pause events instead of full removals.
>>
>> Two comments:
>>
>> 1. A vendor-specific capability will always be problematic.
>> Better to register a capability id with the PCI SIG.
>>
>> 2. There are actually several capabilities:
>>
>> A. support for memory dirtying
>> if not supported, we must stop the device before migration
>>
>> This is supported by core guest OS code,
>> using patches similar to those posted by you.
>>
>> B. support for device replacement
>> This is a faster form of hotplug, where a device is removed and
>> later another device using the same driver is inserted in the same
>> slot.
>>
>> This is a possible optimization, but I am convinced
>> (A) should be implemented independently of (B).
>
> My thought on this was that we don't need much to really implement
> either feature. Really only a bit or two for either one. I had
> thought about extending the PCI Advanced Features, but for now it
> might make more sense to just implement it as a vendor capability for
> the QEMU-based bridges instead of trying to make this a true PCI
> capability, since I am not sure if this would in any way apply to
> physical hardware. The fact is the PCI Advanced Features capability
> is essentially just a vendor-specific capability with a different ID,

Interesting. I see it more as a backport of PCI Express features to PCI.

> so if we were to use 2 bits that are currently reserved in the
> capability we could later merge the functionality without much
> overhead.

Don't do this. You must not touch reserved bits.

> I fully agree that the two implementations should be separate, but
> nothing says we have to implement them completely differently. If we
> are just using 3 bits for capability, status, and control of each
> feature there is no reason for them to need to be stored in separate
> locations.

True.

>>> I've been poking around in the kernel and QEMU code, and the part I
>>> have been trying to sort out is how to get the QEMU-based pci-bridge
>>> to use the SHPC driver, because from what I can tell the driver
>>> never actually gets loaded on the device as it is left in the
>>> control of ACPI hot-plug.
>>
>> There are ways, but you can just use PCI Express, it's easier.
>
> That's true. I should probably just give up on trying to do an
> implementation that works with the i440fx implementation. I could
> probably move over to the q35, and once that is done then we could
> look at something like the PCI Advanced Features solution for
> something like the PCI-bridge drivers.
>
> - Alex

Once we have a decent idea of what's required, I can write an ECN for
the PCI Code and ID Assignment Specification. That's cleaner than
vendor-specific stuff that's tied to a specific device/vendor ID.

--
MST
Re: [Qemu-devel] live migration vs device assignment (motivation)
On Sun, Dec 27, 2015 at 7:20 PM, Dong, Eddie wrote:
>>> Even if the device driver doesn't support migration, you still want
>>> to migrate the VM? That may be risky and we should add the "bad
>>> path" for the driver at least.
>>
>> At a minimum we should have support for hot-plug if we are expecting
>> to support migration. You would simply have to hot-plug the device
>> before you start migration and then return it after. That is how the
>> current bonding approach for this works if I am not mistaken.
>
> Hotplug is good to eliminate the device-specific state clone, but the
> bonding approach is very network specific; it doesn't work for other
> devices such as FPGA devices, QAT devices & GPU devices, which we plan
> to support gradually :)

Hotplug would be usable for that, assuming the guest supports the
optional "pause" implementation as called out in the PCI hotplug spec.
With that, the device can maintain state for some period of time after
the hotplug remove event has occurred. The problem is that you have to
get the device to quiesce at some point, as you cannot complete the
migration with the device still active.

The way you were doing it was using the per-device configuration space
mechanism. That doesn't scale when you have to implement it for each
and every driver for each and every OS you have to support. Using the
"pause" implementation for hot-plug would have a much greater likelihood
of scaling, as you could either take the fast-path approach of "pausing"
the device and resuming it when migration has completed, or just remove
the device and restart the driver on the other side if the pause support
is not yet implemented. You would lose the state under such a migration,
but it is much more practical than having to implement a per-device
solution.

- Alex
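The fast-path/slow-path split Alex describes could be sketched roughly as below. All type and field names here are hypothetical, invented for illustration; none of this comes from a posted patch:

```c
#include <assert.h>

/* Illustrative decision flow: pause the device across migration when
 * the guest implements the optional PCI hot-plug "pause", otherwise
 * fall back to a full remove plus driver restart on the target.
 * Hypothetical names - not from any real patchset. */

enum migrate_action { MIGRATE_PAUSE, MIGRATE_REMOVE };

struct assigned_dev {
    int guest_supports_pause;  /* guest acked the optional "pause"     */
    int state_preserved;       /* device state survives the migration  */
};

static enum migrate_action plan_migration(struct assigned_dev *dev)
{
    if (dev->guest_supports_pause) {
        /* Fast path: device keeps its state and resumes on the target. */
        dev->state_preserved = 1;
        return MIGRATE_PAUSE;
    }
    /* Slow path: hot-unplug before warm-up; driver restarts, state lost. */
    dev->state_preserved = 0;
    return MIGRATE_REMOVE;
}
```

The point of the fallback is that migration always has a workable path even when the guest predates pause support.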
RE: [Qemu-devel] live migration vs device assignment (motivation)
>> Even if the device driver doesn't support migration, you still want
>> to migrate the VM? That may be risky and we should add the "bad path"
>> for the driver at least.
>
> At a minimum we should have support for hot-plug if we are expecting
> to support migration. You would simply have to hot-plug the device
> before you start migration and then return it after. That is how the
> current bonding approach for this works if I am not mistaken.

Hotplug is good to eliminate the device-specific state clone, but the
bonding approach is very network specific; it doesn't work for other
devices such as FPGA devices, QAT devices & GPU devices, which we plan
to support gradually :)

> The advantage we are looking to gain is to avoid removing/disabling
> the device for as long as possible. Ideally we want to keep the device
> active through the warm-up period, but if the guest doesn't do that we
> should still be able to fall back on the older approaches if needed.
Re: [Qemu-devel] live migration vs device assignment (motivation)
On Sun, Dec 27, 2015 at 1:21 AM, Michael S. Tsirkin wrote:
> On Fri, Dec 25, 2015 at 02:31:14PM -0800, Alexander Duyck wrote:
>> The PCI hot-plug specification calls out that the OS can optionally
>> implement a "pause" mechanism which is meant to be used for
>> high-availability type environments. What I am proposing is basically
>> extending the standard SHPC-capable PCI bridge so that we can support
>> the DMA page dirtying for everything hosted on it, add a
>> vendor-specific block to the config space so that the guest can
>> notify the host that it will do page dirtying, and add a mechanism to
>> indicate that all hot-plug events during the warm-up phase of the
>> migration are pause events instead of full removals.
>
> Two comments:
>
> 1. A vendor-specific capability will always be problematic.
> Better to register a capability id with the PCI SIG.
>
> 2. There are actually several capabilities:
>
> A. support for memory dirtying
> if not supported, we must stop the device before migration
>
> This is supported by core guest OS code,
> using patches similar to those posted by you.
>
> B. support for device replacement
> This is a faster form of hotplug, where a device is removed and
> later another device using the same driver is inserted in the same
> slot.
>
> This is a possible optimization, but I am convinced
> (A) should be implemented independently of (B).

My thought on this was that we don't need much to really implement
either feature. Really only a bit or two for either one. I had thought
about extending the PCI Advanced Features, but for now it might make
more sense to just implement it as a vendor capability for the
QEMU-based bridges instead of trying to make this a true PCI capability,
since I am not sure if this would in any way apply to physical hardware.
The fact is the PCI Advanced Features capability is essentially just a
vendor-specific capability with a different ID, so if we were to use 2
bits that are currently reserved in the capability we could later merge
the functionality without much overhead.

I fully agree that the two implementations should be separate, but
nothing says we have to implement them completely differently. If we
are just using 3 bits for capability, status, and control of each
feature there is no reason for them to need to be stored in separate
locations.

>> I've been poking around in the kernel and QEMU code, and the part I
>> have been trying to sort out is how to get the QEMU-based pci-bridge
>> to use the SHPC driver, because from what I can tell the driver never
>> actually gets loaded on the device as it is left in the control of
>> ACPI hot-plug.
>
> There are ways, but you can just use PCI Express, it's easier.

That's true. I should probably just give up on trying to do an
implementation that works with the i440fx implementation. I could
probably move over to the q35, and once that is done then we could look
at something like the PCI Advanced Features solution for something like
the PCI-bridge drivers.

- Alex
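For reference, a guest driver would find a vendor-specific capability like the one being discussed by walking the standard config-space capability list (vendor-specific capabilities use ID 0x09). The sketch below operates on a 256-byte snapshot of config space; it is illustrative code under that assumption, not the actual QEMU or kernel implementation:

```c
#include <stdint.h>
#include <assert.h>

#define PCI_STATUS          0x06  /* status register (low byte here)   */
#define PCI_STATUS_CAP_LIST 0x10  /* bit 4: capability list present    */
#define PCI_CAPABILITY_LIST 0x34  /* offset of first capability        */
#define PCI_CAP_ID_VNDR     0x09  /* vendor-specific capability ID     */

/* Walk the capability linked list in a config-space snapshot and
 * return the offset of the first vendor-specific capability, or -1. */
static int find_vendor_cap(const uint8_t cfg[256])
{
    if (!(cfg[PCI_STATUS] & PCI_STATUS_CAP_LIST))
        return -1;                    /* device has no capability list */

    uint8_t pos = cfg[PCI_CAPABILITY_LIST] & 0xFC;
    /* Guard against malformed (looping) lists. */
    for (int guard = 0; pos && guard < 48; guard++) {
        if (cfg[pos] == PCI_CAP_ID_VNDR)
            return pos;               /* byte 0: cap ID                */
        pos = cfg[pos + 1] & 0xFC;    /* byte 1: next-capability ptr   */
    }
    return -1;
}
```

Registering real bits with the PCI SIG, as Michael suggests, would mean the guest looks for a dedicated capability ID here instead of matching on vendor-specific contents.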
Re: [Qemu-devel] live migration vs device assignment (motivation)
On Fri, Dec 25, 2015 at 02:31:14PM -0800, Alexander Duyck wrote:
> The PCI hot-plug specification calls out that the OS can optionally
> implement a "pause" mechanism which is meant to be used for
> high-availability type environments. What I am proposing is basically
> extending the standard SHPC-capable PCI bridge so that we can support
> the DMA page dirtying for everything hosted on it, add a
> vendor-specific block to the config space so that the guest can notify
> the host that it will do page dirtying, and add a mechanism to
> indicate that all hot-plug events during the warm-up phase of the
> migration are pause events instead of full removals.

Two comments:

1. A vendor-specific capability will always be problematic.
Better to register a capability id with the PCI SIG.

2. There are actually several capabilities:

A. support for memory dirtying
if not supported, we must stop the device before migration

This is supported by core guest OS code,
using patches similar to those posted by you.

B. support for device replacement
This is a faster form of hotplug, where a device is removed and
later another device using the same driver is inserted in the same slot.

This is a possible optimization, but I am convinced
(A) should be implemented independently of (B).

> I've been poking around in the kernel and QEMU code, and the part I
> have been trying to sort out is how to get the QEMU-based pci-bridge
> to use the SHPC driver, because from what I can tell the driver never
> actually gets loaded on the device as it is left in the control of
> ACPI hot-plug.
>
> - Alex

There are ways, but you can just use PCI Express, it's easier.

--
MST
Re: [Qemu-devel] live migration vs device assignment (motivation)
On Thu, Dec 24, 2015 at 11:03 PM, Lan Tianyu wrote:
> Merry Christmas.
> Sorry for the late response due to a personal affair.
>
> On Dec 14, 2015 03:30, Alexander Duyck wrote:
>>> These sound like we need to add a faked bridge for migration and add
>>> a driver in the guest for it. It also needs to extend the PCI
>>> bus/hotplug driver to do pause/resume on other devices, right?
>>>
>>> My concern is still whether we can change PCI bus/hotplug like that
>>> without a spec change.
>>>
>>> IRQ should be general for any devices and we may extend it for
>>> migration. The device driver also can make the decision to support
>>> migration or not.
>>
>> The device should have no say in the matter. Either we are going to
>> migrate or we will not. This is why I have suggested my approach, as
>> it allows for the least amount of driver intrusion while providing
>> the maximum number of ways to still perform migration even if the
>> device doesn't support it.
>
> Even if the device driver doesn't support migration, you still want to
> migrate the VM? That may be risky and we should add the "bad path" for
> the driver at least.

At a minimum we should have support for hot-plug if we are expecting to
support migration. You would simply have to hot-plug the device before
you start migration and then return it after. That is how the current
bonding approach for this works if I am not mistaken.

The advantage we are looking to gain is to avoid removing/disabling the
device for as long as possible. Ideally we want to keep the device
active through the warm-up period, but if the guest doesn't do that we
should still be able to fall back on the older approaches if needed.

>> The solution I have proposed is simple:
>>
>> 1. Extend swiotlb to allow for a page dirtying functionality.
>>
>> This part is pretty straightforward. I'll submit a few patches later
>> today as RFC that can provide the minimal functionality needed for
>> this.
>
> That would be very much appreciated.
>
>> 2. Provide a vendor-specific configuration space option on the QEMU
>> implementation of a PCI bridge to act as a bridge between direct
>> assigned devices and the host bridge.
>>
>> My thought was to add some vendor-specific block that includes
>> capabilities, status, and control registers so you could go through
>> and synchronize things like the DMA page dirtying feature. The bridge
>> itself could manage the migration-capable bit inside QEMU for all
>> devices assigned to it. So if you added a VF to the bridge it would
>> flag that you can support migration in QEMU, while the bridge would
>> indicate you cannot until the DMA page dirtying control bit is set by
>> the guest.
>>
>> We could also go through and optimize the DMA page dirtying after
>> this is added so that we can narrow down the scope of use, and as a
>> result improve the performance for other devices that don't need to
>> support migration. It would then be a matter of adding an interrupt
>> in the device to handle an event such as the DMA page dirtying status
>> bit being set in the config space status register while the bit is
>> not set in the control register. If it doesn't get set then we would
>> have to evict the devices before the warm-up phase of the migration;
>> otherwise we can defer it until the end of the warm-up phase.
>>
>> 3. Extend the existing shpc driver to support the optional "pause"
>> functionality as called out in section 4.1.2 of the Revision 1.1 PCI
>> hot-plug specification.
>
> Since your solution has added a faked PCI bridge, why not notify the
> bridge directly during migration via an irq and call the device
> driver's callback in the new bridge driver?
>
> Otherwise, the new bridge driver also can check whether the device
> driver provides a migration callback or not and call it to improve the
> passthrough device's performance during migration.

This is basically what I had in mind, though I would take things one
step further. You don't need to add any new callbacks if you make use
of the existing suspend/resume logic. For a VF this does exactly what
you would need, since the VFs don't support wake-on-LAN, so it will
simply clear the bus master enable and put the netdev in a suspended
state until resume can be called.

The PCI hot-plug specification calls out that the OS can optionally
implement a "pause" mechanism which is meant to be used for
high-availability type environments. What I am proposing is basically
extending the standard SHPC-capable PCI bridge so that we can support
the DMA page dirtying for everything hosted on it, add a vendor-specific
block to the config space so that the guest can notify the host that it
will do page dirtying, and add a mechanism to indicate that all hot-plug
events during the warm-up phase of the migration are pause events
instead of full removals.

I've been poking around in the kernel and QEMU code, and the part I have
been trying to sort out is how to get the QEMU-based pci-bridge to use
the SHPC driver, because from what I can tell the driver never actually
gets loaded on the device as it is left in the control of ACPI hot-plug.
Re: [Qemu-devel] live migration vs device assignment (motivation)
On Fri, Dec 25, 2015 at 03:03:47PM +0800, Lan Tianyu wrote: > Merry Christmas. > Sorry for later response due to personal affair. > > On 2015年12月14日 03:30, Alexander Duyck wrote: > >> > These sounds we need to add a faked bridge for migration and adding a > >> > driver in the guest for it. It also needs to extend PCI bus/hotplug > >> > driver to do pause/resume other devices, right? > >> > > >> > My concern is still that whether we can change PCI bus/hotplug like that > >> > without spec change. > >> > > >> > IRQ should be general for any devices and we may extend it for > >> > migration. Device driver also can make decision to support migration > >> > or not. > > The device should have no say in the matter. Either we are going to > > migrate or we will not. This is why I have suggested my approach as > > it allows for the least amount of driver intrusion while providing the > > maximum number of ways to still perform migration even if the device > > doesn't support it. > > Even if the device driver doesn't support migration, you still want to > migrate VM? That maybe risk and we should add the "bad path" for the > driver at least. > > > > > The solution I have proposed is simple: > > > > 1. Extend swiotlb to allow for a page dirtying functionality. > > > > This part is pretty straight forward. I'll submit a few patches > > later today as RFC that can provided the minimal functionality needed > > for this. > > Very appreciate to do that. > > > > > 2. Provide a vendor specific configuration space option on the QEMU > > implementation of a PCI bridge to act as a bridge between direct > > assigned devices and the host bridge. > > > > My thought was to add some vendor specific block that includes a > > capabilities, status, and control register so you could go through and > > synchronize things like the DMA page dirtying feature. The bridge > > itself could manage the migration capable bit inside QEMU for all > > devices assigned to it. 
So if you added a VF to the bridge it would > > flag that you can support migration in QEMU, while the bridge would > > indicate you cannot until the DMA page dirtying control bit is set by > > the guest. > > > > We could also go through and optimize the DMA page dirtying after > > this is added so that we can narrow down the scope of use, and as a > > result improve the performance for other devices that don't need to > > support migration. It would then be a matter of adding an interrupt > > in the device to handle an event such as the DMA page dirtying status > > bit being set in the config space status register, while the bit is > > not set in the control register. If it doesn't get set then we would > > have to evict the devices before the warm-up phase of the migration, > > otherwise we can defer it until the end of the warm-up phase. > > > > 3. Extend existing shpc driver to support the optional "pause" > > functionality as called out in section 4.1.2 of the Revision 1.1 PCI > > hot-plug specification. > > Since your solution has added a faked PCI bridge. Why not notify the > bridge directly during migration via irq and call device driver's > callback in the new bridge driver? > > Otherwise, the new bridge driver also can check whether the device > driver provides migration callback or not and call them to improve the > passthough device's performance during migration. As long as you keep up this vague talk about performance during migration, without even bothering with any measurements, this patchset will keep going nowhere. There's Alex's patch that tracks memory changes during migration. It needs some simple enhancements to be useful in production (e.g. add a host/guest handshake to both enable tracking in guest and to detect the support in host), then it can allow starting migration with an assigned device, by invoking hot-unplug after most of memory have been migrated. Please implement this in qemu and measure the speed. 
I will not be surprised if destroying/creating netdev in linux turns out to take too long, but before anyone bothered checking, it does not make sense to discuss further enhancements. > > > > Note I call out "extend" here instead of saying to add this. > > Basically what we should do is provide a means of quiescing the device > > without unloading the driver. This is called out as something the OS > > vendor can optionally implement in the PCI hot-plug specification. On > > OSes that wouldn't support this it would just be treated as a standard > > hot-plug event. We could add a capability, status, and control bit > > in the vendor specific configuration block for this as well and if we > > set the status bit would indicate the host wants to pause instead of > > remove and the control bit would indicate the guest supports "pause" > > in the OS. We then could optionally disable guest migration while the > > VF is present and pause is not supported. > > > > To support this we would need
Re: [Qemu-devel] live migration vs device assignment (motivation)
Merry Christmas. Sorry for the late response; I was away on a personal matter. On 2015年12月14日 03:30, Alexander Duyck wrote: >> > These sounds we need to add a faked bridge for migration and adding a >> > driver in the guest for it. It also needs to extend PCI bus/hotplug >> > driver to do pause/resume other devices, right? >> > >> > My concern is still that whether we can change PCI bus/hotplug like that >> > without spec change. >> > >> > IRQ should be general for any devices and we may extend it for >> > migration. Device driver also can make decision to support migration >> > or not. > The device should have no say in the matter. Either we are going to > migrate or we will not. This is why I have suggested my approach as > it allows for the least amount of driver intrusion while providing the > maximum number of ways to still perform migration even if the device > doesn't support it. Even if the device driver doesn't support migration, do you still want to migrate the VM? That may be risky, and we should at least add the "bad path" handling for the driver. > > The solution I have proposed is simple: > > 1. Extend swiotlb to allow for a page dirtying functionality. > > This part is pretty straight forward. I'll submit a few patches > later today as RFC that can provided the minimal functionality needed > for this. That would be very much appreciated. > > 2. Provide a vendor specific configuration space option on the QEMU > implementation of a PCI bridge to act as a bridge between direct > assigned devices and the host bridge. > > My thought was to add some vendor specific block that includes a > capabilities, status, and control register so you could go through and > synchronize things like the DMA page dirtying feature. The bridge > itself could manage the migration capable bit inside QEMU for all > devices assigned to it. 
So if you added a VF to the bridge it would > flag that you can support migration in QEMU, while the bridge would > indicate you cannot until the DMA page dirtying control bit is set by > the guest. > > We could also go through and optimize the DMA page dirtying after > this is added so that we can narrow down the scope of use, and as a > result improve the performance for other devices that don't need to > support migration. It would then be a matter of adding an interrupt > in the device to handle an event such as the DMA page dirtying status > bit being set in the config space status register, while the bit is > not set in the control register. If it doesn't get set then we would > have to evict the devices before the warm-up phase of the migration, > otherwise we can defer it until the end of the warm-up phase. > > 3. Extend existing shpc driver to support the optional "pause" > functionality as called out in section 4.1.2 of the Revision 1.1 PCI > hot-plug specification. Since your solution already adds a fake PCI bridge, why not notify the > bridge directly during migration via irq and call the device driver's > callback in the new bridge driver? Alternatively, the new bridge driver could also check whether the device > driver provides migration callbacks and call them to improve the > passthrough device's performance during migration. > > Note I call out "extend" here instead of saying to add this. > Basically what we should do is provide a means of quiescing the device > without unloading the driver. This is called out as something the OS > vendor can optionally implement in the PCI hot-plug specification. On > OSes that wouldn't support this it would just be treated as a standard > hot-plug event. We could add a capability, status, and control bit > in the vendor specific configuration block for this as well and if we > set the status bit would indicate the host wants to pause instead of > remove and the control bit would indicate the guest supports "pause" > in the OS. 
We then could optionally disable guest migration while the > VF is present and pause is not supported. > > To support this we would need to add a timer and if a new device > is not inserted in some period of time (60 seconds for example), or if > a different device is inserted, > we need to unload the original driver > from the device. In addition we would need to verify if drivers can > call the remove function after having called suspend without resume. > If not, we could look at adding a recovery function to remove the > driver from the device in the case of a suspend with either a failed > resume or no resume call. Once again it would probably be useful to > have for those cases where power management suspend/resume runs into > an issue like somebody causing a surprise removal while a device was > suspended. -- Best regards Tianyu Lan -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
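The timer-based fallback described in the message above (resume if the same device returns within the window, otherwise unload the driver) can be modeled in a few lines of plain C. This is purely an illustrative sketch: the names, the 60-second value, and the decision logic come from the discussion, not from any existing implementation.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the proposed "pause" timeout policy: after a pause-style
 * hot-unplug, either the same device returns within the timeout and the
 * driver is resumed, or we fall back to a full driver unload.  All names
 * here are hypothetical. */

enum pause_outcome { RESUME_DEVICE, UNLOAD_DRIVER };

#define PAUSE_TIMEOUT_SECS 60 /* example window from the discussion */

enum pause_outcome handle_reinsertion(unsigned int elapsed_secs,
                                      bool same_device)
{
    if (elapsed_secs > PAUSE_TIMEOUT_SECS)
        return UNLOAD_DRIVER;   /* window expired: treat as a real removal */
    if (!same_device)
        return UNLOAD_DRIVER;   /* a different device was inserted */
    return RESUME_DEVICE;       /* same device came back in time */
}
```

The interesting open question from the thread (whether remove can follow suspend without resume) is not captured here; this only models the policy decision itself.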
Re: [Qemu-devel] live migration vs device assignment (motivation)
On Sun, Dec 13, 2015 at 11:47:44PM +0800, Lan, Tianyu wrote: > > > On 12/11/2015 1:16 AM, Alexander Duyck wrote: > >On Thu, Dec 10, 2015 at 6:38 AM, Lan, Tianyu wrote: > >> > >> > >>On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote: > > Ideally, it is able to leave guest driver unmodified but it requires the > >hypervisor or qemu to aware the device which means we may need a driver > >in > >hypervisor or qemu to handle the device on behalf of guest driver. > >>> > >>>Can you answer the question of when do you use your code - > >>> at the start of migration or > >>> just before the end? > >> > >> > >>Just before stopping VCPU in this version and inject VF mailbox irq to > >>notify the driver if the irq handler is installed. > >>Qemu side also will check this via the faked PCI migration capability > >>and driver will set the status during device open() or resume() callback. > > > >The VF mailbox interrupt is a very bad idea. Really the device should > >be in a reset state on the other side of a migration. It doesn't make > >sense to have the interrupt firing if the device is not configured. > >This is one of the things that is preventing you from being able to > >migrate the device while the interface is administratively down or the > >VF driver is not loaded. > > From my opinion, if VF driver is not loaded and hardware doesn't start > to work, the device state doesn't need to be migrated. > > We may add a flag for driver to check whether migration happened during it's > down and reinitialize the hardware and clear the flag when system try to put > it up. > > We may add migration core in the Linux kernel and provide some helps > functions to facilitate to add migration support for drivers. > Migration core is in charge to sync status with Qemu. > > Example. > migration_register() > Driver provides > - Callbacks to be called before and after migration or for bad path > - Its irq which it prefers to deal with migration event. 
> > migration_event_check() > Driver calls it in the irq handler. Migration core code will check > migration status and call its callbacks when migration happens. > > > > > >My thought on all this is that it might make sense to move this > >functionality into a PCI-to-PCI bridge device and make it a > >requirement that all direct-assigned devices have to exist behind that > >device in order to support migration. That way you would be working > >with a directly emulated device that would likely already be > >supporting hot-plug anyway. Then it would just be a matter of coming > >up with a few Qemu specific extensions that you would need to add to > >the device itself. The same approach would likely be portable enough > >that you could achieve it with PCIe as well via the same configuration > >space being present on the upstream side of a PCIe port or maybe a > >PCIe switch of some sort. > > > >It would then be possible to signal via your vendor-specific PCI > >capability on that device that all devices behind this bridge require > >DMA page dirtying, you could use the configuration in addition to the > >interrupt already provided for hot-plug to signal things like when you > >are starting migration, and possibly even just extend the shpc > >functionality so that if this capability is present you have the > >option to pause/resume instead of remove/probe the device in the case > >of certain hot-plug events. The fact is there may be some use for a > >pause/resume type approach for PCIe hot-plug in the near future > >anyway. From the sounds of it Apple has required it for all > >Thunderbolt device drivers so that they can halt the device in order > >to shuffle resources around, perhaps we should look at something > >similar for Linux. > > > >The other advantage behind grouping functions on one bridge is things > >like reset domains. 
The PCI error handling logic will want to be able > >to reset any devices that experienced an error in the event of > >something such as a surprise removal. By grouping all of the devices > >you could disable/reset/enable them as one logical group in the event > >of something such as the "bad path" approach Michael has mentioned. > > > > These sounds we need to add a faked bridge for migration and adding a > driver in the guest for it. It also needs to extend PCI bus/hotplug > driver to do pause/resume other devices, right? > > My concern is still that whether we can change PCI bus/hotplug like that > without spec change. > > IRQ should be general for any devices and we may extend it for > migration. Device driver also can make decision to support migration > or not. A dedicated IRQ per device for something that is a system wide event sounds like a waste. I don't understand why a spec change is strictly required, we only need to support this with the specific virtual bridge used by QEMU, so I think that a vendor specific capability will do. Once this works well in the field, a PCI spec ECN might make sense to
Re: [Qemu-devel] live migration vs device assignment (motivation)
On Fri, Dec 11, 2015 at 03:32:04PM +0800, Lan, Tianyu wrote: > > > On 12/11/2015 12:11 AM, Michael S. Tsirkin wrote: > >On Thu, Dec 10, 2015 at 10:38:32PM +0800, Lan, Tianyu wrote: > >> > >> > >>On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote: > Ideally, it is able to leave guest driver unmodified but it requires the > >hypervisor or qemu to aware the device which means we may need a driver > >in > >hypervisor or qemu to handle the device on behalf of guest driver. > >>>Can you answer the question of when do you use your code - > >>>at the start of migration or > >>>just before the end? > >> > >>Just before stopping VCPU in this version and inject VF mailbox irq to > >>notify the driver if the irq handler is installed. > >>Qemu side also will check this via the faked PCI migration capability > >>and driver will set the status during device open() or resume() callback. > > > >Right, this is the "good path" optimization. Whether this buys anything > >as compared to just sending reset to the device when VCPU is stopped > >needs to be measured. In any case, we probably do need a way to > >interrupt driver on destination to make it reconfigure the device - > >otherwise it might take seconds for it to notice. And a way to make > >sure driver can handle this surprise reset so we can block migration if > >it can't. > > > > Yes, we need such a way to notify driver about migration status and do > reset or restore operation on the destination machine. My original > design is to take advantage of device's irq to do that. Driver can tell > Qemu that which irq it prefers to handle such task and whether the irq > is enabled or bound with handler. We may discuss the detail in the other > thread. 
> > >>> > >>>It would be great if we could avoid changing the guest; but at least > >>>your guest > >>>driver changes don't actually seem to be that hardware specific; could > >>>your > >>>changes actually be moved to generic PCI level so they could be made > >>>to work for lots of drivers? > > > >It is impossible to use one common solution for all devices unless the > >PCIE > >spec documents it clearly and i think one day it will be there. But > >before > >that, we need some workarounds on guest driver to make it work even it > >looks > >ugly. > >> > >>Yes, so far there is not hardware migration support > > > >VT-D supports setting dirty bit in the PTE in hardware. > > Actually, this doesn't support in the current hardware. > VTD spec documents the dirty bit for first level translation which > requires devices to support DMA request with PASID(process > address space identifier). Most device don't support the feature. True, I missed this. It's generally unfortunate that first level translation only applies to requests with PASID. All other features limited to requests with PASID like nested translation would be very useful for all requests, not just requests with PASID. > > > >>and it's hard to modify > >>bus level code. > > > >Why is it hard? > > As Yang said, the concern is that PCI Spec doesn't document about how to do > migration. We can submit a PCI spec ECN documenting a new capability. I think for existing devices which lack it, adding this capability to the bridge to which the device is attached is preferable to trying to add it to the device itself. > > > >>It also will block implementation on the Windows. > > > >Implementation of what? We are discussing motivation here, not > >implementation. E.g. windows drivers typically support surprise > >removal, should you use that, you get some working code for free. Just > >stop worrying about it. Make it work, worry about closed source > >software later. 
> > >>>Dave
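The vendor-specific capability block discussed above (capability, status, and control registers covering DMA page dirtying and "pause") might look roughly like the following. The bit positions and register names are invented for illustration; nothing here is from the PCI spec, QEMU, or any posted patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout of the vendor-specific migration capability on the
 * emulated bridge: the host advertises features in 'cap', requests action
 * in 'status', and the guest acknowledges support in 'ctrl'. */

#define MIG_CAP_DMA_DIRTY   (1u << 0)  /* host can track DMA-dirtied pages */
#define MIG_CAP_PAUSE       (1u << 1)  /* host supports pause vs. remove   */

#define MIG_CTRL_DMA_DIRTY  (1u << 0)  /* guest enables page dirtying      */
#define MIG_CTRL_PAUSE      (1u << 1)  /* guest OS implements "pause"      */

struct mig_cap {
    uint32_t cap;    /* read-only: what the host offers     */
    uint32_t status; /* read-only: what the host requests   */
    uint32_t ctrl;   /* read-write: what the guest supports */
};

/* Migration behind this bridge is allowed only once the guest has
 * acknowledged DMA page dirtying (if the host requires it). */
int migration_ready(const struct mig_cap *c)
{
    if ((c->cap & MIG_CAP_DMA_DIRTY) && !(c->ctrl & MIG_CTRL_DMA_DIRTY))
        return 0;
    return 1;
}
```

In the scheme discussed, QEMU would leave `migration_ready()` false (forcing hot-unplug before warm-up) until the guest driver flips the control bit.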
Re: [Qemu-devel] live migration vs device assignment (motivation)
On Sun, Dec 13, 2015 at 7:47 AM, Lan, Tianyu wrote: > > > On 12/11/2015 1:16 AM, Alexander Duyck wrote: >> >> On Thu, Dec 10, 2015 at 6:38 AM, Lan, Tianyu wrote: >>> >>> >>> >>> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote: > > > Ideally, it is able to leave guest driver unmodified but it requires > the >> >> hypervisor or qemu to aware the device which means we may need a >> driver >> in >> hypervisor or qemu to handle the device on behalf of guest driver. Can you answer the question of when do you use your code - at the start of migration or just before the end? >>> >>> >>> >>> Just before stopping VCPU in this version and inject VF mailbox irq to >>> notify the driver if the irq handler is installed. >>> Qemu side also will check this via the faked PCI migration capability >>> and driver will set the status during device open() or resume() callback. >> >> >> The VF mailbox interrupt is a very bad idea. Really the device should >> be in a reset state on the other side of a migration. It doesn't make >> sense to have the interrupt firing if the device is not configured. >> This is one of the things that is preventing you from being able to >> migrate the device while the interface is administratively down or the >> VF driver is not loaded. > > > From my opinion, if VF driver is not loaded and hardware doesn't start > to work, the device state doesn't need to be migrated. > > We may add a flag for driver to check whether migration happened during it's > down and reinitialize the hardware and clear the flag when system try to put > it up. > > We may add migration core in the Linux kernel and provide some helps > functions to facilitate to add migration support for drivers. > Migration core is in charge to sync status with Qemu. > > Example. > migration_register() > Driver provides > - Callbacks to be called before and after migration or for bad path > - Its irq which it prefers to deal with migration event. 
You would be better off just using function pointers in the pci_driver struct and let the PCI driver registration take care of all that. > migration_event_check() > Driver calls it in the irq handler. Migration core code will check > migration status and call its callbacks when migration happens. No, this is still a bad idea. You haven't addressed what you do when the device has had interrupts disabled such as being in the down state. This is the biggest issue I see with your whole patch set. It requires the driver containing certain changes and being in a certain state. You cannot put those expectations on the guest. You really need to try and move as much of this out to existing functionality as possible. >> >> My thought on all this is that it might make sense to move this >> functionality into a PCI-to-PCI bridge device and make it a >> requirement that all direct-assigned devices have to exist behind that >> device in order to support migration. That way you would be working >> with a directly emulated device that would likely already be >> supporting hot-plug anyway. Then it would just be a matter of coming >> up with a few Qemu specific extensions that you would need to add to >> the device itself. The same approach would likely be portable enough >> that you could achieve it with PCIe as well via the same configuration >> space being present on the upstream side of a PCIe port or maybe a >> PCIe switch of some sort. >> >> It would then be possible to signal via your vendor-specific PCI >> capability on that device that all devices behind this bridge require >> DMA page dirtying, you could use the configuration in addition to the >> interrupt already provided for hot-plug to signal things like when you >> are starting migration, and possibly even just extend the shpc >> functionality so that if this capability is present you have the >> option to pause/resume instead of remove/probe the device in the case >> of certain hot-plug events. 
The fact is there may be some use for a >> pause/resume type approach for PCIe hot-plug in the near future >> anyway. From the sounds of it Apple has required it for all >> Thunderbolt device drivers so that they can halt the device in order >> to shuffle resources around, perhaps we should look at something >> similar for Linux. >> >> The other advantage behind grouping functions on one bridge is things >> like reset domains. The PCI error handling logic will want to be able >> to reset any devices that experienced an error in the event of >> something such as a surprise removal. By grouping all of the devices >> you could disable/reset/enable them as one logical group in the event >> of something such as the "bad path" approach Michael has mentioned. >> > > These sounds we need to add a faked bridge for migration and adding a > driver in the guest for it. It also needs to extend PCI bus/hotplug > driver to do pause/resume other devices, right? > > My concern is still
Re: [Qemu-devel] live migration vs device assignment (motivation)
On 12/11/2015 1:16 AM, Alexander Duyck wrote: On Thu, Dec 10, 2015 at 6:38 AM, Lan, Tianyu wrote: On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote: Ideally, it is able to leave guest driver unmodified but it requires the hypervisor or qemu to aware the device which means we may need a driver in hypervisor or qemu to handle the device on behalf of guest driver. Can you answer the question of when do you use your code - at the start of migration or just before the end? Just before stopping VCPU in this version and inject VF mailbox irq to notify the driver if the irq handler is installed. Qemu side also will check this via the faked PCI migration capability and driver will set the status during device open() or resume() callback. The VF mailbox interrupt is a very bad idea. Really the device should be in a reset state on the other side of a migration. It doesn't make sense to have the interrupt firing if the device is not configured. This is one of the things that is preventing you from being able to migrate the device while the interface is administratively down or the VF driver is not loaded. In my opinion, if the VF driver is not loaded and the hardware hasn't started working, the device state doesn't need to be migrated. We may add a flag for the driver to check whether a migration happened while the interface was down, reinitialize the hardware, and clear the flag when the system tries to bring it up. We may add a migration core to the Linux kernel and provide some helper functions to facilitate adding migration support to drivers. The migration core is in charge of syncing status with Qemu. Example. migration_register() Driver provides - Callbacks to be called before and after migration or for bad path - Its irq which it prefers to deal with migration event. migration_event_check() Driver calls it in the irq handler. Migration core code will check migration status and call its callbacks when migration happens. 
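The migration_register()/migration_event_check() interface sketched above might look roughly like this. It is purely illustrative: no such core exists in the kernel, and every name and field here is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of the proposed migration core.  A driver registers callbacks
 * and the irq it wants migration events delivered on; the core invokes
 * the callbacks when a migration is in flight. */

struct migration_ops {
    void (*pre_migrate)(void *priv);   /* called before migration        */
    void (*post_migrate)(void *priv);  /* called after migration         */
    void (*bad_path)(void *priv);      /* surprise-reset recovery (opt.) */
    int irq;                           /* irq the driver handles this on */
    void *priv;
};

struct migration_ops *registered;
int migration_pending;

int migration_register(struct migration_ops *ops)
{
    if (!ops || !ops->pre_migrate || !ops->post_migrate)
        return -1;                     /* mandatory callbacks missing */
    registered = ops;
    return 0;
}

/* The driver calls this from its irq handler; the core checks migration
 * status and runs the callback when a migration has been signalled. */
void migration_event_check(void)
{
    if (registered && migration_pending) {
        registered->pre_migrate(registered->priv);
        migration_pending = 0;
    }
}

/* Example driver hookup (hypothetical names). */
int pre_calls;
void sample_pre(void *priv)  { (void)priv; pre_calls++; }
void sample_post(void *priv) { (void)priv; }
struct migration_ops sample_ops = { sample_pre, sample_post, NULL, 5, NULL };
```

Note this sketch inherits the weakness Alexander points out in the reply that follows: if the device's interrupts are disabled (interface down), migration_event_check() never runs.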
My thought on all this is that it might make sense to move this functionality into a PCI-to-PCI bridge device and make it a requirement that all direct-assigned devices have to exist behind that device in order to support migration. That way you would be working with a directly emulated device that would likely already be supporting hot-plug anyway. Then it would just be a matter of coming up with a few Qemu specific extensions that you would need to add to the device itself. The same approach would likely be portable enough that you could achieve it with PCIe as well via the same configuration space being present on the upstream side of a PCIe port or maybe a PCIe switch of some sort. It would then be possible to signal via your vendor-specific PCI capability on that device that all devices behind this bridge require DMA page dirtying, you could use the configuration in addition to the interrupt already provided for hot-plug to signal things like when you are starting migration, and possibly even just extend the shpc functionality so that if this capability is present you have the option to pause/resume instead of remove/probe the device in the case of certain hot-plug events. The fact is there may be some use for a pause/resume type approach for PCIe hot-plug in the near future anyway. From the sounds of it Apple has required it for all Thunderbolt device drivers so that they can halt the device in order to shuffle resources around, perhaps we should look at something similar for Linux. The other advantage behind grouping functions on one bridge is things like reset domains. The PCI error handling logic will want to be able to reset any devices that experienced an error in the event of something such as a surprise removal. By grouping all of the devices you could disable/reset/enable them as one logical group in the event of something such as the "bad path" approach Michael has mentioned. 
This sounds like we need to add a fake bridge for migration and to add a driver in the guest for it. It also needs to extend the PCI bus/hotplug driver to pause/resume other devices, right? My concern is still whether we can change the PCI bus/hotplug like that without a spec change. IRQ should be general for any devices and we may extend it for migration. The device driver can also make the decision to support migration or not. It would be great if we could avoid changing the guest; but at least your guest driver changes don't actually seem to be that hardware specific; could your changes actually be moved to generic PCI level so they could be made to work for lots of drivers? It is impossible to use one common solution for all devices unless the PCIE spec documents it clearly and I think one day it will be there. But before that, we need some workarounds in the guest driver to make it work even if it looks ugly. Yes, so far there is no hardware migration support and it's hard to modify bus level code. It also will block implementation on Windows. Please don't assume things. Unless you
Re: [Qemu-devel] live migration vs device assignment (motivation)
On 12/11/2015 12:11 AM, Michael S. Tsirkin wrote: On Thu, Dec 10, 2015 at 10:38:32PM +0800, Lan, Tianyu wrote: On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote: Ideally, it is able to leave guest driver unmodified but it requires the hypervisor or qemu to aware the device which means we may need a driver in hypervisor or qemu to handle the device on behalf of guest driver. Can you answer the question of when do you use your code - at the start of migration or just before the end? Just before stopping VCPU in this version and inject VF mailbox irq to notify the driver if the irq handler is installed. Qemu side also will check this via the faked PCI migration capability and driver will set the status during device open() or resume() callback. Right, this is the "good path" optimization. Whether this buys anything as compared to just sending reset to the device when VCPU is stopped needs to be measured. In any case, we probably do need a way to interrupt driver on destination to make it reconfigure the device - otherwise it might take seconds for it to notice. And a way to make sure driver can handle this surprise reset so we can block migration if it can't. Yes, we need such a way to notify driver about migration status and do reset or restore operation on the destination machine. My original design is to take advantage of device's irq to do that. Driver can tell Qemu that which irq it prefers to handle such task and whether the irq is enabled or bound with handler. We may discuss the detail in the other thread. It would be great if we could avoid changing the guest; but at least your guest driver changes don't actually seem to be that hardware specific; could your changes actually be moved to generic PCI level so they could be made to work for lots of drivers? It is impossible to use one common solution for all devices unless the PCIE spec documents it clearly and i think one day it will be there. 
But before that, we need some workarounds in the guest driver to make it work even if it looks ugly. Yes, so far there is no hardware migration support VT-D supports setting dirty bit in the PTE in hardware. Actually, the current hardware doesn't support this. The VT-d spec documents the dirty bit for first-level translation, which requires devices to support DMA requests with PASID (process address space identifier). Most devices don't support the feature. and it's hard to modify bus level code. Why is it hard? As Yang said, the concern is that the PCI spec doesn't document how to do migration. It also will block implementation on Windows. Implementation of what? We are discussing motivation here, not implementation. E.g. Windows drivers typically support surprise removal; should you use that, you get some working code for free. Just stop worrying about it. Make it work, worry about closed source software later. Dave
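Since hardware dirty-bit support (VT-d first-level translation) is limited to PASID requests, the thread falls back on software tracking: the swiotlb page-dirtying idea from earlier in the discussion. The core of that idea is just a bitmap updated on every bounce-buffer copy back to guest memory. Below is a toy user-space model of that bookkeeping, not the actual swiotlb patch; sizes and names are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of DMA page-dirty tracking: each DMA write-back to the guest
 * marks the touched guest pages in a bitmap, which the hypervisor can
 * later merge into its migration dirty log. */

#define PAGE_SHIFT 12   /* 4 KiB pages */
#define NPAGES     64   /* tiny guest for the example */

uint8_t dirty_bitmap[NPAGES / 8];

static void set_dirty_bit(uint64_t pfn)
{
    dirty_bitmap[pfn / 8] |= (uint8_t)(1u << (pfn % 8));
}

/* Mark every page overlapped by [addr, addr + len) as dirty. */
void dma_mark_dirty(uint64_t addr, uint64_t len)
{
    uint64_t first = addr >> PAGE_SHIFT;
    uint64_t last  = (addr + len - 1) >> PAGE_SHIFT;

    for (uint64_t pfn = first; pfn <= last; pfn++)
        set_dirty_bit(pfn);
}

int page_is_dirty(uint64_t pfn)
{
    return (dirty_bitmap[pfn / 8] >> (pfn % 8)) & 1;
}
```

A transfer of 0x1000 bytes starting at 0x1800 straddles two pages, so both get marked; this is why the range math uses `addr + len - 1` for the last page rather than `addr + len`.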
Re: [Qemu-devel] live migration vs device assignment (motivation)
On Thu, Dec 10, 2015 at 8:11 AM, Michael S. Tsirkin wrote: > On Thu, Dec 10, 2015 at 10:38:32PM +0800, Lan, Tianyu wrote: >> >> >> On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote: >> >>Ideally, it is able to leave guest driver unmodified but it requires the >> >>>hypervisor or qemu to aware the device which means we may need a driver in >> >>>hypervisor or qemu to handle the device on behalf of guest driver. >> >Can you answer the question of when do you use your code - >> >at the start of migration or >> >just before the end? >> >> Just before stopping VCPU in this version and inject VF mailbox irq to >> notify the driver if the irq handler is installed. >> Qemu side also will check this via the faked PCI migration capability >> and driver will set the status during device open() or resume() callback. > > Right, this is the "good path" optimization. Whether this buys anything > as compared to just sending reset to the device when VCPU is stopped > needs to be measured. In any case, we probably do need a way to > interrupt driver on destination to make it reconfigure the device - > otherwise it might take seconds for it to notice. And a way to make > sure driver can handle this surprise reset so we can block migration if > it can't. The question is how do we handle the "bad path"? From what I can tell it seems like we would have to have the dirty page tracking for DMA handled in the host in order to support that. Otherwise we risk corrupting the memory in the guest as there are going to be a few stale pages that end up being in the guest. The easiest way to probably flag a "bad path" migration would be to emulate a Manually-operated Retention Latch being opened and closed on the device. It may even allow us to work with the desire to support a means for doing a pause/resume as that would be a hot-plug event where the latch was never actually opened. 
Basically if the retention latch is released and then re-closed it can be
assumed that the device has lost power and as a result been reset. As such
a normal hot-plug controller would have to reconfigure the device in such
an event. The key bit being that with the power being cycled on the port
the assumption is that the device has lost any existing state, and we
should emulate that as well by clearing any state Qemu might be carrying
such as the shadow of the MSI-X table. In addition we could also signal if
the host supports the dirty page tracking via the IOMMU so if needed the
guest could trigger some sort of memory exception handling due to the risk
of memory corruption.

I would argue that we don't necessarily have to provide a means to
guarantee the driver can support a surprise removal/reset. Worst case
scenario is that it would be equivalent to somebody pulling the plug on an
externally connected PCIe cage in a physical host. I know the Intel
Ethernet drivers have already had to add support for surprise removal due
to the fact that such a scenario can occur on Thunderbolt enabled
platforms. Since it is acceptable for physical hosts to have such an event
occur I think we could support the same type of failure for direct
assigned devices in guests. That would be the one spot where I would say
it is up to the drivers to figure out how they are going to deal with it
since this is something that can occur for any given driver on any given
OS assuming it can be plugged into an externally removable cage.

- Alex
Re: [Qemu-devel] live migration vs device assignment (motivation)
On Thu, Dec 10, 2015 at 6:38 AM, Lan, Tianyu wrote: > > > On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote: >>> >>> Ideally, it is able to leave guest driver unmodified but it requires the >>> >hypervisor or qemu to aware the device which means we may need a driver >>> > in >>> >hypervisor or qemu to handle the device on behalf of guest driver. >> >> Can you answer the question of when do you use your code - >> at the start of migration or >> just before the end? > > > Just before stopping VCPU in this version and inject VF mailbox irq to > notify the driver if the irq handler is installed. > Qemu side also will check this via the faked PCI migration capability > and driver will set the status during device open() or resume() callback. The VF mailbox interrupt is a very bad idea. Really the device should be in a reset state on the other side of a migration. It doesn't make sense to have the interrupt firing if the device is not configured. This is one of the things that is preventing you from being able to migrate the device while the interface is administratively down or the VF driver is not loaded. My thought on all this is that it might make sense to move this functionality into a PCI-to-PCI bridge device and make it a requirement that all direct-assigned devices have to exist behind that device in order to support migration. That way you would be working with a directly emulated device that would likely already be supporting hot-plug anyway. Then it would just be a matter of coming up with a few Qemu specific extensions that you would need to add to the device itself. The same approach would likely be portable enough that you could achieve it with PCIe as well via the same configuration space being present on the upstream side of a PCIe port or maybe a PCIe switch of some sort. 
It would then be possible to signal via your vendor-specific PCI capability
on that device that all devices behind this bridge require DMA page
dirtying, you could use the configuration in addition to the interrupt
already provided for hot-plug to signal things like when you are starting
migration, and possibly even just extend the shpc functionality so that if
this capability is present you have the option to pause/resume instead of
remove/probe the device in the case of certain hot-plug events.

The fact is there may be some use for a pause/resume type approach for
PCIe hot-plug in the near future anyway. From the sounds of it Apple has
required it for all Thunderbolt device drivers so that they can halt the
device in order to shuffle resources around; perhaps we should look at
something similar for Linux.

The other advantage behind grouping functions on one bridge is things like
reset domains. The PCI error handling logic will want to be able to reset
any devices that experienced an error in the event of something such as a
surprise removal. By grouping all of the devices you could
disable/reset/enable them as one logical group in the event of something
such as the "bad path" approach Michael has mentioned.

>> It would be great if we could avoid changing the guest; but at least
>> your guest driver changes don't actually seem to be that hardware
>> specific; could your changes actually be moved to generic PCI level so
>> they could be made to work for lots of drivers?
>>
>>> It is impossible to use one common solution for all devices unless the
>>> PCIE spec documents it clearly and i think one day it will be there.
>>> But before that, we need some workarounds on guest driver to make it
>>> work even it looks ugly.
>
> Yes, so far there is not hardware migration support and it's hard to
> modify bus level code. It also will block implementation on the Windows.

Please don't assume things.
Unless you have hard data from Microsoft that says they want it this way,
let's just try to figure out what works best for us for now, and then we
can start worrying about third-party implementations after we have figured
out a solution that actually works.

- Alex
Re: [Qemu-devel] live migration vs device assignment (motivation)
* Lan, Tianyu (tianyu@intel.com) wrote: > > > On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote: > >>Ideally, it is able to leave guest driver unmodified but it requires the > >>>hypervisor or qemu to aware the device which means we may need a driver in > >>>hypervisor or qemu to handle the device on behalf of guest driver. > >Can you answer the question of when do you use your code - > >at the start of migration or > >just before the end? > > Just before stopping VCPU in this version and inject VF mailbox irq to > notify the driver if the irq handler is installed. > Qemu side also will check this via the faked PCI migration capability > and driver will set the status during device open() or resume() callback. OK, hmm - I can see that would work in some cases; but: a) It wouldn't work if the guest was paused, the management can pause it before starting migration or during migration - so you might need to hook the pause as well; so that's a bit complicated. b) How long does qemu wait for the guest to respond, and what does it do if the guest doesn't respond ? How do we recover? c) How much work does the guest need to do at this point? d) It would be great if we could find a more generic way of telling the guest it's about to migrate rather than via the PCI registers of one device; imagine what happens if you have a few different devices using SR-IOV, we'd have to tell them all with separate interrupts. Perhaps we could use a virtio channel or an ACPI event or something? > >It would be great if we could avoid changing the guest; but at least > >your guest > >driver changes don't actually seem to be that hardware specific; could > >your > >changes actually be moved to generic PCI level so they could be made > >to work for lots of drivers? > >>> > >>>It is impossible to use one common solution for all devices unless the PCIE > >>>spec documents it clearly and i think one day it will be there. 
>> But before that, we need some workarounds on guest driver to make it
>> work even it looks ugly.
>
> Yes, so far there is not hardware migration support and it's hard to
> modify bus level code. It also will block implementation on the Windows.

Well, there was agraf's trick, although that's a lot more complicated at
the qemu level, but it should work with no guest modifications.

Michael's point about dirty page tracking is neat, I think that simplifies
it a bit if it can track dirty pages.

Dave
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [Qemu-devel] live migration vs device assignment (motivation)
On Thu, Dec 10, 2015 at 10:38:32PM +0800, Lan, Tianyu wrote: > > > On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote: > >>Ideally, it is able to leave guest driver unmodified but it requires the > >>>hypervisor or qemu to aware the device which means we may need a driver in > >>>hypervisor or qemu to handle the device on behalf of guest driver. > >Can you answer the question of when do you use your code - > >at the start of migration or > >just before the end? > > Just before stopping VCPU in this version and inject VF mailbox irq to > notify the driver if the irq handler is installed. > Qemu side also will check this via the faked PCI migration capability > and driver will set the status during device open() or resume() callback. Right, this is the "good path" optimization. Whether this buys anything as compared to just sending reset to the device when VCPU is stopped needs to be measured. In any case, we probably do need a way to interrupt driver on destination to make it reconfigure the device - otherwise it might take seconds for it to notice. And a way to make sure driver can handle this surprise reset so we can block migration if it can't. > > > >It would be great if we could avoid changing the guest; but at least > >your guest > >driver changes don't actually seem to be that hardware specific; could > >your > >changes actually be moved to generic PCI level so they could be made > >to work for lots of drivers? > >>> > >>>It is impossible to use one common solution for all devices unless the PCIE > >>>spec documents it clearly and i think one day it will be there. But before > >>>that, we need some workarounds on guest driver to make it work even it > >>>looks > >>>ugly. > > Yes, so far there is not hardware migration support VT-D supports setting dirty bit in the PTE in hardware. > and it's hard to modify > bus level code. Why is it hard? > It also will block implementation on the Windows. Implementation of what? 
We are discussing motivation here, not implementation. E.g. windows drivers typically support surprise removal, should you use that, you get some working code for free. Just stop worrying about it. Make it work, worry about closed source software later. > > Dave
Re: [Qemu-devel] live migration vs device assignment (motivation)
On 12/10/2015 7:41 PM, Dr. David Alan Gilbert wrote:
>> Ideally, it is able to leave guest driver unmodified but it requires the
>> hypervisor or qemu to be aware of the device, which means we may need a
>> driver in hypervisor or qemu to handle the device on behalf of the guest
>> driver.
>
> Can you answer the question of when do you use your code -
> at the start of migration or just before the end?

Just before stopping VCPU in this version, and inject the VF mailbox irq to
notify the driver if the irq handler is installed. The Qemu side also will
check this via the faked PCI migration capability, and the driver will set
the status during the device open() or resume() callback.

> It would be great if we could avoid changing the guest; but at least your
> guest driver changes don't actually seem to be that hardware specific;
> could your changes actually be moved to generic PCI level so they could
> be made to work for lots of drivers?
>
>> It is impossible to use one common solution for all devices unless the
>> PCIE spec documents it clearly and I think one day it will be there. But
>> before that, we need some workarounds in the guest driver to make it
>> work, even if it looks ugly.

Yes, so far there is no hardware migration support, and it's hard to modify
bus-level code. It also will block implementation on Windows.

> Dave
Re: [Qemu-devel] live migration vs device assignment (motivation)
On 2015/12/10 19:41, Dr. David Alan Gilbert wrote:
> * Yang Zhang (yang.zhang...@gmail.com) wrote:
>> On 2015/12/10 18:18, Dr. David Alan Gilbert wrote:
>>> * Lan, Tianyu (tianyu@intel.com) wrote:
>>>> On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:
>>>>> I thought about what this is doing at the high level, and I do have
>>>>> some value in what you are trying to do, but I also think we need to
>>>>> clarify the motivation a bit more. What you are saying is not really
>>>>> what the patches are doing.
>>>>>
>>>>> And with that clearer understanding of the motivation in mind
>>>>> (assuming it actually captures a real need), I would also like to
>>>>> suggest some changes.
>>>>
>>>> Motivation:
>>>> Most current solutions for migration with passthrough devices are
>>>> based on PCI hotplug, but it has side effects and can't work for all
>>>> devices.
>>>>
>>>> For NIC devices:
>>>> The PCI hotplug solution can work around network device migration via
>>>> switching VF and PF.
>>>>
>>>> But switching network interfaces will introduce service downtime.
>>>>
>>>> I tested the service downtime by putting the VF and PV interfaces into
>>>> a bonded interface and pinging the bonded interface during plug and
>>>> unplug of the VF:
>>>> 1) About 100ms when adding the VF
>>>> 2) About 30ms when deleting the VF
>>>>
>>>> It also requires the guest to do switch configuration. These are hard
>>>> for our customers to manage and deploy. To maintain PV performance
>>>> during migration, the host side also needs to assign a VF to the PV
>>>> device. This affects scalability.
>>>>
>>>> These factors block SRIOV NIC passthrough usage in cloud service and
>>>> OPNFV, which require high network performance and stability.
>>>
>>> Right, I'll agree it's hard to do migration of a VM which uses an SRIOV
>>> device; and while I think it should be possible to bond a virtio device
>>> to a VF for networking and then hotplug the SR-IOV device, I agree it's
>>> hard to manage.
>>>
>>>> For other kinds of devices, it's hard to work. We are also adding
>>>> migration support for the QAT (QuickAssist Technology) device.
>>>>
>>>> QAT device use case introduction.
>>>> Server, networking, big data, and storage applications use QuickAssist
>>>> Technology to offload servers from handling compute-intensive
>>>> operations, such as:
>>>> 1) Symmetric cryptography functions including cipher operations and
>>>> authentication operations
>>>> 2) Public key functions including RSA, Diffie-Hellman, and elliptic
>>>> curve cryptography
>>>> 3) Compression and decompression functions including DEFLATE and LZS
>>>>
>>>> PCI hotplug will not work for such devices during migration, and these
>>>> operations will fail when the device is unplugged.
>>>
>>> I don't understand that QAT argument; if the device is purely an
>>> offload engine for performance, then why can't you fall back to doing
>>> the same operations in the VM or in QEMU if the card is unavailable?
>>> The tricky bit is dealing with outstanding operations.
>>>
>>>> So we are trying to implement a new solution which really migrates
>>>> device state to the target machine and won't affect users during
>>>> migration, with low service downtime.
>>>
>>> Right, that's a good aim - the only question is how to do it.
>>>
>>> It looks like this is always going to need some device-specific code;
>>> the question I see is whether that's in:
>>>  1) qemu
>>>  2) the host kernel
>>>  3) the guest kernel driver
>>>
>>> The objections to this series seem to be that it needs changes to (3);
>>> I can see the worry that the guest kernel driver might not get a chance
>>> to run during the right time in migration and it's painful having to
>>> change every guest driver (although your change is small).
>>>
>>> My question is what stage of the migration process do you expect to
>>> tell the guest kernel driver to do this?
>>>
>>> If you do it at the start of the migration, and quiesce the device, the
>>> migration might take a long time (say 30 minutes) - are you intending
>>> the device to be quiesced for this long? And where are you going to
>>> send the traffic? If you are, then do you need to do it via this PCI
>>> trick, or could you just do it via something higher level to quiesce
>>> the device.
>>>
>>> Or are you intending to do it just near the end of the migration? But
>>> then how do we know how long it will take the guest driver to respond?
>>
>> Ideally, it is able to leave the guest driver unmodified, but it
>> requires the hypervisor or qemu to be aware of the device, which means
>> we may need a driver in the hypervisor or qemu to handle the device on
>> behalf of the guest driver.
>
> Can you answer the question of when do you use your code -
> at the start of migration or just before the end?

Tianyu can answer this question. In my initial design, I prefer to put more
modifications in the hypervisor and Qemu, and the only involvement from the
guest driver is how to restore the state after migration. But I don't know
the later implementation since I have left Intel.

> It would be great if we could avoid changing the guest; but at least your
> guest driver changes don't actually seem to be that hardware specific;
> could your changes actually be moved to generic PCI level so they could
> be made to work for lots of drivers?
>
>> It is impossible to use one common solution for all devices unless the
>> PCIE spec documents it clearly and I think one day it will be there.
Re: [Qemu-devel] live migration vs device assignment (motivation)
* Yang Zhang (yang.zhang...@gmail.com) wrote: > On 2015/12/10 18:18, Dr. David Alan Gilbert wrote: > >* Lan, Tianyu (tianyu@intel.com) wrote: > >>On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote: > >>>I thought about what this is doing at the high level, and I do have some > >>>value in what you are trying to do, but I also think we need to clarify > >>>the motivation a bit more. What you are saying is not really what the > >>>patches are doing. > >>> > >>>And with that clearer understanding of the motivation in mind (assuming > >>>it actually captures a real need), I would also like to suggest some > >>>changes. > >> > >>Motivation: > >>Most current solutions for migration with passthough device are based on > >>the PCI hotplug but it has side affect and can't work for all device. > >> > >>For NIC device: > >>PCI hotplug solution can work around Network device migration > >>via switching VF and PF. > >> > >>But switching network interface will introduce service down time. > >> > >>I tested the service down time via putting VF and PV interface > >>into a bonded interface and ping the bonded interface during plug > >>and unplug VF. > >>1) About 100ms when add VF > >>2) About 30ms when del VF > >> > >>It also requires guest to do switch configuration. These are hard to > >>manage and deploy from our customers. To maintain PV performance during > >>migration, host side also needs to assign a VF to PV device. This > >>affects scalability. > >> > >>These factors block SRIOV NIC passthough usage in the cloud service and > >>OPNFV which require network high performance and stability a lot. > > > >Right, that I'll agree it's hard to do migration of a VM which uses > >an SRIOV device; and while I think it should be possible to bond a virtio > >device > >to a VF for networking and then hotplug the SR-IOV device I agree it's hard > >to manage. > > > >>For other kind of devices, it's hard to work. > >>We are also adding migration support for QAT(QuickAssist Technology) device. 
> >> > >>QAT device user case introduction. > >>Server, networking, big data, and storage applications use QuickAssist > >>Technology to offload servers from handling compute-intensive operations, > >>such as: > >>1) Symmetric cryptography functions including cipher operations and > >>authentication operations > >>2) Public key functions including RSA, Diffie-Hellman, and elliptic curve > >>cryptography > >>3) Compression and decompression functions including DEFLATE and LZS > >> > >>PCI hotplug will not work for such devices during migration and these > >>operations will fail when unplug device. > > > >I don't understand that QAT argument; if the device is purely an offload > >engine for performance, then why can't you fall back to doing the > >same operations in the VM or in QEMU if the card is unavailable? > >The tricky bit is dealing with outstanding operations. > > > >>So we are trying implementing a new solution which really migrates > >>device state to target machine and won't affect user during migration > >>with low service down time. > > > >Right, that's a good aim - the only question is how to do it. > > > >It looks like this is always going to need some device-specific code; > >the question I see is whether that's in: > > 1) qemu > > 2) the host kernel > > 3) the guest kernel driver > > > >The objections to this series seem to be that it needs changes to (3); > >I can see the worry that the guest kernel driver might not get a chance > >to run during the right time in migration and it's painful having to > >change every guest driver (although your change is small). > > > >My question is what stage of the migration process do you expect to tell > >the guest kernel driver to do this? > > > > If you do it at the start of the migration, and quiesce the device, > > the migration might take a long time (say 30 minutes) - are you > > intending the device to be quiesced for this long? And where are > > you going to send the traffic? 
> > If you are, then do you need to do it via this PCI trick, or could
> > you just do it via something higher level to quiesce the device.
> >
> > Or are you intending to do it just near the end of the migration?
> > But then how do we know how long it will take the guest driver to
> > respond?
>
> Ideally, it is able to leave guest driver unmodified but it requires the
> hypervisor or qemu to aware the device which means we may need a driver
> in hypervisor or qemu to handle the device on behalf of guest driver.

Can you answer the question of when do you use your code -
at the start of migration or just before the end?

> > It would be great if we could avoid changing the guest; but at least
> > your guest driver changes don't actually seem to be that hardware
> > specific; could your changes actually be moved to generic PCI level so
> > they could be made to work for lots of drivers?
>
> It is impossible to use one common solution for all devices unless the
> PCIE spec documents it clearly and i think one day it will be there. But
> before that, we need some workarounds on guest driver to make it work
> even it looks ugly.
Re: [Qemu-devel] live migration vs device assignment (motivation)
On 2015/12/10 18:18, Dr. David Alan Gilbert wrote:
> * Lan, Tianyu (tianyu@intel.com) wrote:
>> On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:
>>> I thought about what this is doing at the high level, and I do have
>>> some value in what you are trying to do, but I also think we need to
>>> clarify the motivation a bit more. What you are saying is not really
>>> what the patches are doing.
>>>
>>> And with that clearer understanding of the motivation in mind (assuming
>>> it actually captures a real need), I would also like to suggest some
>>> changes.
>>
>> Motivation:
>> Most current solutions for migration with passthrough devices are based
>> on PCI hotplug, but it has side effects and can't work for all devices.
>>
>> For NIC devices:
>> The PCI hotplug solution can work around network device migration via
>> switching VF and PF.
>>
>> But switching network interfaces will introduce service downtime.
>>
>> I tested the service downtime by putting the VF and PV interfaces into a
>> bonded interface and pinging the bonded interface during plug and unplug
>> of the VF:
>> 1) About 100ms when adding the VF
>> 2) About 30ms when deleting the VF
>>
>> It also requires the guest to do switch configuration. These are hard
>> for our customers to manage and deploy. To maintain PV performance
>> during migration, the host side also needs to assign a VF to the PV
>> device. This affects scalability.
>>
>> These factors block SRIOV NIC passthrough usage in cloud service and
>> OPNFV, which require high network performance and stability.
>
> Right, I'll agree it's hard to do migration of a VM which uses an SRIOV
> device; and while I think it should be possible to bond a virtio device
> to a VF for networking and then hotplug the SR-IOV device, I agree it's
> hard to manage.
>
>> For other kinds of devices, it's hard to work. We are also adding
>> migration support for the QAT (QuickAssist Technology) device.
>>
>> QAT device use case introduction.
>> Server, networking, big data, and storage applications use QuickAssist
>> Technology to offload servers from handling compute-intensive
>> operations, such as:
>> 1) Symmetric cryptography functions including cipher operations and
>> authentication operations
>> 2) Public key functions including RSA, Diffie-Hellman, and elliptic
>> curve cryptography
>> 3) Compression and decompression functions including DEFLATE and LZS
>>
>> PCI hotplug will not work for such devices during migration, and these
>> operations will fail when the device is unplugged.
>
> I don't understand that QAT argument; if the device is purely an offload
> engine for performance, then why can't you fall back to doing the same
> operations in the VM or in QEMU if the card is unavailable? The tricky
> bit is dealing with outstanding operations.
>
>> So we are trying to implement a new solution which really migrates
>> device state to the target machine and won't affect users during
>> migration, with low service downtime.
>
> Right, that's a good aim - the only question is how to do it.
>
> It looks like this is always going to need some device-specific code;
> the question I see is whether that's in:
>  1) qemu
>  2) the host kernel
>  3) the guest kernel driver
>
> The objections to this series seem to be that it needs changes to (3); I
> can see the worry that the guest kernel driver might not get a chance to
> run during the right time in migration and it's painful having to change
> every guest driver (although your change is small).
>
> My question is what stage of the migration process do you expect to tell
> the guest kernel driver to do this?
>
> If you do it at the start of the migration, and quiesce the device, the
> migration might take a long time (say 30 minutes) - are you intending
> the device to be quiesced for this long? And where are you going to send
> the traffic? If you are, then do you need to do it via this PCI trick,
> or could you just do it via something higher level to quiesce the
> device.
>
> Or are you intending to do it just near the end of the migration? But
> then how do we know how long it will take the guest driver to respond?

Ideally, it is able to leave the guest driver unmodified, but it requires
the hypervisor or qemu to be aware of the device, which means we may need a
driver in the hypervisor or qemu to handle the device on behalf of the
guest driver.

> It would be great if we could avoid changing the guest; but at least your
> guest driver changes don't actually seem to be that hardware specific;
> could your changes actually be moved to generic PCI level so they could
> be made to work for lots of drivers?

It is impossible to use one common solution for all devices unless the PCIE
spec documents it clearly and I think one day it will be there. But before
that, we need some workarounds in the guest driver to make it work, even if
it looks ugly.

--
best regards
yang
Re: [Qemu-devel] live migration vs device assignment (motivation)
* Lan, Tianyu (tianyu@intel.com) wrote:
> On 12/8/2015 12:50 AM, Michael S. Tsirkin wrote:
>> I thought about what this is doing at the high level, and I do have some
>> value in what you are trying to do, but I also think we need to clarify
>> the motivation a bit more. What you are saying is not really what the
>> patches are doing.
>>
>> And with that clearer understanding of the motivation in mind (assuming
>> it actually captures a real need), I would also like to suggest some
>> changes.
>
> Motivation:
> Most current solutions for migration with passthrough devices are based
> on PCI hotplug, but it has side effects and can't work for all devices.
>
> For NIC devices:
> The PCI hotplug solution can work around network device migration via
> switching VF and PF.
>
> But switching network interfaces will introduce service downtime.
>
> I tested the service downtime by putting the VF and PV interfaces into a
> bonded interface and pinging the bonded interface during plug and unplug
> of the VF:
> 1) About 100ms when adding the VF
> 2) About 30ms when deleting the VF
>
> It also requires the guest to do switch configuration. These are hard for
> our customers to manage and deploy. To maintain PV performance during
> migration, the host side also needs to assign a VF to the PV device. This
> affects scalability.
>
> These factors block SRIOV NIC passthrough usage in cloud service and
> OPNFV, which require high network performance and stability.

Right, I'll agree it's hard to do migration of a VM which uses an SRIOV
device; and while I think it should be possible to bond a virtio device to
a VF for networking and then hotplug the SR-IOV device, I agree it's hard
to manage.

> For other kinds of devices, it's hard to work. We are also adding
> migration support for the QAT (QuickAssist Technology) device.
>
> QAT device use case introduction.
> Server, networking, big data, and storage applications use QuickAssist
> Technology to offload servers from handling compute-intensive operations,
> such as:
> 1) Symmetric cryptography functions including cipher operations and
> authentication operations
> 2) Public key functions including RSA, Diffie-Hellman, and elliptic curve
> cryptography
> 3) Compression and decompression functions including DEFLATE and LZS
>
> PCI hotplug will not work for such devices during migration, and these
> operations will fail when the device is unplugged.

I don't understand that QAT argument; if the device is purely an offload
engine for performance, then why can't you fall back to doing the same
operations in the VM or in QEMU if the card is unavailable? The tricky bit
is dealing with outstanding operations.

> So we are trying to implement a new solution which really migrates device
> state to the target machine and won't affect users during migration, with
> low service downtime.

Right, that's a good aim - the only question is how to do it.

It looks like this is always going to need some device-specific code; the
question I see is whether that's in:
  1) qemu
  2) the host kernel
  3) the guest kernel driver

The objections to this series seem to be that it needs changes to (3); I
can see the worry that the guest kernel driver might not get a chance to
run during the right time in migration and it's painful having to change
every guest driver (although your change is small).

My question is what stage of the migration process do you expect to tell
the guest kernel driver to do this?

If you do it at the start of the migration, and quiesce the device, the
migration might take a long time (say 30 minutes) - are you intending the
device to be quiesced for this long? And where are you going to send the
traffic? If you are, then do you need to do it via this PCI trick, or
could you just do it via something higher level to quiesce the device.

Or are you intending to do it just near the end of the migration? But then
how do we know how long it will take the guest driver to respond?

It would be great if we could avoid changing the guest; but at least your
guest driver changes don't actually seem to be that hardware specific;
could your changes actually be moved to generic PCI level so they could be
made to work for lots of drivers?

Dave
--
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK