Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On Wed, Dec 9, 2015 at 1:28 AM, Lan, Tianyu wrote: > > > On 12/8/2015 1:12 AM, Alexander Duyck wrote: >> >> On Mon, Dec 7, 2015 at 7:40 AM, Lan, Tianyu wrote: >>> >>> On 12/5/2015 1:07 AM, Alexander Duyck wrote: > > > > We still need to support Windows guests for migration and this is why > our > patches keep all changes in the driver, since it's impossible to change > the Windows kernel. That is a poor argument. I highly doubt Microsoft is interested in having to modify all of the drivers that will support direct assignment in order to support migration. They would likely request something similar to what I have in that they will want a way to do DMA tracking with minimal modification required to the drivers. >>> >>> >>> >>> This totally depends on the NIC or other devices' vendors and they >>> should make the decision to support migration or not. If yes, they would >>> modify the driver. >> >> >> Having to modify every driver that wants to support live migration is >> a bit much. In addition I don't see this being limited only to NIC >> devices. You can direct assign a number of different devices, your >> solution cannot be specific to NICs. > > > We are also adding such migration support for the QAT device, so our > solution will not be limited to NICs. This is just the beginning. Agreed, but still QAT is networking related. My advice would be to look at something else that works from within a different subsystem such as storage. All I am saying is that your solution is very networking centric. > We can't limit users to only running Linux guests. So the migration feature > should work for both Windows and Linux guests. Right now what your solution is doing is to limit things so that only the Intel NICs can support this since it will require driver modification across the board. Instead what I have proposed should make it so that once you have done the work there should be very little work that has to be done on your port to support any device. >> >>> If the target is just to call suspend/resume during migration, the feature will >>> be meaningless. In most cases we don't want to affect the user much during migration, >>> and so the service downtime is vital. Our target is to apply >>> SR-IOV NIC passthrough to cloud service and NFV (network functions >>> virtualization) projects which are sensitive to network performance >>> and stability. In my opinion, we should give the device >>> driver a chance to implement its own migration job. Call the suspend and resume >>> callbacks in the driver if it doesn't care about performance during >>> migration. >> >> >> The suspend/resume callback should be efficient in terms of time. >> After all we don't want the system to stall for a long period of time >> when it should be either running or asleep. Having it burn cycles in >> a power state limbo doesn't do anyone any good. If nothing else maybe >> it will help to push the vendors to speed up those functions which >> then benefit migration and the system sleep states. > > > If we can benefit both migration and suspend, that would be wonderful. > But migration and system PM are still different. Just for example, the > driver doesn't need to put the device into a deep D-state during migration > (the host can do this after migration), while it's essential for > system sleep. PCI config space and interrupt configuration are emulated by > Qemu, and Qemu can migrate this configuration to the new machine. The driver > doesn't need to deal with such things. So I think migration still needs a > different callback or different code path than device suspend/resume.
SR-IOV devices are considered to be in D3 as soon as you clear the bus master enable bit. They don't actually have a PCIe power management block in their configuration space. The advantage of the suspend/resume approach is that the D0->D3->D0 series of transitions should trigger a PCIe reset on the device. As such the resume call is capable of fully reinitializing a device. As far as migrating the interrupts themselves goes, moving live interrupts is problematic. You are more likely to throw them out of sync since the state of the device will not match the state of what you migrated for things like the pending bit array, so if there is a device that actually depends on those bits you might run into issues. > Another concern is that we have to rework the PM core or the PCI bus driver > to call suspend/resume for passthrough devices during migration. This > also blocks implementing the feature on Windows. If I am not mistaken the Windows drivers have a similar feature that is called when you disable or enable an interface. I believe the motivation for using D3 when a device has been disabled is to save power on the system since in D3 the device should be in its lowest power state. >> >> Also you keep assuming you can keep the device running while you do >> the migration and you can't. You are going to corrupt the memory if >> you do, and you have yet to provide any means to explain how you are >> going to solve that.
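To make the suspend/resume idea above concrete, here is a minimal sketch of how a VF driver's existing suspend/resume work could be reused to quiesce and reinitialize the function around a migration. The vf_drv_suspend()/vf_drv_resume() helpers stand in for driver-specific code (the ixgbevf_suspend() path mentioned elsewhere in the thread is the concrete example); the migration hooks themselves are hypothetical, not an existing kernel interface.

#include <linux/pci.h>

/* Placeholders for the driver's existing suspend/resume work
 * (ring teardown/reinit, interrupt disable/enable, and so on). */
extern int vf_drv_suspend(struct pci_dev *pdev);
extern int vf_drv_resume(struct pci_dev *pdev);

/* Quiesce the VF before the final migration pass. */
static int vf_quiesce_for_migration(struct pci_dev *pdev)
{
        int err = vf_drv_suspend(pdev);

        if (err)
                return err;

        /* Clearing bus master enable halts DMA and MSI-X generation;
         * as noted above, for a VF this is effectively "D3". */
        pci_clear_master(pdev);
        return 0;
}

/* Bring the VF back, on the destination or after an aborted migration. */
static int vf_restart_after_migration(struct pci_dev *pdev)
{
        pci_set_master(pdev);

        /* The resume path is expected to fully reinitialize the function,
         * just as it would after a real D0->D3->D0 transition. */
        return vf_drv_resume(pdev);
}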
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On 12/9/2015 7:28 PM, Michael S. Tsirkin wrote: > I remember reading that it's possible to implement a bus driver on Windows if required. But basically I don't see how Windows can be relevant to discussing guest driver patches. That discussion probably belongs on the qemu mailing list, not on lkml. I am not sure whether we can write a bus driver for Windows to support migration. But I think device vendors who want to support migration will improve their drivers if we provide such a framework in the hypervisor that only requires them to change their drivers.
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On Wed, Dec 09, 2015 at 07:19:15PM +0800, Lan, Tianyu wrote: > On 12/9/2015 6:37 PM, Michael S. Tsirkin wrote: > >On Sat, Dec 05, 2015 at 12:32:00AM +0800, Lan, Tianyu wrote: > >>Hi Michael & Alexander: > >>Thanks a lot for your comments and suggestions. > > > >It's nice that it's appreciated, but you then go on and ignore > >all that I have written here: > >https://www.mail-archive.com/kvm@vger.kernel.org/msg123826.html > > > > No, I will reply to it separately and, per your suggestion, split it into > 3 threads. > > >>We still need to support Windows guests for migration and this is why our > >>patches keep all changes in the driver since it's impossible to change > >>the Windows kernel. > > > >This is not a reasonable argument. It makes no sense to duplicate code > >on Linux because you must duplicate code on Windows. Let's assume you > >must do it in the driver on Windows because Windows has closed source > >drivers. What does it matter? Linux can still do it as part of the DMA API > >and have it apply to all drivers. > > > > Sure. Duplicated code should be encapsulated and made reusable > by other drivers, just like the dummy write part you mentioned. > > I meant the framework should not require changes to Windows kernel code > (such as the PM core or PCI bus driver), since this would block implementation on > Windows. I remember reading that it's possible to implement a bus driver on Windows if required. But basically I don't see how Windows can be relevant to discussing guest driver patches. That discussion probably belongs on the qemu mailing list, not on lkml. > I think it's not a problem to duplicate code in the Windows drivers. > > >>Following is my idea to do DMA tracking. > >> > >>Inject an event to the VF driver after the memory iterate stage > >>and before stopping the VCPU, and then the VF driver marks all in-use > >>DMA memory dirty. Newly allocated pages also need to > >>be marked dirty before stopping the VCPU. All dirty memory > >>in this time slot will be migrated in the stop-and-copy > >>stage. We also need to make sure to disable the VF by clearing the > >>bus master enable bit for the VF before migrating this memory. > >> > >>The DMA pages allocated by the VF driver also need to reserve space > >>for a dummy write. > > > >I suggested ways to do it all in the hypervisor without driver hacks, or > >hide it within the DMA API without the need to reserve extra space. Both > >approaches seem much cleaner. > > > > This sounds reasonable. We can discuss it in detail in a separate thread.
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On 12/9/2015 6:37 PM, Michael S. Tsirkin wrote: On Sat, Dec 05, 2015 at 12:32:00AM +0800, Lan, Tianyu wrote: Hi Michael & Alexander: Thanks a lot for your comments and suggestions. It's nice that it's appreciated, but you then go on and ignore all that I have written here: https://www.mail-archive.com/kvm@vger.kernel.org/msg123826.html No, I will reply it separately and according your suggestion to snip it into 3 thread. We still need to support Windows guest for migration and this is why our patches keep all changes in the driver since it's impossible to change Windows kernel. This is not a reasonable argument. It makes no sense to duplicate code on Linux because you must duplicate code on Windows. Let's assume you must do it in the driver on windows because windows has closed source drivers. What does it matter? Linux can still do it as part of DMA API and have it apply to all drivers. Sure. Duplicated code should be encapsulated and make it able to reuse by other drivers. Just like you said the dummy write part. I meant the framework should not require to change Windows kernel code (such as PM core or PCI bus driver)and this will block implementation on the Windows. I think it's not problem to duplicate code in the Windows drivers. Following is my idea to do DMA tracking. Inject event to VF driver after memory iterate stage and before stop VCPU and then VF driver marks dirty all using DMA memory. The new allocated pages also need to be marked dirty before stopping VCPU. All dirty memory in this time slot will be migrated until stop-and-copy stage. We also need to make sure to disable VF via clearing the bus master enable bit for VF before migrating these memory. The dma page allocated by VF driver also needs to reserve space to do dummy write. I suggested ways to do it all in the hypervisor without driver hacks, or hide it within DMA API without need to reserve extra space. Both approaches seem much cleaner. This sounds reasonable. We may discuss it detail in the separate thread.
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On Sat, Dec 05, 2015 at 12:32:00AM +0800, Lan, Tianyu wrote: > Hi Michael & Alexander: > Thanks a lot for your comments and suggestions. It's nice that it's appreciated, but you then go on and ignore all that I have written here: https://www.mail-archive.com/kvm@vger.kernel.org/msg123826.html > We still need to support Windows guest for migration and this is why our > patches keep all changes in the driver since it's impossible to change > Windows kernel. This is not a reasonable argument. It makes no sense to duplicate code on Linux because you must duplicate code on Windows. Let's assume you must do it in the driver on windows because windows has closed source drivers. What does it matter? Linux can still do it as part of DMA API and have it apply to all drivers. > Following is my idea to do DMA tracking. > > Inject event to VF driver after memory iterate stage > and before stop VCPU and then VF driver marks dirty all > using DMA memory. The new allocated pages also need to > be marked dirty before stopping VCPU. All dirty memory > in this time slot will be migrated until stop-and-copy > stage. We also need to make sure to disable VF via clearing the > bus master enable bit for VF before migrating these memory. > > The dma page allocated by VF driver also needs to reserve space > to do dummy write. I suggested ways to do it all in the hypervisor without driver hacks, or hide it within DMA API without need to reserve extra space. Both approaches seem much cleaner. -- MST
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On 12/8/2015 1:12 AM, Alexander Duyck wrote: On Mon, Dec 7, 2015 at 7:40 AM, Lan, Tianyu wrote: On 12/5/2015 1:07 AM, Alexander Duyck wrote: We still need to support Windows guests for migration and this is why our patches keep all changes in the driver since it's impossible to change the Windows kernel. That is a poor argument. I highly doubt Microsoft is interested in having to modify all of the drivers that will support direct assignment in order to support migration. They would likely request something similar to what I have in that they will want a way to do DMA tracking with minimal modification required to the drivers. This totally depends on the NIC or other devices' vendors and they should make the decision to support migration or not. If yes, they would modify the driver. Having to modify every driver that wants to support live migration is a bit much. In addition I don't see this being limited only to NIC devices. You can direct assign a number of different devices, your solution cannot be specific to NICs. We are also adding such migration support for the QAT device, so our solution will not be limited to NICs. This is just the beginning. We can't limit users to only running Linux guests. So the migration feature should work for both Windows and Linux guests. If the target is just to call suspend/resume during migration, the feature will be meaningless. In most cases we don't want to affect the user much during migration, and so the service downtime is vital. Our target is to apply SR-IOV NIC passthrough to cloud service and NFV (network functions virtualization) projects which are sensitive to network performance and stability. In my opinion, we should give the device driver a chance to implement its own migration job. Call the suspend and resume callbacks in the driver if it doesn't care about performance during migration. The suspend/resume callback should be efficient in terms of time. After all we don't want the system to stall for a long period of time when it should be either running or asleep. Having it burn cycles in a power state limbo doesn't do anyone any good. If nothing else maybe it will help to push the vendors to speed up those functions which then benefit migration and the system sleep states. If we can benefit both migration and suspend, that would be wonderful. But migration and system PM are still different. Just for example, the driver doesn't need to put the device into a deep D-state during migration (the host can do this after migration), while it's essential for system sleep. PCI config space and interrupt configuration are emulated by Qemu, and Qemu can migrate this configuration to the new machine. The driver doesn't need to deal with such things. So I think migration still needs a different callback or different code path than device suspend/resume. Another concern is that we have to rework the PM core or the PCI bus driver to call suspend/resume for passthrough devices during migration. This also blocks implementing the feature on Windows. Also you keep assuming you can keep the device running while you do the migration and you can't. You are going to corrupt the memory if you do, and you have yet to provide any means to explain how you are going to solve that. The main problem is the DMA tracking issue. I will re-post my solution in a new thread for discussion. If there is no way to mark a DMA page dirty while DMA is enabled, we have to stop DMA for a short time to do that at the last stage. Following is my idea to do DMA tracking.
Inject an event to the VF driver after the memory iterate stage and before stopping the VCPU, and then the VF driver marks all in-use DMA memory dirty. Newly allocated pages also need to be marked dirty before stopping the VCPU. All dirty memory in this time slot will be migrated in the stop-and-copy stage. We also need to make sure to disable the VF by clearing the bus master enable bit for the VF before migrating this memory. The ordering of your explanation here doesn't quite work. What needs to happen is that you have to disable DMA and then mark the pages as dirty. What the disabling of the BME does is signal to the hypervisor that the device is now stopped. The ixgbevf_suspend call already supported by the driver is almost exactly what is needed to take care of something like this. This is why I hope to reserve a piece of space in the DMA page to do a dummy write. This can help to mark the page dirty while not requiring DMA to stop and not racing with the DMA data. You can't and it will still race. What concerns me is that your patches and the document you referenced earlier show a considerable lack of understanding about how DMA and device drivers work. There is a reason why device drivers have so many memory barriers and the like in them. The fact is when you have a CPU and a device both accessing memory things have to be done in a very specific order and you cannot violate that. If you have a contiguous block of memory you expect the device to write into you cannot just poke a hole in it. Such a situation is not supported by any hardware that I am aware of.
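For illustration, a minimal sketch of the ordering Alexander insists on above: bus mastering is cleared first, and only then are the DMA pages reported dirty. The vf_dma_page bookkeeping and mark_page_dirty_for_migration() are placeholders for whatever dirty-logging channel the hypervisor ends up exposing; they are not existing APIs.

#include <linux/list.h>
#include <linux/mm.h>
#include <linux/pci.h>

/* Illustrative bookkeeping for a page the driver has mapped for DMA. */
struct vf_dma_page {
        struct list_head list;
        struct page *page;
};

/* Hypothetical hook into the hypervisor's dirty-page log. */
extern void mark_page_dirty_for_migration(struct page *page);

static void vf_stop_dma_then_mark_dirty(struct pci_dev *pdev,
                                        struct list_head *dma_pages)
{
        struct vf_dma_page *dp;

        /* 1. Stop the device from writing to guest memory. */
        pci_clear_master(pdev);

        /* 2. Only now is it safe to report the pages dirty. In the
         *    opposite order the device could write to a page after it
         *    has already been copied, corrupting memory on the
         *    destination. */
        list_for_each_entry(dp, dma_pages, list)
                mark_page_dirty_for_migration(dp->page);
}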
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On Mon, Dec 7, 2015 at 9:39 AM, Michael S. Tsirkin wrote: > On Mon, Dec 07, 2015 at 09:12:08AM -0800, Alexander Duyck wrote: >> On Mon, Dec 7, 2015 at 7:40 AM, Lan, Tianyu wrote: >> > On 12/5/2015 1:07 AM, Alexander Duyck wrote: >> > If can't do that, we have to stop DMA in a short time to mark all dma >> > pages dirty and then reenable it. I am not sure how much we can get by >> > this way to track all DMA memory with device running during migration. I >> > need to do some tests and compare results with stop DMA diretly at last >> > stage during migration. >> >> We have to halt the DMA before we can complete the migration. So >> please feel free to test this. >> >> In addition I still feel you would be better off taking this in >> smaller steps. I still say your first step would be to come up with a >> generic solution for the dirty page tracking like the dma_mark_clean() >> approach I had mentioned earlier. If I get time I might try to take >> care of it myself later this week since you don't seem to agree with >> that approach. > > Or even try to look at the dirty bit in the VT-D PTEs > on the host. See the mail I have just sent. > Might be slower, or might be faster, but is completely > transparent. I just saw it and I am looking over the VTd spec now. It looks like there might be some performance impacts if software is changing the PTEs since then the VTd harwdare cannot cache them. I still have to do some more reading though so I can fully understand the impacts. >> >> >> >> The question is how we would go about triggering it. I really don't >> >> think the PCI configuration space approach is the right idea. >> >> I wonder >> >> if we couldn't get away with some sort of ACPI event instead. We >> >> already require ACPI support in order to shut down the system >> >> gracefully, I wonder if we couldn't get away with something similar in >> >> order to suspend/resume the direct assigned devices gracefully. >> >> >> > >> > I don't think there is such events in the current spec. >> > Otherwise, There are two kinds of suspend/resume callbacks. >> > 1) System suspend/resume called during S2RAM and S2DISK. >> > 2) Runtime suspend/resume called by pm core when device is idle. >> > If you want to do what you mentioned, you have to change PM core and >> > ACPI spec. >> >> The thought I had was to somehow try to move the direct assigned >> devices into their own power domain and then simulate a AC power event >> where that domain is switched off. However I don't know if there are >> ACPI events to support that since the power domain code currently only >> appears to be in use for runtime power management. >> >> That had also given me the thought to look at something like runtime >> power management for the VFs. We would need to do a runtime >> suspend/resume. The only problem is I don't know if there is any way >> to get the VFs to do a quick wakeup. It might be worthwhile looking >> at trying to check with the ACPI experts out there to see if there is >> anything we can do as bypassing having to use the configuration space >> mechanism to signal this would definitely be worth it. > > I don't much like this idea because it relies on the > device being exactly the same across source/destination. > After all, this is always true for suspend/resume. > Most users do not have control over this, and you would > often get sightly different versions of firmware, > etc without noticing. The original code was operating on that assumption as well. 
That is kind of why I suggested suspend/resume rather than reinventing the wheel. > I think we should first see how far along we can get > by doing a full device reset, and only carrying over > high level state such as IP, MAC, ARP cache etc. One advantage of the suspend/resume approach is that it is compatible with a full reset. The suspend/resume approach assumes the device goes through a D0->D3->D0 reset as a part of transitioning between the system states. I do admit though that the PCI spec says you aren't supposed to be hot-swapping devices while the system is in a sleep state so odds are you would encounter issues if the device changed in any significant way. >> >>> The dma page allocated by VF driver also needs to reserve space >> >>> to do dummy write. >> >> >> >> >> >> No, this will not work. If for example you have a VF driver allocating >> >> memory for a 9K receive how will that work? It isn't as if you can poke >> >> a hole in the contiguous memory. >> >> This is the bit that makes your "poke a hole" solution not portable to >> other drivers. I don't know if you overlooked it but for many NICs >> jumbo frames means using large memory allocations to receive the data. >> That is the way ixgbevf was up until about a year ago so you cannot >> expect all the drivers that will want migration support to allow a >> space for you to write to. In addition some storage drivers have to >> map an entire page, that means there is no room for a hole the
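For reference, a host-side sketch of the VT-d idea being floated above: harvest the dirty bit from the second-level page-table entries instead of asking the guest to log writes. This is not an existing kernel interface; the bit position is an assumption to be checked against the VT-d specification (second-level accessed/dirty bits are an optional capability), and clearing the bit has to be followed by an IOTLB flush so the hardware does not keep using a cached copy, which is likely where the performance impact mentioned above comes from.

#include <linux/types.h>
#include <linux/compiler.h>
#include <linux/atomic.h>

/* Assumed dirty-bit position in a VT-d second-level PTE; verify against
 * the spec revision and the second-level A/D capability before use. */
#define SL_PTE_DIRTY    (1ULL << 9)

/* Atomically test and clear the dirty bit of one second-level PTE,
 * without racing against concurrent hardware A/D updates. */
static bool sl_pte_test_and_clear_dirty(u64 *sl_pte)
{
        u64 old, new;

        do {
                old = READ_ONCE(*sl_pte);
                if (!(old & SL_PTE_DIRTY))
                        return false;
                new = old & ~SL_PTE_DIRTY;
        } while (cmpxchg64(sl_pte, old, new) != old);

        /* Caller still needs to flush the IOTLB for this range before
         * trusting the cleared bit. */
        return true;
}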
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On Mon, Dec 07, 2015 at 09:12:08AM -0800, Alexander Duyck wrote: > On Mon, Dec 7, 2015 at 7:40 AM, Lan, Tianyu wrote: > > On 12/5/2015 1:07 AM, Alexander Duyck wrote: > >>> > >>> > >>> We still need to support Windows guest for migration and this is why our > >>> patches keep all changes in the driver since it's impossible to change > >>> Windows kernel. > >> > >> > >> That is a poor argument. I highly doubt Microsoft is interested in > >> having to modify all of the drivers that will support direct assignment > >> in order to support migration. They would likely request something > >> similar to what I have in that they will want a way to do DMA tracking > >> with minimal modification required to the drivers. > > > > > > This totally depends on the NIC or other devices' vendors and they > > should make decision to support migration or not. If yes, they would > > modify driver. > > Having to modify every driver that wants to support live migration is > a bit much. In addition I don't see this being limited only to NIC > devices. You can direct assign a number of different devices, your > solution cannot be specific to NICs. > > > If just target to call suspend/resume during migration, the feature will > > be meaningless. Most cases don't want to affect user during migration > > a lot and so the service down time is vital. Our target is to apply > > SRIOV NIC passthough to cloud service and NFV(network functions > > virtualization) projects which are sensitive to network performance > > and stability. From my opinion, We should give a change for device > > driver to implement itself migration job. Call suspend and resume > > callback in the driver if it doesn't care the performance during migration. > > The suspend/resume callback should be efficient in terms of time. > After all we don't want the system to stall for a long period of time > when it should be either running or asleep. Having it burn cycles in > a power state limbo doesn't do anyone any good. If nothing else maybe > it will help to push the vendors to speed up those functions which > then benefit migration and the system sleep states. > > Also you keep assuming you can keep the device running while you do > the migration and you can't. You are going to corrupt the memory if > you do, and you have yet to provide any means to explain how you are > going to solve that. > > > > > >> > >>> Following is my idea to do DMA tracking. > >>> > >>> Inject event to VF driver after memory iterate stage > >>> and before stop VCPU and then VF driver marks dirty all > >>> using DMA memory. The new allocated pages also need to > >>> be marked dirty before stopping VCPU. All dirty memory > >>> in this time slot will be migrated until stop-and-copy > >>> stage. We also need to make sure to disable VF via clearing the > >>> bus master enable bit for VF before migrating these memory. > >> > >> > >> The ordering of your explanation here doesn't quite work. What needs to > >> happen is that you have to disable DMA and then mark the pages as dirty. > >> What the disabling of the BME does is signal to the hypervisor that > >> the device is now stopped. The ixgbevf_suspend call already supported > >> by the driver is almost exactly what is needed to take care of something > >> like this. > > > > > > This is why I hope to reserve a piece of space in the dma page to do dummy > > write. This can help to mark page dirty while not require to stop DMA and > > not race with DMA data. > > You can't and it will still race. 
What concerns me is that your > patches and the document you referenced earlier show a considerable > lack of understanding about how DMA and device drivers work. There is > a reason why device drivers have so many memory barriers and the like > in them. The fact is when you have CPU and a device both accessing > memory things have to be done in a very specific order and you cannot > violate that. > > If you have a contiguous block of memory you expect the device to > write into you cannot just poke a hole in it. Such a situation is not > supported by any hardware that I am aware of. > > As far as writing to dirty the pages it only works so long as you halt > the DMA and then mark the pages dirty. It has to be in that order. > Any other order will result in data corruption and I am sure the NFV > customers definitely don't want that. > > > If can't do that, we have to stop DMA in a short time to mark all dma > > pages dirty and then reenable it. I am not sure how much we can get by > > this way to track all DMA memory with device running during migration. I > > need to do some tests and compare results with stop DMA diretly at last > > stage during migration. > > We have to halt the DMA before we can complete the migration. So > please feel free to test this. > > In addition I still feel you would be better off taking this in > smaller steps. I still say your first step would be to come up with a > generic solution for the dirty
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On Mon, Dec 7, 2015 at 7:40 AM, Lan, Tianyu wrote: > On 12/5/2015 1:07 AM, Alexander Duyck wrote: >>> >>> >>> We still need to support Windows guest for migration and this is why our >>> patches keep all changes in the driver since it's impossible to change >>> Windows kernel. >> >> >> That is a poor argument. I highly doubt Microsoft is interested in >> having to modify all of the drivers that will support direct assignment >> in order to support migration. They would likely request something >> similar to what I have in that they will want a way to do DMA tracking >> with minimal modification required to the drivers. > > > This totally depends on the NIC or other devices' vendors and they > should make decision to support migration or not. If yes, they would > modify driver. Having to modify every driver that wants to support live migration is a bit much. In addition I don't see this being limited only to NIC devices. You can direct assign a number of different devices, your solution cannot be specific to NICs. > If just target to call suspend/resume during migration, the feature will > be meaningless. Most cases don't want to affect user during migration > a lot and so the service down time is vital. Our target is to apply > SRIOV NIC passthough to cloud service and NFV(network functions > virtualization) projects which are sensitive to network performance > and stability. From my opinion, We should give a change for device > driver to implement itself migration job. Call suspend and resume > callback in the driver if it doesn't care the performance during migration. The suspend/resume callback should be efficient in terms of time. After all we don't want the system to stall for a long period of time when it should be either running or asleep. Having it burn cycles in a power state limbo doesn't do anyone any good. If nothing else maybe it will help to push the vendors to speed up those functions which then benefit migration and the system sleep states. Also you keep assuming you can keep the device running while you do the migration and you can't. You are going to corrupt the memory if you do, and you have yet to provide any means to explain how you are going to solve that. > >> >>> Following is my idea to do DMA tracking. >>> >>> Inject event to VF driver after memory iterate stage >>> and before stop VCPU and then VF driver marks dirty all >>> using DMA memory. The new allocated pages also need to >>> be marked dirty before stopping VCPU. All dirty memory >>> in this time slot will be migrated until stop-and-copy >>> stage. We also need to make sure to disable VF via clearing the >>> bus master enable bit for VF before migrating these memory. >> >> >> The ordering of your explanation here doesn't quite work. What needs to >> happen is that you have to disable DMA and then mark the pages as dirty. >> What the disabling of the BME does is signal to the hypervisor that >> the device is now stopped. The ixgbevf_suspend call already supported >> by the driver is almost exactly what is needed to take care of something >> like this. > > > This is why I hope to reserve a piece of space in the dma page to do dummy > write. This can help to mark page dirty while not require to stop DMA and > not race with DMA data. You can't and it will still race. What concerns me is that your patches and the document you referenced earlier show a considerable lack of understanding about how DMA and device drivers work. There is a reason why device drivers have so many memory barriers and the like in them. 
The fact is when you have a CPU and a device both accessing memory things have to be done in a very specific order and you cannot violate that. If you have a contiguous block of memory you expect the device to write into you cannot just poke a hole in it. Such a situation is not supported by any hardware that I am aware of. As far as writing to dirty the pages it only works so long as you halt the DMA and then mark the pages dirty. It has to be in that order. Any other order will result in data corruption and I am sure the NFV customers definitely don't want that. > If we can't do that, we have to stop DMA for a short time to mark all DMA > pages dirty and then re-enable it. I am not sure how much we can gain by > tracking all DMA memory this way with the device running during migration. I > need to do some tests and compare results with stopping DMA directly at the last > stage of migration. We have to halt the DMA before we can complete the migration. So please feel free to test this. In addition I still feel you would be better off taking this in smaller steps. I still say your first step would be to come up with a generic solution for the dirty page tracking like the dma_mark_clean() approach I had mentioned earlier. If I get time I might try to take care of it myself later this week since you don't seem to agree with that approach. >> >> The question is how we would go about triggering it. I really don't think the PCI configuration space approach is the right idea.
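For illustration, the kind of dma_mark_clean()-style helper referred to above, using the write-back-the-same-value trick discussed elsewhere in the thread so the data is untouched but the page still shows up in the hypervisor's dirty log. This is a sketch of the approach under discussion, not the actual kernel implementation; it assumes page-aligned buffers and would be called from the dma_unmap and dma_sync-for-cpu paths of a virtualized x86 guest.

#include <linux/mm.h>
#include <linux/compiler.h>
#include <linux/atomic.h>

static void dma_mark_pages_dirty(void *addr, size_t size)
{
        unsigned long off;

        for (off = 0; off < size; off += PAGE_SIZE) {
                unsigned long *p = addr + off;
                unsigned long x = READ_ONCE(*p);

                /* A store as far as dirty tracking is concerned,
                 * a no-op as far as the buffer contents are concerned. */
                cmpxchg(p, x, x);
        }
}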
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On 12/5/2015 1:07 AM, Alexander Duyck wrote: We still need to support Windows guest for migration and this is why our patches keep all changes in the driver since it's impossible to change Windows kernel. That is a poor argument. I highly doubt Microsoft is interested in having to modify all of the drivers that will support direct assignment in order to support migration. They would likely request something similar to what I have in that they will want a way to do DMA tracking with minimal modification required to the drivers. This totally depends on the NIC or other devices' vendors and they should make decision to support migration or not. If yes, they would modify driver. If just target to call suspend/resume during migration, the feature will be meaningless. Most cases don't want to affect user during migration a lot and so the service down time is vital. Our target is to apply SRIOV NIC passthough to cloud service and NFV(network functions virtualization) projects which are sensitive to network performance and stability. From my opinion, We should give a change for device driver to implement itself migration job. Call suspend and resume callback in the driver if it doesn't care the performance during migration. Following is my idea to do DMA tracking. Inject event to VF driver after memory iterate stage and before stop VCPU and then VF driver marks dirty all using DMA memory. The new allocated pages also need to be marked dirty before stopping VCPU. All dirty memory in this time slot will be migrated until stop-and-copy stage. We also need to make sure to disable VF via clearing the bus master enable bit for VF before migrating these memory. The ordering of your explanation here doesn't quite work. What needs to happen is that you have to disable DMA and then mark the pages as dirty. What the disabling of the BME does is signal to the hypervisor that the device is now stopped. The ixgbevf_suspend call already supported by the driver is almost exactly what is needed to take care of something like this. This is why I hope to reserve a piece of space in the dma page to do dummy write. This can help to mark page dirty while not require to stop DMA and not race with DMA data. If can't do that, we have to stop DMA in a short time to mark all dma pages dirty and then reenable it. I am not sure how much we can get by this way to track all DMA memory with device running during migration. I need to do some tests and compare results with stop DMA diretly at last stage during migration. The question is how we would go about triggering it. I really don't think the PCI configuration space approach is the right idea. I wonder if we couldn't get away with some sort of ACPI event instead. We already require ACPI support in order to shut down the system gracefully, I wonder if we couldn't get away with something similar in order to suspend/resume the direct assigned devices gracefully. I don't think there is such events in the current spec. Otherwise, There are two kinds of suspend/resume callbacks. 1) System suspend/resume called during S2RAM and S2DISK. 2) Runtime suspend/resume called by pm core when device is idle. If you want to do what you mentioned, you have to change PM core and ACPI spec. The dma page allocated by VF driver also needs to reserve space to do dummy write. No, this will not work. If for example you have a VF driver allocating memory for a 9K receive how will that work? It isn't as if you can poke a hole in the contiguous memory.
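For reference, the two callback classes Lan distinguishes above, as a Linux driver normally wires them up (a generic skeleton, not code from ixgbevf). Neither set is invoked on the migration path today, which is exactly the gap being debated.

#include <linux/device.h>
#include <linux/pm.h>

/* 1) System sleep callbacks: invoked for S2RAM and S2DISK. */
static int vf_sys_suspend(struct device *dev) { return 0; }
static int vf_sys_resume(struct device *dev)  { return 0; }

/* 2) Runtime PM callbacks: invoked by the PM core when the device is idle. */
static int vf_rt_suspend(struct device *dev) { return 0; }
static int vf_rt_resume(struct device *dev)  { return 0; }

static const struct dev_pm_ops vf_pm_ops = {
        SET_SYSTEM_SLEEP_PM_OPS(vf_sys_suspend, vf_sys_resume)
        SET_RUNTIME_PM_OPS(vf_rt_suspend, vf_rt_resume, NULL)
};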
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On 12/04/2015 08:32 AM, Lan, Tianyu wrote: Hi Michael & Alexander: Thanks a lot for your comments and suggestions. We still need to support Windows guest for migration and this is why our patches keep all changes in the driver since it's impossible to change Windows kernel. That is a poor argument. I highly doubt Microsoft is interested in having to modify all of the drivers that will support direct assignment in order to support migration. They would likely request something similar to what I have in that they will want a way to do DMA tracking with minimal modification required to the drivers. Following is my idea to do DMA tracking. Inject event to VF driver after memory iterate stage and before stop VCPU and then VF driver marks dirty all using DMA memory. The new allocated pages also need to be marked dirty before stopping VCPU. All dirty memory in this time slot will be migrated until stop-and-copy stage. We also need to make sure to disable VF via clearing the bus master enable bit for VF before migrating these memory. The ordering of your explanation here doesn't quite work. What needs to happen is that you have to disable DMA and then mark the pages as dirty. What the disabling of the BME does is signal to the hypervisor that the device is now stopped. The ixgbevf_suspend call already supported by the driver is almost exactly what is needed to take care of something like this. The question is how we would go about triggering it. I really don't think the PCI configuration space approach is the right idea. I wonder if we couldn't get away with some sort of ACPI event instead. We already require ACPI support in order to shut down the system gracefully, I wonder if we couldn't get away with something similar in order to suspend/resume the direct assigned devices gracefully. The dma page allocated by VF driver also needs to reserve space to do dummy write. No, this will not work. If for example you have a VF driver allocating memory for a 9K receive how will that work? It isn't as if you can poke a hole in the contiguous memory.
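To illustrate the point that clearing BME is the signal to the hypervisor, here is a hypervisor-side sketch (illustrative only, not actual QEMU/VFIO code) that traps the guest's write to the emulated PCI command register and treats a cleared bus-master bit as "DMA has stopped":

#include <stdbool.h>
#include <stdint.h>

#define PCI_COMMAND             0x04    /* config-space offset */
#define PCI_COMMAND_MASTER      0x04    /* bus master enable bit */

static bool vf_dma_quiesced;

/* Called whenever the guest writes the VF's emulated config space. */
static void vf_config_write(uint32_t offset, uint32_t val, unsigned int len)
{
        if (offset == PCI_COMMAND && len >= 2 &&
            !(val & PCI_COMMAND_MASTER)) {
                /* Guest cleared BME: DMA and MSI-X are halted, so the
                 * remaining dirty pages can be copied in stop-and-copy. */
                vf_dma_quiesced = true;
        }

        /* ... then forward the write to the physical device as usual ... */
}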
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
Hi Michael & Alexander: Thanks a lot for your comments and suggestions. We still need to support Windows guest for migration and this is why our patches keep all changes in the driver since it's impossible to change Windows kernel. Following is my idea to do DMA tracking. Inject event to VF driver after memory iterate stage and before stop VCPU and then VF driver marks dirty all using DMA memory. The new allocated pages also need to be marked dirty before stopping VCPU. All dirty memory in this time slot will be migrated until stop-and-copy stage. We also need to make sure to disable VF via clearing the bus master enable bit for VF before migrating these memory. The dma page allocated by VF driver also needs to reserve space to do dummy write. On 12/2/2015 7:44 PM, Michael S. Tsirkin wrote: On Tue, Dec 01, 2015 at 10:36:33AM -0800, Alexander Duyck wrote: On Tue, Dec 1, 2015 at 9:37 AM, Michael S. Tsirkin wrote: On Tue, Dec 01, 2015 at 09:04:32AM -0800, Alexander Duyck wrote: On Tue, Dec 1, 2015 at 7:28 AM, Michael S. Tsirkin wrote: There are several components to this: - dma_map_* needs to prevent page from being migrated while device is running. For example, expose some kind of bitmap from guest to host, set bit there while page is mapped. What happens if we stop the guest and some bits are still set? See dma_alloc_coherent below for some ideas. Yeah, I could see something like this working. Maybe we could do something like what was done for the NX bit and make use of the upper order bits beyond the limits of the memory range to mark pages as non-migratable? I'm curious. What we have with a DMA mapped region is essentially shared memory between the guest and the device. How would we resolve something like this with IVSHMEM, or are we blocked there as well in terms of migration? I have some ideas. Will post later. I look forward to it. - dma_unmap_* needs to mark page as dirty This can be done by writing into a page. - dma_sync_* needs to mark page as dirty This is trickier as we can not change the data. One solution is using atomics. For example: int x = ACCESS_ONCE(*p); cmpxchg(p, x, x); Seems to do a write without changing page contents. Like I said we can probably kill 2 birds with one stone by just implementing our own dma_mark_clean() for x86 virtualized environments. I'd say we could take your solution one step further and just use 0 instead of bothering to read the value. After all it won't write the area if the value at the offset is not 0. Really almost any atomic that has no side effect will do. atomic or with 0 atomic and with It's just that cmpxchg already happens to have a portable wrapper. I was originally thinking maybe an atomic_add with 0 would be the way to go. cmpxchg with any value too. Either way though we still are using a locked prefix and having to dirty a cache line per page which is going to come at some cost. I agree. It's likely not necessary for everyone to be doing this: only people that both run within the VM and want migration to work need to do this logging. So set some module option to have driver tell hypervisor that it supports logging. If bus mastering is enabled before this, migration is blocked. Or even pass some flag from hypervisor so driver can detect it needs to log writes. I guess this could be put in device config somewhere, though in practice it's a global thing, not a per device one, so maybe we need some new channel to pass this flag to guest. CPUID? 
Or maybe we can put some kind of agent in the initrd and use the existing guest agent channel after all. An agent in the initrd could open up a lot of new possibilities. - dma_alloc_coherent memory (e.g. device rings) must be migrated after the device has stopped modifying it. Just stopping the VCPU is not enough: you must make sure the device is not changing it. Or maybe the device has some kind of ring flush operation, if there was a reasonably portable way to do this (e.g. a flush capability could maybe be added to SRIOV) then the hypervisor could do this. This is where things start to get messy. I was suggesting the suspend/resume to resolve this bit, but it might also be possible to deal with this by clearing the bus master enable bit for the VF. If I am not mistaken that should disable MSI-X interrupts and halt any DMA. That should work as long as you have some mechanism that is tracking the pages in use for DMA. A bigger issue is recovering afterwards. Agreed. In case you need to resume on source, you really need to follow the same path as on destination, preferably detecting device reset and restoring the device state. The problem with detecting the reset is that you would likely have to be polling to do something like that. We could send some event to the guest to notify it about this through a new or existing channel. Or we could make it possible for userspace to trigger this, then notify the guest through the guest agent.
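For clarity, the guest-side sequence Lan proposes at the top of this message, written out as a handler for a hypothetical "pre stop-and-copy" event; every name here is illustrative. Note that Alexander's objection elsewhere in the thread is precisely about the ordering: in his view DMA has to be stopped before the pages are marked dirty, not after.

#include <linux/pci.h>

struct vf_adapter {
        struct pci_dev *pdev;
        /* ... driver state ... */
};

/* Hypothetical hook into the driver's DMA bookkeeping and the
 * hypervisor's dirty log. */
extern void vf_mark_all_dma_pages_dirty(struct vf_adapter *adapter);

/* Invoked when the hypervisor signals that the iterative copy phase is
 * done and the VCPUs are about to be stopped. */
static void vf_pre_stop_and_copy(struct vf_adapter *adapter)
{
        /* 1. Mark every page currently used for DMA (and any page
         *    allocated from here on) dirty so it is re-sent during the
         *    final stop-and-copy pass. */
        vf_mark_all_dma_pages_dirty(adapter);

        /* 2. Disable the VF by clearing bus master enable before the
         *    remaining dirty memory is migrated, so the device stops
         *    writing to it. */
        pci_clear_master(adapter->pdev);
}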
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On Tue, Dec 01, 2015 at 10:36:33AM -0800, Alexander Duyck wrote: > On Tue, Dec 1, 2015 at 9:37 AM, Michael S. Tsirkin wrote: > > On Tue, Dec 01, 2015 at 09:04:32AM -0800, Alexander Duyck wrote: > >> On Tue, Dec 1, 2015 at 7:28 AM, Michael S. Tsirkin wrote: > > >> > There are several components to this: > >> > - dma_map_* needs to prevent page from > >> > being migrated while device is running. > >> > For example, expose some kind of bitmap from guest > >> > to host, set bit there while page is mapped. > >> > What happens if we stop the guest and some > >> > bits are still set? See dma_alloc_coherent below > >> > for some ideas. > >> > >> Yeah, I could see something like this working. Maybe we could do > >> something like what was done for the NX bit and make use of the upper > >> order bits beyond the limits of the memory range to mark pages as > >> non-migratable? > >> > >> I'm curious. What we have with a DMA mapped region is essentially > >> shared memory between the guest and the device. How would we resolve > >> something like this with IVSHMEM, or are we blocked there as well in > >> terms of migration? > > > > I have some ideas. Will post later. > > I look forward to it. > > >> > - dma_unmap_* needs to mark page as dirty > >> > This can be done by writing into a page. > >> > > >> > - dma_sync_* needs to mark page as dirty > >> > This is trickier as we can not change the data. > >> > One solution is using atomics. > >> > For example: > >> > int x = ACCESS_ONCE(*p); > >> > cmpxchg(p, x, x); > >> > Seems to do a write without changing page > >> > contents. > >> > >> Like I said we can probably kill 2 birds with one stone by just > >> implementing our own dma_mark_clean() for x86 virtualized > >> environments. > >> > >> I'd say we could take your solution one step further and just use 0 > >> instead of bothering to read the value. After all it won't write the > >> area if the value at the offset is not 0. > > > > Really almost any atomic that has no side effect will do. > > atomic or with 0 > > atomic and with > > > > It's just that cmpxchg already happens to have a portable > > wrapper. > > I was originally thinking maybe an atomic_add with 0 would be the way > to go. cmpxchg with any value too. > Either way though we still are using a locked prefix and > having to dirty a cache line per page which is going to come at some > cost. I agree. It's likely not necessary for everyone to be doing this: only people that both run within the VM and want migration to work need to do this logging. So set some module option to have driver tell hypervisor that it supports logging. If bus mastering is enabled before this, migration is blocked. Or even pass some flag from hypervisor so driver can detect it needs to log writes. I guess this could be put in device config somewhere, though in practice it's a global thing, not a per device one, so maybe we need some new channel to pass this flag to guest. CPUID? Or maybe we can put some kind of agent in the initrd and use the existing guest agent channel after all. agent in initrd could open up a lot of new possibilities. > >> > - dma_alloc_coherent memory (e.g. device rings) > >> > must be migrated after device stopped modifying it. > >> > Just stopping the VCPU is not enough: > >> > you must make sure device is not changing it. > >> > > >> > Or maybe the device has some kind of ring flush operation, > >> > if there was a reasonably portable way to do this > >> > (e.g. 
a flush capability could maybe be added to SRIOV) > >> > then hypervisor could do this. > >> > >> This is where things start to get messy. I was suggesting the > >> suspend/resume to resolve this bit, but it might also be possible to > >> deal with this by clearing the bus master > >> enable bit for the VF. If I am not mistaken that should disable MSI-X > >> interrupts and halt any DMA. That should work as long as you have > >> some mechanism that is tracking the pages in use for DMA. > > > > A bigger issue is recovering afterwards. > > Agreed. > > >> > In case you need to resume on source, you > >> > really need to follow the same path > >> > as on destination, preferably detecting > >> > device reset and restoring the device > >> > state. > >> > >> The problem with detecting the reset is that you would likely have to > >> be polling to do something like that. > > > > We could send some event to the guest to notify it about this > > through a new or existing channel. > > > > Or we could make it possible for userspace to trigger this, > > then notify guest through the guest agent. > > The first thing that comes to mind would be to use something like PCIe > Advanced Error Reporting, however I don't know if we can put a > requirement on the system supporting the q35 machine type in > order to support migration. You mean require PCI Express? This sounds quite reasonable. > >> I believe the fm10k driver > >> already has code like that in place where it will detect a reset as a > >> part of its watchdog, however the response time is something like 2 > >> seconds for that. That was one of the reasons I preferred something > >> like hot-plug as that should be functioning as soon as the guest is up > >> and it is a mechanism that operates outside of the VF drivers.
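A sketch of the opt-in Michael describes above: the guest driver advertises, for example via a module option, that it will log DMA writes, and only then should the hypervisor allow migration (or the hypervisor passes a flag the driver checks before enabling bus mastering). The parameter name and the notification channel are assumptions; no such interface exists today.

#include <linux/module.h>
#include <linux/moduleparam.h>
#include <linux/errno.h>

static bool dma_dirty_logging;
module_param(dma_dirty_logging, bool, 0444);
MODULE_PARM_DESC(dma_dirty_logging,
                 "Log DMA writes so the hypervisor can live-migrate the VM");

struct vf_adapter;      /* driver-private state */

/* Hypothetical channel to the hypervisor (device config space, a CPUID
 * leaf, or a guest agent, as discussed above). */
extern int vf_tell_hypervisor_logging_enabled(struct vf_adapter *adapter);

static int vf_negotiate_migration(struct vf_adapter *adapter)
{
        if (!dma_dirty_logging)
                return -EOPNOTSUPP;     /* hypervisor should block migration */

        return vf_tell_hypervisor_logging_enabled(adapter);
}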
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On Tue, Dec 1, 2015 at 9:37 AM, Michael S. Tsirkin wrote: > On Tue, Dec 01, 2015 at 09:04:32AM -0800, Alexander Duyck wrote: >> On Tue, Dec 1, 2015 at 7:28 AM, Michael S. Tsirkin wrote: >> > There are several components to this: >> > - dma_map_* needs to prevent page from >> > being migrated while device is running. >> > For example, expose some kind of bitmap from guest >> > to host, set bit there while page is mapped. >> > What happens if we stop the guest and some >> > bits are still set? See dma_alloc_coherent below >> > for some ideas. >> >> Yeah, I could see something like this working. Maybe we could do >> something like what was done for the NX bit and make use of the upper >> order bits beyond the limits of the memory range to mark pages as >> non-migratable? >> >> I'm curious. What we have with a DMA mapped region is essentially >> shared memory between the guest and the device. How would we resolve >> something like this with IVSHMEM, or are we blocked there as well in >> terms of migration? > > I have some ideas. Will post later. I look forward to it. >> > - dma_unmap_* needs to mark page as dirty >> > This can be done by writing into a page. >> > >> > - dma_sync_* needs to mark page as dirty >> > This is trickier as we can not change the data. >> > One solution is using atomics. >> > For example: >> > int x = ACCESS_ONCE(*p); >> > cmpxchg(p, x, x); >> > Seems to do a write without changing page >> > contents. >> >> Like I said we can probably kill 2 birds with one stone by just >> implementing our own dma_mark_clean() for x86 virtualized >> environments. >> >> I'd say we could take your solution one step further and just use 0 >> instead of bothering to read the value. After all it won't write the >> area if the value at the offset is not 0. > > Really almost any atomic that has no side effect will do. > atomic or with 0 > atomic and with > > It's just that cmpxchg already happens to have a portable > wrapper. I was originally thinking maybe an atomic_add with 0 would be the way to go. Either way though we still are using a locked prefix and having to dirty a cache line per page which is going to come at some cost. >> > - dma_alloc_coherent memory (e.g. device rings) >> > must be migrated after device stopped modifying it. >> > Just stopping the VCPU is not enough: >> > you must make sure device is not changing it. >> > >> > Or maybe the device has some kind of ring flush operation, >> > if there was a reasonably portable way to do this >> > (e.g. a flush capability could maybe be added to SRIOV) >> > then hypervisor could do this. >> >> This is where things start to get messy. I was suggesting the >> suspend/resume to resolve this bit, but it might be possible to also >> deal with this via something like this via clearing the bus master >> enable bit for the VF. If I am not mistaken that should disable MSI-X >> interrupts and halt any DMA. That should work as long as you have >> some mechanism that is tracking the pages in use for DMA. > > A bigger issue is recovering afterwards. Agreed. >> > In case you need to resume on source, you >> > really need to follow the same path >> > as on destination, preferably detecting >> > device reset and restoring the device >> > state. >> >> The problem with detecting the reset is that you would likely have to >> be polling to do something like that. > > We could some event to guest to notify it about this event > through a new or existing channel. 
> > Or we could make it possible for userspace to trigger this, > then notify guest through the guest agent. The first thing that comes to mind would be to use something like PCIe Advanced Error Reporting, however I don't know if we can put a requirement on the system supporting the q35 machine type in order to support migration. >> I believe the fm10k driver >> already has code like that in place where it will detect a reset as a >> part of its watchdog, however the response time is something like 2 >> seconds for that. That was one of the reasons I preferred something >> like hot-plug as that should be functioning as soon as the guest is up >> and it is a mechanism that operates outside of the VF drivers. > > That's pretty minor. > A bigger issue is making sure the guest does not crash > when the device is suddenly reset under its legs. I know the ixgbevf driver should already have logic to address some of that. If you look through the code there should be logic there for surprise removal support in ixgbevf. The only issue is that unlike fm10k it will not restore itself after a resume or slot_reset call.
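For reference, the shape of the surprise-removal check Alexander refers to: PCIe returns all ones for reads from a function that has been removed or is being reset, and drivers like ixgbevf and fm10k use that to stop touching the hardware. The register offset below is a placeholder, not a real ixgbevf register; a watchdog that polls a check like this is also where the multi-second detection latency mentioned above comes from.

#include <linux/io.h>
#include <linux/types.h>

#define VF_REG_PROBE    0x0008  /* placeholder: any harmless readable register */

static bool vf_device_present(void __iomem *hw_addr)
{
        u32 value = readl(hw_addr + VF_REG_PROBE);

        /* An all-ones read means the request was not claimed: the
         * function is gone (hot-removed) or in the middle of a reset. */
        return value != 0xFFFFFFFF;
}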
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On Tue, Dec 01, 2015 at 09:04:32AM -0800, Alexander Duyck wrote: > On Tue, Dec 1, 2015 at 7:28 AM, Michael S. Tsirkin wrote: > > On Tue, Dec 01, 2015 at 11:04:31PM +0800, Lan, Tianyu wrote: > >> > >> > >> On 12/1/2015 12:07 AM, Alexander Duyck wrote: > >> >They can only be corrected if the underlying assumptions are correct > >> >and they aren't. Your solution would have never worked correctly. > >> >The problem is you assume you can keep the device running when you are > >> >migrating and you simply cannot. At some point you will always have > >> >to stop the device in order to complete the migration, and you cannot > >> >stop it before you have stopped your page tracking mechanism. So > >> >unless the platform has an IOMMU that is somehow taking part in the > >> >dirty page tracking you will not be able to stop the guest and then > >> >the device, it will have to be the device and then the guest. > >> > > >> >>>Doing suspend and resume() may help to do migration easily but some > >> >>>devices requires low service down time. Especially network and I got > >> >>>that some cloud company promised less than 500ms network service > >> >>>downtime. > >> >Honestly focusing on the downtime is getting the cart ahead of the > >> >horse. First you need to be able to do this without corrupting system > >> >memory and regardless of the state of the device. You haven't even > >> >gotten to that state yet. Last I knew the device had to be up in > >> >order for your migration to even work. > >> > >> I think the issue is that the content of rx package delivered to stack > >> maybe > >> changed during migration because the piece of memory won't be migrated to > >> new machine. This may confuse applications or stack. Current dummy write > >> solution can ensure the content of package won't change after doing dummy > >> write while the content maybe not received data if migration happens before > >> that point. We can recheck the content via checksum or crc in the protocol > >> after dummy write to ensure the content is what VF received. I think stack > >> has already done such checks and the package will be abandoned if failed to > >> pass through the check. > > > > > > Most people nowdays rely on hardware checksums so I don't think this can > > fly. > > Correct. The checksum/crc approach will not work since it is possible > for a checksum to even be mangled in the case of some features such as > LRO or GRO. > > >> Another way is to tell all memory driver are using to Qemu and let Qemu to > >> migrate these memory after stopping VCPU and the device. This seems safe > >> but > >> implementation maybe complex. > > > > Not really 100% safe. See below. > > > > I think hiding these details behind dma_* API does have > > some appeal. In any case, it gives us a good > > terminology as it covers what most drivers do. > > That was kind of my thought. If we were to build our own > dma_mark_clean() type function that will mark the DMA region dirty on > sync or unmap then that is half the battle right there as we would be > able to at least keep the regions consistent after they have left the > driver. > > > There are several components to this: > > - dma_map_* needs to prevent page from > > being migrated while device is running. > > For example, expose some kind of bitmap from guest > > to host, set bit there while page is mapped. > > What happens if we stop the guest and some > > bits are still set? See dma_alloc_coherent below > > for some ideas. > > Yeah, I could see something like this working. 
Maybe we could do > something like what was done for the NX bit and make use of the upper > order bits beyond the limits of the memory range to mark pages as > non-migratable? > > I'm curious. What we have with a DMA mapped region is essentially > shared memory between the guest and the device. How would we resolve > something like this with IVSHMEM, or are we blocked there as well in > terms of migration? I have some ideas. Will post later. > > - dma_unmap_* needs to mark page as dirty > > This can be done by writing into a page. > > > > - dma_sync_* needs to mark page as dirty > > This is trickier as we can not change the data. > > One solution is using atomics. > > For example: > > int x = ACCESS_ONCE(*p); > > cmpxchg(p, x, x); > > Seems to do a write without changing page > > contents. > > Like I said we can probably kill 2 birds with one stone by just > implementing our own dma_mark_clean() for x86 virtualized > environments. > > I'd say we could take your solution one step further and just use 0 > instead of bothering to read the value. After all it won't write the > area if the value at the offset is not 0. Really almost any atomic that has no side effect will do. atomic or with 0 atomic and with It's just that cmpxchg already happens to have a portable wrapper. > The only downside is that > this is a locked operation so we will take a pretty serious > p
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On Tue, Dec 1, 2015 at 7:28 AM, Michael S. Tsirkin wrote: > On Tue, Dec 01, 2015 at 11:04:31PM +0800, Lan, Tianyu wrote: >> >> >> On 12/1/2015 12:07 AM, Alexander Duyck wrote: >> >They can only be corrected if the underlying assumptions are correct >> >and they aren't. Your solution would have never worked correctly. >> >The problem is you assume you can keep the device running when you are >> >migrating and you simply cannot. At some point you will always have >> >to stop the device in order to complete the migration, and you cannot >> >stop it before you have stopped your page tracking mechanism. So >> >unless the platform has an IOMMU that is somehow taking part in the >> >dirty page tracking you will not be able to stop the guest and then >> >the device, it will have to be the device and then the guest. >> > >> >>>Doing suspend and resume() may help to do migration easily but some >> >>>devices requires low service down time. Especially network and I got >> >>>that some cloud company promised less than 500ms network service downtime. >> >Honestly focusing on the downtime is getting the cart ahead of the >> >horse. First you need to be able to do this without corrupting system >> >memory and regardless of the state of the device. You haven't even >> >gotten to that state yet. Last I knew the device had to be up in >> >order for your migration to even work. >> >> I think the issue is that the content of rx package delivered to stack maybe >> changed during migration because the piece of memory won't be migrated to >> new machine. This may confuse applications or stack. Current dummy write >> solution can ensure the content of package won't change after doing dummy >> write while the content maybe not received data if migration happens before >> that point. We can recheck the content via checksum or crc in the protocol >> after dummy write to ensure the content is what VF received. I think stack >> has already done such checks and the package will be abandoned if failed to >> pass through the check. > > > Most people nowdays rely on hardware checksums so I don't think this can > fly. Correct. The checksum/crc approach will not work since it is possible for a checksum to even be mangled in the case of some features such as LRO or GRO. >> Another way is to tell all memory driver are using to Qemu and let Qemu to >> migrate these memory after stopping VCPU and the device. This seems safe but >> implementation maybe complex. > > Not really 100% safe. See below. > > I think hiding these details behind dma_* API does have > some appeal. In any case, it gives us a good > terminology as it covers what most drivers do. That was kind of my thought. If we were to build our own dma_mark_clean() type function that will mark the DMA region dirty on sync or unmap then that is half the battle right there as we would be able to at least keep the regions consistent after they have left the driver. > There are several components to this: > - dma_map_* needs to prevent page from > being migrated while device is running. > For example, expose some kind of bitmap from guest > to host, set bit there while page is mapped. > What happens if we stop the guest and some > bits are still set? See dma_alloc_coherent below > for some ideas. Yeah, I could see something like this working. Maybe we could do something like what was done for the NX bit and make use of the upper order bits beyond the limits of the memory range to mark pages as non-migratable? I'm curious. 
What we have with a DMA mapped region is essentially shared memory between the guest and the device. How would we resolve something like this with IVSHMEM, or are we blocked there as well in terms of migration? > - dma_unmap_* needs to mark page as dirty > This can be done by writing into a page. > > - dma_sync_* needs to mark page as dirty > This is trickier as we can not change the data. > One solution is using atomics. > For example: > int x = ACCESS_ONCE(*p); > cmpxchg(p, x, x); > Seems to do a write without changing page > contents. Like I said we can probably kill 2 birds with one stone by just implementing our own dma_mark_clean() for x86 virtualized environments. I'd say we could take your solution one step further and just use 0 instead of bothering to read the value. After all it won't write the area if the value at the offset is not 0. The only downside is that this is a locked operation so we will take a pretty serious performance penalty when this is active. As such my preference would be to hide the code behind some static key that we could then switch on in the event of a VM being migrated. > - dma_alloc_coherent memory (e.g. device rings) > must be migrated after device stopped modifying it. > Just stopping the VCPU is not enough: > you must make sure device is not changing it. > > Or maybe the device has some kind of ring flush operation, > if there was a reasonably portable way to do this (e.g. a flush capability could maybe be added to SRIOV) then hypervisor could do this.
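As a rough illustration of the "hide the code behind some static key" idea, here is a hypothetical userspace sketch: the dirty-marking hook sits in the DMA unmap/sync path but is gated by a flag that the migration start/stop notification flips. In the kernel the flag would be a static key (jump label), so the non-migrating case costs only a patched-out branch; the plain boolean and the function names below are assumptions made purely so the example compiles standalone.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL

/* Flipped when migration starts/stops.  A kernel implementation would use a
 * static key (static_branch_unlikely()) instead of loading a flag. */
static bool migration_active;

/* No-op locked write: marks the page dirty without changing its contents. */
static void touch_page(uint8_t *p)
{
	__atomic_fetch_or(p, 0, __ATOMIC_RELAXED);
}

/* Hypothetical hook shared by the DMA unmap/sync-for-cpu paths. */
static void dma_mark_dirty_if_migrating(void *addr, size_t size)
{
	uint8_t *p = addr;
	size_t off;

	if (!__atomic_load_n(&migration_active, __ATOMIC_ACQUIRE))
		return;				/* fast path: nothing to do */

	for (off = 0; off < size; off += PAGE_SIZE)
		touch_page(p + off);
	if (size)
		touch_page(p + size - 1);	/* make sure the tail page is covered */
}

int main(void)
{
	static uint8_t buf[3 * PAGE_SIZE];

	dma_mark_dirty_if_migrating(buf, sizeof(buf));	/* no-op: not migrating */
	__atomic_store_n(&migration_active, true, __ATOMIC_RELEASE);
	dma_mark_dirty_if_migrating(buf, sizeof(buf));	/* touches each page */
	printf("buf[0] still %u\n", buf[0]);
	return 0;
}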
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On Tue, Dec 01, 2015 at 11:04:31PM +0800, Lan, Tianyu wrote: > > > On 12/1/2015 12:07 AM, Alexander Duyck wrote: > >They can only be corrected if the underlying assumptions are correct > >and they aren't. Your solution would have never worked correctly. > >The problem is you assume you can keep the device running when you are > >migrating and you simply cannot. At some point you will always have > >to stop the device in order to complete the migration, and you cannot > >stop it before you have stopped your page tracking mechanism. So > >unless the platform has an IOMMU that is somehow taking part in the > >dirty page tracking you will not be able to stop the guest and then > >the device, it will have to be the device and then the guest. > > > >>>Doing suspend and resume() may help to do migration easily but some > >>>devices requires low service down time. Especially network and I got > >>>that some cloud company promised less than 500ms network service downtime. > >Honestly focusing on the downtime is getting the cart ahead of the > >horse. First you need to be able to do this without corrupting system > >memory and regardless of the state of the device. You haven't even > >gotten to that state yet. Last I knew the device had to be up in > >order for your migration to even work. > > I think the issue is that the content of rx package delivered to stack maybe > changed during migration because the piece of memory won't be migrated to > new machine. This may confuse applications or stack. Current dummy write > solution can ensure the content of package won't change after doing dummy > write while the content maybe not received data if migration happens before > that point. We can recheck the content via checksum or crc in the protocol > after dummy write to ensure the content is what VF received. I think stack > has already done such checks and the package will be abandoned if failed to > pass through the check. Most people nowadays rely on hardware checksums so I don't think this can fly. > Another way is to tell all memory driver are using to Qemu and let Qemu to > migrate these memory after stopping VCPU and the device. This seems safe but > implementation maybe complex. Not really 100% safe. See below. I think hiding these details behind dma_* API does have some appeal. In any case, it gives us a good terminology as it covers what most drivers do. There are several components to this:
- dma_map_* needs to prevent page from being migrated while device is running. For example, expose some kind of bitmap from guest to host, set bit there while page is mapped. What happens if we stop the guest and some bits are still set? See dma_alloc_coherent below for some ideas.
- dma_unmap_* needs to mark page as dirty. This can be done by writing into a page.
- dma_sync_* needs to mark page as dirty. This is trickier as we can not change the data. One solution is using atomics. For example: int x = ACCESS_ONCE(*p); cmpxchg(p, x, x); Seems to do a write without changing page contents.
- dma_alloc_coherent memory (e.g. device rings) must be migrated after device stopped modifying it. Just stopping the VCPU is not enough: you must make sure device is not changing it. Or maybe the device has some kind of ring flush operation, if there was a reasonably portable way to do this (e.g. a flush capability could maybe be added to SRIOV) then hypervisor could do this. With existing devices, either do it after device reset, or disable memory access in the IOMMU. Maybe both.
In case you need to resume on source, you really need to follow the same path as on destination, preferably detecting device reset and restoring the device state. A similar approach could work for dma_map_ above. -- MST
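A small sketch of the dma_map_*/dma_unmap_* bitmap component from the proposal above (all names hypothetical): the guest keeps one bit per page frame in memory shared with the hypervisor, sets it while a page is DMA-mapped, and clears it on unmap; the hypervisor treats set bits as "cannot migrate yet / must re-send once clear". A real implementation would register the bitmap with the host and use the kernel's set_bit()/clear_bit() helpers rather than the standalone userspace code shown here.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define PAGE_SHIFT	12
#define BITS_PER_LONG	(8 * sizeof(unsigned long))

struct dma_pin_bitmap {
	unsigned long *bits;	/* assumed to be shared with the hypervisor */
	uint64_t nr_pages;
};

/* Called from the dma_map_* path: mark the page as in use by the device. */
static void pin_page(struct dma_pin_bitmap *bm, uint64_t gpa)
{
	uint64_t pfn = gpa >> PAGE_SHIFT;

	__atomic_fetch_or(&bm->bits[pfn / BITS_PER_LONG],
			  1UL << (pfn % BITS_PER_LONG), __ATOMIC_SEQ_CST);
}

/* Called from the dma_unmap_* path: the device may have written the page,
 * so the migration code must also treat it as dirty (see the no-op-write
 * trick discussed earlier in the thread). */
static void unpin_page(struct dma_pin_bitmap *bm, uint64_t gpa)
{
	uint64_t pfn = gpa >> PAGE_SHIFT;

	__atomic_fetch_and(&bm->bits[pfn / BITS_PER_LONG],
			   ~(1UL << (pfn % BITS_PER_LONG)), __ATOMIC_SEQ_CST);
}

/* Host side: a set bit means the page cannot be finalized yet. */
static int page_is_pinned(struct dma_pin_bitmap *bm, uint64_t gpa)
{
	uint64_t pfn = gpa >> PAGE_SHIFT;

	return !!(bm->bits[pfn / BITS_PER_LONG] & (1UL << (pfn % BITS_PER_LONG)));
}

int main(void)
{
	struct dma_pin_bitmap bm = { .nr_pages = 1UL << 20 };	/* 4 GiB of guest RAM */

	bm.bits = calloc(bm.nr_pages / BITS_PER_LONG, sizeof(unsigned long));
	if (!bm.bits)
		return 1;
	pin_page(&bm, 0x12345000ULL);
	printf("pinned: %d\n", page_is_pinned(&bm, 0x12345000ULL));
	unpin_page(&bm, 0x12345000ULL);
	printf("pinned: %d\n", page_is_pinned(&bm, 0x12345000ULL));
	free(bm.bits);
	return 0;
}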
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On 12/1/2015 12:07 AM, Alexander Duyck wrote: They can only be corrected if the underlying assumptions are correct and they aren't. Your solution would have never worked correctly. The problem is you assume you can keep the device running when you are migrating and you simply cannot. At some point you will always have to stop the device in order to complete the migration, and you cannot stop it before you have stopped your page tracking mechanism. So unless the platform has an IOMMU that is somehow taking part in the dirty page tracking you will not be able to stop the guest and then the device, it will have to be the device and then the guest. >Doing suspend and resume() may help to do migration easily but some >devices requires low service down time. Especially network and I got >that some cloud company promised less than 500ms network service downtime. Honestly focusing on the downtime is getting the cart ahead of the horse. First you need to be able to do this without corrupting system memory and regardless of the state of the device. You haven't even gotten to that state yet. Last I knew the device had to be up in order for your migration to even work. I think the issue is that the content of an rx packet delivered to the stack may be changed during migration because that piece of memory won't be migrated to the new machine. This may confuse applications or the stack. The current dummy-write solution can ensure the content of a packet won't change after the dummy write, but the content may not be the received data if migration happens before that point. We can recheck the content via the checksum or CRC in the protocol after the dummy write to ensure the content is what the VF received. I think the stack already does such checks and the packet will be dropped if it fails the check. Another way is to tell Qemu about all the memory the driver is using and let Qemu migrate that memory after stopping the VCPU and the device. This seems safe but the implementation may be complex.
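For reference, the "dummy write" being debated amounts to something like the following hypothetical helper, run on an Rx buffer before it is handed to the stack: it reads a byte from each page and writes the same value back, so dirty-page logging re-sends those pages. As the discussion notes, this only stabilizes whatever is in the buffer at that moment; it does not guarantee the device had finished writing it.

#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL

/* Hypothetical per-packet dummy write; a driver would call this on the Rx
 * buffer before passing it up the stack while migration is active. */
static void rx_buffer_dummy_write(void *buf, size_t len)
{
	volatile uint8_t *p = buf;
	size_t off;

	for (off = 0; off < len; off += PAGE_SIZE)
		p[off] = p[off];		/* volatile forces a real load and store */
	if (len)
		p[len - 1] = p[len - 1];	/* touch the last page too */
}

int main(void)
{
	static uint8_t pkt[2048] = { 0xab };

	rx_buffer_dummy_write(pkt, sizeof(pkt));
	return pkt[0] == 0xab ? 0 : 1;		/* contents unchanged */
}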
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On Sun, Nov 29, 2015 at 10:53 PM, Lan, Tianyu wrote: > On 11/26/2015 11:56 AM, Alexander Duyck wrote: >> >> > I am not saying you cannot modify the drivers, however what you are >> doing is far too invasive. Do you seriously plan on modifying all of >> the PCI device drivers out there in order to allow any device that >> might be direct assigned to a port to support migration? I certainly >> hope not. That is why I have said that this solution will not scale. > > > Current drivers are not migration friendly. If the driver wants to > support migration, it's necessary to be changed. Modifying all of the drivers directly will not solve the issue though. This is why I have suggested looking at possibly implementing something like dma_mark_clean() which is used for ia64 architectures to mark pages that were DMAed in as clean. In your case though you would want to mark such pages as dirty so that the page migration will notice them and move them over. > RFC PATCH V1 presented our ideas about how to deal with MMIO, ring and > DMA tracking during migration. These are common for most drivers and > they maybe problematic in the previous version but can be corrected later. They can only be corrected if the underlying assumptions are correct and they aren't. Your solution would have never worked correctly. The problem is you assume you can keep the device running when you are migrating and you simply cannot. At some point you will always have to stop the device in order to complete the migration, and you cannot stop it before you have stopped your page tracking mechanism. So unless the platform has an IOMMU that is somehow taking part in the dirty page tracking you will not be able to stop the guest and then the device, it will have to be the device and then the guest. > Doing suspend and resume() may help to do migration easily but some > devices requires low service down time. Especially network and I got > that some cloud company promised less than 500ms network service downtime. Honestly focusing on the downtime is getting the cart ahead of the horse. First you need to be able to do this without corrupting system memory and regardless of the state of the device. You haven't even gotten to that state yet. Last I knew the device had to be up in order for your migration to even work. Many devices are very state driven. As such you cannot just freeze them and restore them like you would regular device memory. That is where something like suspend/resume comes in because it already takes care of getting the device ready for halt, and then resume. Keep in mind that those functions were meant to function on a device doing something like a suspend to RAM or disk. This is not too far off from what a migration is doing since you need to halt the guest before you move it. As such the first step is to make it so that we can do the current bonding approach with one change. Specifically we want to leave the device in the guest until the last portion of the migration instead of having to remove it first. To that end I would suggest focusing on solving the DMA problem via something like a dma_mark_clean() type solution as that would be one issue resolved and we all would see an immediate gain instead of just those users of the ixgbevf driver. > So I think performance effect also should be taken into account when we > design the framework. What you are proposing I would call premature optimization.
You need to actually solve the problem before you can start optimizing things and I don't see anything actually solved yet since your solution is too unstable. >> >> What I am counter proposing seems like a very simple proposition. It >> can be implemented in two steps. >> >> 1. Look at modifying dma_mark_clean(). It is a function called in >> the sync and unmap paths of the lib/swiotlb.c. If you could somehow >> modify it to take care of marking the pages you unmap for Rx as being >> dirty it will get you a good way towards your goal as it will allow >> you to continue to do DMA while you are migrating the VM. >> >> 2. Look at making use of the existing PCI suspend/resume calls that >> are there to support PCI power management. They have everything >> needed to allow you to pause and resume DMA for the device before and >> after the migration while retaining the driver state. If you can >> implement something that allows you to trigger these calls from the >> PCI subsystem such as hot-plug then you would have a generic solution >> that can be easily reproduced for multiple drivers beyond those >> supported by ixgbevf. > > > Glanced at PCI hotplug code. The hotplug events are triggered by PCI hotplug > controller and these event are defined in the controller spec. > It's hard to extend more events. Otherwise, we also need to add some > specific codes in the PCI hotplug core since it's only add and remove > PCI device when it gets events. It's also a challenge to modify Windows > hotplug codes. So we may need to find another way.
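To sketch step 1 above with hypothetical names: a dma_mark_dirty() counterpart to dma_mark_clean(), invoked from the same unmap/sync-for-cpu call sites described in the quoted proposal, could simply log the affected page frames for the migration code instead of touching the data. This differs from the map/unmap pin bitmap shown earlier in the thread: here a bit is set on unmap to record "this page changed", rather than kept set while the mapping is live. The names and structure below are assumptions, not the actual swiotlb interfaces.

#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define BITS_PER_LONG	(8 * sizeof(unsigned long))

struct migration_dirty_log {
	unsigned long *bitmap;		/* one bit per guest page frame */
	uint64_t nr_pfns;
};

/* Hypothetical counterpart to dma_mark_clean(): record every page frame
 * covered by a DMA_FROM_DEVICE buffer as dirty for the migration code. */
static void dma_mark_dirty(struct migration_dirty_log *log,
			   uint64_t paddr, size_t size)
{
	uint64_t pfn, last;

	if (!size)
		return;
	pfn = paddr >> PAGE_SHIFT;
	last = (paddr + size - 1) >> PAGE_SHIFT;
	for (; pfn <= last && pfn < log->nr_pfns; pfn++)
		__atomic_fetch_or(&log->bitmap[pfn / BITS_PER_LONG],
				  1UL << (pfn % BITS_PER_LONG), __ATOMIC_RELAXED);
}

/* Hypothetical wrapper showing roughly where the call would sit, mirroring
 * the dma_mark_clean() call sites described above for the unmap path. */
static void unmap_rx_buffer(struct migration_dirty_log *log,
			    uint64_t paddr, size_t size)
{
	/* ... the real unmap / bounce-buffer copy-back would happen here ... */
	dma_mark_dirty(log, paddr, size);
}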
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On 11/26/2015 11:56 AM, Alexander Duyck wrote: > I am not saying you cannot modify the drivers, however what you are doing is far too invasive. Do you seriously plan on modifying all of the PCI device drivers out there in order to allow any device that might be direct assigned to a port to support migration? I certainly hope not. That is why I have said that this solution will not scale. Current drivers are not migration friendly. If a driver wants to support migration, it needs to be changed. RFC PATCH V1 presented our ideas about how to deal with MMIO, ring and DMA tracking during migration. These are common for most drivers; they may be problematic in the previous version but can be corrected later. Doing suspend and resume() may help to do migration easily, but some devices require low service downtime, especially network devices; I heard that some cloud companies promise less than 500ms of network service downtime. So I think the performance effect should also be taken into account when we design the framework. What I am counter proposing seems like a very simple proposition. It can be implemented in two steps. 1. Look at modifying dma_mark_clean(). It is a function called in the sync and unmap paths of the lib/swiotlb.c. If you could somehow modify it to take care of marking the pages you unmap for Rx as being dirty it will get you a good way towards your goal as it will allow you to continue to do DMA while you are migrating the VM. 2. Look at making use of the existing PCI suspend/resume calls that are there to support PCI power management. They have everything needed to allow you to pause and resume DMA for the device before and after the migration while retaining the driver state. If you can implement something that allows you to trigger these calls from the PCI subsystem such as hot-plug then you would have a generic solution that can be easily reproduced for multiple drivers beyond those supported by ixgbevf. I glanced at the PCI hotplug code. The hotplug events are triggered by the PCI hotplug controller and these events are defined in the controller spec, so it's hard to add more events. Otherwise, we would also need to add some specific code to the PCI hotplug core, since it only adds and removes PCI devices when it gets events. It's also a challenge to modify the Windows hotplug code. So we may need to find another way. Thanks. - Alex
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On Wed, Nov 25, 2015 at 7:15 PM, Dong, Eddie wrote: >> On Wed, Nov 25, 2015 at 12:21 AM, Lan Tianyu wrote: >> > On 2015年11月25日 13:30, Alexander Duyck wrote: >> >> No, what I am getting at is that you can't go around and modify the >> >> configuration space for every possible device out there. This >> >> solution won't scale. >> > >> > >> > PCI config space regs are emulation by Qemu and so We can find the >> > free PCI config space regs for the faked PCI capability. Its position >> > can be not permanent. >> >> Yes, but do you really want to edit every driver on every OS that you plan to >> support this on. What about things like direct assignment of regular >> Ethernet >> ports? What you really need is a solution that will work generically on any >> existing piece of hardware out there. > > The fundamental assumption of this patch series is to modify the driver in > guest to self-emulate or track the device state, so that the migration may be > possible. > I don't think we can modify OS, without modifying the drivers, even using the > PCIe hotplug mechanism. > In the meantime, modifying Windows OS is a big challenge given that only > Microsoft can do. While, modifying driver is relatively simple and manageable > to device vendors, if the device vendor want to support state-clone based > migration. The problem is the code you are presenting, even as a proof of concept is seriously flawed. It does a poor job of exposing how any of this can be duplicated for any other VF other than the one you are working on. I am not saying you cannot modify the drivers, however what you are doing is far too invasive. Do you seriously plan on modifying all of the PCI device drivers out there in order to allow any device that might be direct assigned to a port to support migration? I certainly hope not. That is why I have said that this solution will not scale. What I am counter proposing seems like a very simple proposition. It can be implemented in two steps. 1. Look at modifying dma_mark_clean(). It is a function called in the sync and unmap paths of the lib/swiotlb.c. If you could somehow modify it to take care of marking the pages you unmap for Rx as being dirty it will get you a good way towards your goal as it will allow you to continue to do DMA while you are migrating the VM. 2. Look at making use of the existing PCI suspend/resume calls that are there to support PCI power management. They have everything needed to allow you to pause and resume DMA for the device before and after the migration while retaining the driver state. If you can implement something that allows you to trigger these calls from the PCI subsystem such as hot-plug then you would have a generic solution that can be easily reproduced for multiple drivers beyond those supported by ixgbevf. Thanks. - Alex
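Step 2 above can be pictured with a small userspace model (hypothetical names throughout): a "pause" hot-plug event calls the driver's existing PM suspend callback instead of removing the device, and "resume" calls the matching resume callback on the destination. In the kernel this would be routed through the PCI core and the driver's existing dev_pm_ops rather than the hand-rolled callback table used here for illustration.

#include <stdio.h>

struct pm_callbacks {
	int (*suspend)(void *dev);	/* quiesce DMA, save device state */
	int (*resume)(void *dev);	/* restore state, restart DMA */
};

/* Invoked on a "pause" hot-plug event just before the final migration
 * pass, instead of removing the device from the slot. */
static int slot_pause(const struct pm_callbacks *pm, void *dev)
{
	return pm->suspend ? pm->suspend(dev) : 0;
}

/* Invoked once the device is plugged back on the destination, or on the
 * source if the migration aborts. */
static int slot_resume(const struct pm_callbacks *pm, void *dev)
{
	return pm->resume ? pm->resume(dev) : 0;
}

/* Toy driver standing in for the PM hooks a real VF driver already has. */
static int toy_suspend(void *dev) { (void)dev; puts("suspend: DMA stopped, state saved"); return 0; }
static int toy_resume(void *dev)  { (void)dev; puts("resume: state restored, DMA restarted"); return 0; }

int main(void)
{
	const struct pm_callbacks toy = { .suspend = toy_suspend, .resume = toy_resume };

	slot_pause(&toy, NULL);		/* source side, end of migration */
	slot_resume(&toy, NULL);	/* destination side */
	return 0;
}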
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
> On Wed, Nov 25, 2015 at 12:21 AM, Lan Tianyu wrote: > > On 2015年11月25日 13:30, Alexander Duyck wrote: > >> No, what I am getting at is that you can't go around and modify the > >> configuration space for every possible device out there. This > >> solution won't scale. > > > > > > PCI config space regs are emulation by Qemu and so We can find the > > free PCI config space regs for the faked PCI capability. Its position > > can be not permanent. > > Yes, but do you really want to edit every driver on every OS that you plan to > support this on. What about things like direct assignment of regular Ethernet > ports? What you really need is a solution that will work generically on any > existing piece of hardware out there. The fundamental assumption of this patch series is to modify the driver in the guest to self-emulate or track the device state, so that migration becomes possible. I don't think we can modify the OS without modifying the drivers, even using the PCIe hotplug mechanism. In the meantime, modifying the Windows OS is a big challenge given that only Microsoft can do that, while modifying a driver is relatively simple and manageable for device vendors, if the device vendor wants to support state-clone based migration. Thx Eddie
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On Wed, Nov 25, 2015 at 12:21 AM, Lan Tianyu wrote: > On 2015年11月25日 13:30, Alexander Duyck wrote: >> No, what I am getting at is that you can't go around and modify the >> configuration space for every possible device out there. This >> solution won't scale. > > > PCI config space regs are emulation by Qemu and so We can find the free > PCI config space regs for the faked PCI capability. Its position can be > not permanent. Yes, but do you really want to edit every driver on every OS that you plan to support this on? What about things like direct assignment of regular Ethernet ports? What you really need is a solution that will work generically on any existing piece of hardware out there. >> If you instead moved the logic for notifying >> the device into a separate mechanism such as making it a part of the >> hot-plug logic then you only have to write the code once per OS in >> order to get the hot-plug capability to pause/resume the device. What >> I am talking about is not full hot-plug, but rather to extend the >> existing hot-plug in Qemu and the Linux kernel to support a >> "pause/resume" functionality. The PCI hot-plug specification calls >> out the option of implementing something like this, but we don't >> currently have support for it. >> > > Could you elaborate the part of PCI hot-plug specification you mentioned? > > My concern is whether it needs to change PCI spec or not. In the PCI Hot-Plug Specification 1.1, in section 4.1.2 it states: "In addition to quiescing add-in card activity, an operating-system vendor may optionally implement a less drastic “pause” capability, in anticipation of the same or a similar add-in card being reinserted." The idea I had was basically if we were to implement something like that in Linux then we could pause/resume the device instead of outright removing it. The pause functionality could make use of the suspend/resume functionality most drivers already have for PCI power management. - Alex
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On 2015年11月25日 13:30, Alexander Duyck wrote: > No, what I am getting at is that you can't go around and modify the > configuration space for every possible device out there. This > solution won't scale. PCI config space regs are emulated by Qemu, so we can find free PCI config space regs for the faked PCI capability. Its position does not need to be permanent. > If you instead moved the logic for notifying > the device into a separate mechanism such as making it a part of the > hot-plug logic then you only have to write the code once per OS in > order to get the hot-plug capability to pause/resume the device. What > I am talking about is not full hot-plug, but rather to extend the > existing hot-plug in Qemu and the Linux kernel to support a > "pause/resume" functionality. The PCI hot-plug specification calls > out the option of implementing something like this, but we don't > currently have support for it. > Could you elaborate on the part of the PCI hot-plug specification you mentioned? My concern is whether it requires a change to the PCI spec or not. > I just feel doing it through PCI hot-plug messages will scale much > better as you could likely make use of the power management > suspend/resume calls to take care of most of the needed implementation > details. > > - Alex -- Best regards Tianyu Lan
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On Tue, Nov 24, 2015 at 7:18 PM, Lan Tianyu wrote: > On 2015年11月24日 22:20, Alexander Duyck wrote: >> I'm still not a fan of this approach. I really feel like this is >> something that should be resolved by extending the existing PCI hot-plug >> rather than trying to instrument this per driver. Then you will get the >> goodness for multiple drivers and multiple OSes instead of just one. An >> added advantage to dealing with this in the PCI hot-plug environment >> would be that you could then still do a hot-plug even if the guest >> didn't load a driver for the VF since you would be working with the PCI >> slot instead of the device itself. >> >> - Alex > > Hi Alex: > What's you mentioned seems the bonding driver solution. > Paper "Live Migration with Pass-through Device for Linux VM" describes > it. It does VF hotplug during migration. In order to maintain Network > connection when VF is out, it takes advantage of Linux bonding driver to > switch between VF NIC and emulated NIC. But the side affects, that > requires VM to do additional configure and the performance during > switching two NIC is not good. No, what I am getting at is that you can't go around and modify the configuration space for every possible device out there. This solution won't scale. If you instead moved the logic for notifying the device into a separate mechanism such as making it a part of the hot-plug logic then you only have to write the code once per OS in order to get the hot-plug capability to pause/resume the device. What I am talking about is not full hot-plug, but rather to extend the existing hot-plug in Qemu and the Linux kernel to support a "pause/resume" functionality. The PCI hot-plug specification calls out the option of implementing something like this, but we don't currently have support for it. I just feel doing it through PCI hot-plug messages will scale much better as you could likely make use of the power management suspend/resume calls to take care of most of the needed implementation details. - Alex
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On 2015年11月24日 22:20, Alexander Duyck wrote: > I'm still not a fan of this approach. I really feel like this is > something that should be resolved by extending the existing PCI hot-plug > rather than trying to instrument this per driver. Then you will get the > goodness for multiple drivers and multiple OSes instead of just one. An > added advantage to dealing with this in the PCI hot-plug environment > would be that you could then still do a hot-plug even if the guest > didn't load a driver for the VF since you would be working with the PCI > slot instead of the device itself. > > - Alex Hi Alex: What you mentioned seems to be the bonding driver solution. The paper "Live Migration with Pass-through Device for Linux VM" describes it. It does VF hotplug during migration. In order to maintain the network connection while the VF is out, it takes advantage of the Linux bonding driver to switch between the VF NIC and an emulated NIC. But the side effect is that it requires the VM to do additional configuration, and the performance while switching between the two NICs is not good. -- Best regards Tianyu Lan
Re: [Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
On 11/24/2015 05:38 AM, Lan Tianyu wrote: This patchset is to propose a solution of adding live migration support for SRIOV NIC. During migration, Qemu needs to let VF driver in the VM to know migration start and end. Qemu adds faked PCI migration capability to help to sync status between two sides during migration. Qemu triggers VF's mailbox irq via sending MSIX msg when migration status is changed. VF driver tells Qemu its mailbox vector index via the new PCI capability. In some cases(NIC is suspended or closed), VF mailbox irq is freed and VF driver can disable irq injecting via new capability. VF driver will put down nic before migration and put up again on the target machine. Lan Tianyu (3): VFIO: Add new ioctl cmd VFIO_GET_PCI_CAP_INFO PCI: Add macros for faked PCI migration capability Ixgbevf: Add migration support for ixgbevf driver drivers/net/ethernet/intel/ixgbevf/ixgbevf.h | 5 ++ drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 102 ++ drivers/vfio/pci/vfio_pci.c | 21 + drivers/vfio/pci/vfio_pci_config.c| 38 ++-- drivers/vfio/pci/vfio_pci_private.h | 5 ++ include/uapi/linux/pci_regs.h | 18 +++- include/uapi/linux/vfio.h | 12 +++ 7 files changed, 194 insertions(+), 7 deletions(-) I'm still not a fan of this approach. I really feel like this is something that should be resolved by extending the existing PCI hot-plug rather than trying to instrument this per driver. Then you will get the goodness for multiple drivers and multiple OSes instead of just one. An added advantage to dealing with this in the PCI hot-plug environment would be that you could then still do a hot-plug even if the guest didn't load a driver for the VF since you would be working with the PCI slot instead of the device itself. - Alex
[Qemu-devel] [RFC PATCH V2 0/3] IXGBE/VFIO: Add live migration support for SRIOV NIC
This patchset proposes a solution for adding live migration support for SRIOV NICs. During migration, Qemu needs to let the VF driver in the VM know when migration starts and ends. Qemu adds a faked PCI migration capability to help sync status between the two sides during migration. Qemu triggers the VF's mailbox irq by sending an MSI-X message when the migration status changes. The VF driver tells Qemu its mailbox vector index via the new PCI capability. In some cases (NIC is suspended or closed), the VF mailbox irq is freed and the VF driver can disable irq injection via the new capability. The VF driver will put the NIC down before migration and bring it up again on the target machine. Lan Tianyu (3): VFIO: Add new ioctl cmd VFIO_GET_PCI_CAP_INFO PCI: Add macros for faked PCI migration capability Ixgbevf: Add migration support for ixgbevf driver drivers/net/ethernet/intel/ixgbevf/ixgbevf.h | 5 ++ drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 102 ++ drivers/vfio/pci/vfio_pci.c | 21 + drivers/vfio/pci/vfio_pci_config.c | 38 ++-- drivers/vfio/pci/vfio_pci_private.h | 5 ++ include/uapi/linux/pci_regs.h | 18 +++- include/uapi/linux/vfio.h | 12 +++ 7 files changed, 194 insertions(+), 7 deletions(-) -- 1.8.4.rc0.1.g8f6a3e5.dirty
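To make the flow described above easier to picture, here is a purely hypothetical sketch of how such a faked migration capability could be laid out in config space. The real offsets and macros are defined in the patches themselves (the pci_regs.h and vfio changes listed above) and are not reproduced here; 0x09 is the standard PCI vendor-specific capability ID, but every other value and field name below is invented for illustration only.

#include <stdint.h>

#define FAKE_MIG_CAP_ID		0x09	/* PCI vendor-specific capability ID */

/* Hypothetical byte offsets within the capability: */
enum {
	FAKE_MIG_CAP_VNDR	= 0x00,	/* capability ID (0x09) */
	FAKE_MIG_CAP_NEXT	= 0x01,	/* next capability pointer */
	FAKE_MIG_CAP_LEN	= 0x02,	/* capability length */
	FAKE_MIG_CAP_STATUS	= 0x03,	/* written by Qemu: migration start/end */
	FAKE_MIG_CAP_VECTOR	= 0x04,	/* written by the VF driver: mailbox MSI-X vector index */
	FAKE_MIG_CAP_CTRL	= 0x06,	/* written by the VF driver: enable/disable irq injection */
};

/* Hypothetical status values Qemu could write before injecting the
 * mailbox interrupt: */
enum {
	FAKE_MIG_STATUS_NONE	= 0,
	FAKE_MIG_STATUS_START	= 1,	/* guest driver puts the NIC down, saves state */
	FAKE_MIG_STATUS_END	= 2,	/* guest driver brings the NIC back up */
};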