Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
On Wed, 8 Jan 2020 15:44:28 -0700 Alex Williamson wrote: > On Thu, 9 Jan 2020 02:11:11 +0530 > Kirti Wankhede wrote: > > > On 1/9/2020 12:01 AM, Alex Williamson wrote: > > > On Wed, 8 Jan 2020 15:59:55 +0100 > > > Cornelia Huck wrote: > > >> I think one thing we could do is start to tie the meaning more to the > > >> actual state (bit combination) and less to the individual bits. I.e. > > >> > > >> - bit 0 indicates 'running', > > >> - bit 1 indicates 'saving', > > >> - bit 2 indicates 'resuming', > > >> - bits 3-31 are reserved. [Aside: reserved-and-ignored or > > >>reserved-and-must-be-zero?] > > > > > > This version specified them as: > > > > > > Bits 3 - 31 are reserved for future use. User should perform > > > read-modify-write operation on this field. > > > > > > The intention is that the user should not make any assumptions about > > > the state of the reserved bits, but should preserve them when changing > > > known bits. Therefore I think it's ignored but preserved. If we > > > specify them as zero, then I think we lose any chance to define them > > > later. Nod. What about extending the description to: "Bits 3-31 are reserved for future use. In order to preserve them, a read-modify-write operation on this field should be used when modifying the specified bits." ? > > > > > >> [Note that I don't specify what happens when a bit is set or unset.] > > >> > > >> States are then defined as: > > >> 000b => stopped state (not saving or resuming) > > >> 001b => running state (not saving or resuming) > > >> 010b => stop-and-copy state > > >> 011b => pre-copy state > > >> 100b => resuming state > > >> > > >> [Transitions between these states defined, as before.] > > >> > > >> 101b => reserved [for post-copy; no transitions defined] > > >> 111b => reserved [state does not make sense; no transitions defined] > > >> 110b => error state [state does not make sense per se, but it does not > > >> indicate running; transitions into this state *are* possible] > > >> > > >> To a 'reserved' state, we can later assign a different meaning (we > > >> could even re-use 111b for a different error state, if needed); while > > >> the error state must always stay the error state. > > >> > > >> We should probably use some kind of feature indication to signify > > >> whether a 'reserved' state actually has a meaning. Also, maybe we also > > >> should designate the states > 111b as 'reserved'. > > >> > > >> Does that make sense? > > > > > > It seems you have an opinion to restrict this particular error state to > > > 110b rather than 11Xb, reserving 111b for some future error condition. > > > That's fine and I think we agree that using the state with _RUNNING set > > > to zero is more logical as we expect the device to be non-operational > > > in this state. Good. > > > > > > I'm also thinking more of these as states, but at the same time we're > > > not doing away with the bit definitions. I think the states are much > > > easier to decode and use if we think about the function of each bit, > > > which leads to the logical incongruity that the 11Xb states are > > > impossible and therefore must be error states. Yes, that's fine. > > > > > > > I agree on bit definition is better. > > > > Ok. Should there be a defined value for error, which can be used by > > vendor driver for error state? > > > > #define VFIO_DEVICE_STATE_ERROR \ > > (VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING) > > Seems like a good idea for consistency. Thanks, > > Alex Agreed, I like that as well.
Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
On Thu, 9 Jan 2020 02:11:11 +0530 Kirti Wankhede wrote: > On 1/9/2020 12:01 AM, Alex Williamson wrote: > > On Wed, 8 Jan 2020 15:59:55 +0100 > > Cornelia Huck wrote: > > > >> On Tue, 7 Jan 2020 11:56:02 -0700 > >> Alex Williamson wrote: > >> > >>> On Tue, 7 Jan 2020 23:23:17 +0530 > >>> Kirti Wankhede wrote: > >> > There are 3 invalid states: > * 101b => Invalid state > * 110b => Invalid state > * 111b => Invalid state > > why only 110b should be used to report error from vendor driver to > report error? Aren't we adding more confusions in the interface? > >>> > >>> I think the only chance of confusion is poor documentation. If we > >>> define all of the above as invalid and then say any invalid state > >>> indicates an error condition, then the burden is on the user to > >>> enumerate all the invalid states. That's not a good idea. Instead we > >>> could say 101b (_RESUMING|_RUNNING) is reserved, it's not currently > >>> used but it might be useful some day. Therefore there are no valid > >>> transitions into or out of this state. A vendor driver should fail a > >>> write(2) attempting to enter this state. > >>> > >>> That leaves 11Xb, where we consider _RESUMING and _SAVING as mutually > >>> exclusive, so neither are likely to ever be valid states. Logically, > >>> if the device is in a failed state such that it needs to be reset to be > >>> recovered, I would hope the device is not running, so !_RUNNING (110b) > >>> seems appropriate. I'm not sure we need that level of detail yet > >>> though, so I was actually just assuming both 11Xb states would indicate > >>> an error state and the undefined _RUNNING bit might differentiate > >>> something in the future. > >>> > >>> Therefore, I think we'd have: > >>> > >>> * 101b => Reserved > >>> * 11Xb => Error > >>> > >>> Where the device can only self transition into the Error state on a > >>> failed device_state transition and the only exit from the Error state > >>> is via the reset ioctl. The Reserved state is unreachable. The vendor > >>> driver must error on device_state writes to enter or exit the Error > >>> state and must error on writes to enter Reserved states. Is that still > >>> confusing? > >> > >> I think one thing we could do is start to tie the meaning more to the > >> actual state (bit combination) and less to the individual bits. I.e. > >> > >> - bit 0 indicates 'running', > >> - bit 1 indicates 'saving', > >> - bit 2 indicates 'resuming', > >> - bits 3-31 are reserved. [Aside: reserved-and-ignored or > >>reserved-and-must-be-zero?] > > > > This version specified them as: > > > > Bits 3 - 31 are reserved for future use. User should perform > > read-modify-write operation on this field. > > > > The intention is that the user should not make any assumptions about > > the state of the reserved bits, but should preserve them when changing > > known bits. Therefore I think it's ignored but preserved. If we > > specify them as zero, then I think we lose any chance to define them > > later. > > > >> [Note that I don't specify what happens when a bit is set or unset.] > >> > >> States are then defined as: > >> 000b => stopped state (not saving or resuming) > >> 001b => running state (not saving or resuming) > >> 010b => stop-and-copy state > >> 011b => pre-copy state > >> 100b => resuming state > >> > >> [Transitions between these states defined, as before.] > >> > >> 101b => reserved [for post-copy; no transitions defined] > >> 111b => reserved [state does not make sense; no transitions defined] > >> 110b => error state [state does not make sense per se, but it does not > >> indicate running; transitions into this state *are* possible] > >> > >> To a 'reserved' state, we can later assign a different meaning (we > >> could even re-use 111b for a different error state, if needed); while > >> the error state must always stay the error state. > >> > >> We should probably use some kind of feature indication to signify > >> whether a 'reserved' state actually has a meaning. Also, maybe we also > >> should designate the states > 111b as 'reserved'. > >> > >> Does that make sense? > > > > It seems you have an opinion to restrict this particular error state to > > 110b rather than 11Xb, reserving 111b for some future error condition. > > That's fine and I think we agree that using the state with _RUNNING set > > to zero is more logical as we expect the device to be non-operational > > in this state. > > > > I'm also thinking more of these as states, but at the same time we're > > not doing away with the bit definitions. I think the states are much > > easier to decode and use if we think about the function of each bit, > > which leads to the logical incongruity that the 11Xb states are > > impossible and therefore must be error states. > > > > I agree on bit definition is better. > > Ok. Should there be a defined value for error, which can be
Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
On 1/9/2020 12:01 AM, Alex Williamson wrote: On Wed, 8 Jan 2020 15:59:55 +0100 Cornelia Huck wrote: On Tue, 7 Jan 2020 11:56:02 -0700 Alex Williamson wrote: On Tue, 7 Jan 2020 23:23:17 +0530 Kirti Wankhede wrote: There are 3 invalid states: * 101b => Invalid state * 110b => Invalid state * 111b => Invalid state why only 110b should be used to report error from vendor driver to report error? Aren't we adding more confusions in the interface? I think the only chance of confusion is poor documentation. If we define all of the above as invalid and then say any invalid state indicates an error condition, then the burden is on the user to enumerate all the invalid states. That's not a good idea. Instead we could say 101b (_RESUMING|_RUNNING) is reserved, it's not currently used but it might be useful some day. Therefore there are no valid transitions into or out of this state. A vendor driver should fail a write(2) attempting to enter this state. That leaves 11Xb, where we consider _RESUMING and _SAVING as mutually exclusive, so neither are likely to ever be valid states. Logically, if the device is in a failed state such that it needs to be reset to be recovered, I would hope the device is not running, so !_RUNNING (110b) seems appropriate. I'm not sure we need that level of detail yet though, so I was actually just assuming both 11Xb states would indicate an error state and the undefined _RUNNING bit might differentiate something in the future. Therefore, I think we'd have: * 101b => Reserved * 11Xb => Error Where the device can only self transition into the Error state on a failed device_state transition and the only exit from the Error state is via the reset ioctl. The Reserved state is unreachable. The vendor driver must error on device_state writes to enter or exit the Error state and must error on writes to enter Reserved states. Is that still confusing? I think one thing we could do is start to tie the meaning more to the actual state (bit combination) and less to the individual bits. I.e. - bit 0 indicates 'running', - bit 1 indicates 'saving', - bit 2 indicates 'resuming', - bits 3-31 are reserved. [Aside: reserved-and-ignored or reserved-and-must-be-zero?] This version specified them as: Bits 3 - 31 are reserved for future use. User should perform read-modify-write operation on this field. The intention is that the user should not make any assumptions about the state of the reserved bits, but should preserve them when changing known bits. Therefore I think it's ignored but preserved. If we specify them as zero, then I think we lose any chance to define them later. [Note that I don't specify what happens when a bit is set or unset.] States are then defined as: 000b => stopped state (not saving or resuming) 001b => running state (not saving or resuming) 010b => stop-and-copy state 011b => pre-copy state 100b => resuming state [Transitions between these states defined, as before.] 101b => reserved [for post-copy; no transitions defined] 111b => reserved [state does not make sense; no transitions defined] 110b => error state [state does not make sense per se, but it does not indicate running; transitions into this state *are* possible] To a 'reserved' state, we can later assign a different meaning (we could even re-use 111b for a different error state, if needed); while the error state must always stay the error state. We should probably use some kind of feature indication to signify whether a 'reserved' state actually has a meaning. Also, maybe we also should designate the states > 111b as 'reserved'. Does that make sense? It seems you have an opinion to restrict this particular error state to 110b rather than 11Xb, reserving 111b for some future error condition. That's fine and I think we agree that using the state with _RUNNING set to zero is more logical as we expect the device to be non-operational in this state. I'm also thinking more of these as states, but at the same time we're not doing away with the bit definitions. I think the states are much easier to decode and use if we think about the function of each bit, which leads to the logical incongruity that the 11Xb states are impossible and therefore must be error states. I agree on bit definition is better. Ok. Should there be a defined value for error, which can be used by vendor driver for error state? #define VFIO_DEVICE_STATE_ERROR \ (VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RESUMING) Thanks, Kirti I took a look at drawing a state transitions diagram, but I think we're fully interconnected for the 6 states we're defining. The user can invoke transition to any of the 5 states Connie lists above from any of those states and the 6th error state is only reached via failed transition and only exited via device reset, returning the user to the running state. There are a couple transitions of questionable
Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
On Wed, 8 Jan 2020 15:59:55 +0100 Cornelia Huck wrote: > On Tue, 7 Jan 2020 11:56:02 -0700 > Alex Williamson wrote: > > > On Tue, 7 Jan 2020 23:23:17 +0530 > > Kirti Wankhede wrote: > > > > There are 3 invalid states: > > > * 101b => Invalid state > > > * 110b => Invalid state > > > * 111b => Invalid state > > > > > > why only 110b should be used to report error from vendor driver to > > > report error? Aren't we adding more confusions in the interface? > > > > I think the only chance of confusion is poor documentation. If we > > define all of the above as invalid and then say any invalid state > > indicates an error condition, then the burden is on the user to > > enumerate all the invalid states. That's not a good idea. Instead we > > could say 101b (_RESUMING|_RUNNING) is reserved, it's not currently > > used but it might be useful some day. Therefore there are no valid > > transitions into or out of this state. A vendor driver should fail a > > write(2) attempting to enter this state. > > > > That leaves 11Xb, where we consider _RESUMING and _SAVING as mutually > > exclusive, so neither are likely to ever be valid states. Logically, > > if the device is in a failed state such that it needs to be reset to be > > recovered, I would hope the device is not running, so !_RUNNING (110b) > > seems appropriate. I'm not sure we need that level of detail yet > > though, so I was actually just assuming both 11Xb states would indicate > > an error state and the undefined _RUNNING bit might differentiate > > something in the future. > > > > Therefore, I think we'd have: > > > > * 101b => Reserved > > * 11Xb => Error > > > > Where the device can only self transition into the Error state on a > > failed device_state transition and the only exit from the Error state > > is via the reset ioctl. The Reserved state is unreachable. The vendor > > driver must error on device_state writes to enter or exit the Error > > state and must error on writes to enter Reserved states. Is that still > > confusing? > > I think one thing we could do is start to tie the meaning more to the > actual state (bit combination) and less to the individual bits. I.e. > > - bit 0 indicates 'running', > - bit 1 indicates 'saving', > - bit 2 indicates 'resuming', > - bits 3-31 are reserved. [Aside: reserved-and-ignored or > reserved-and-must-be-zero?] This version specified them as: Bits 3 - 31 are reserved for future use. User should perform read-modify-write operation on this field. The intention is that the user should not make any assumptions about the state of the reserved bits, but should preserve them when changing known bits. Therefore I think it's ignored but preserved. If we specify them as zero, then I think we lose any chance to define them later. > [Note that I don't specify what happens when a bit is set or unset.] > > States are then defined as: > 000b => stopped state (not saving or resuming) > 001b => running state (not saving or resuming) > 010b => stop-and-copy state > 011b => pre-copy state > 100b => resuming state > > [Transitions between these states defined, as before.] > > 101b => reserved [for post-copy; no transitions defined] > 111b => reserved [state does not make sense; no transitions defined] > 110b => error state [state does not make sense per se, but it does not > indicate running; transitions into this state *are* possible] > > To a 'reserved' state, we can later assign a different meaning (we > could even re-use 111b for a different error state, if needed); while > the error state must always stay the error state. > > We should probably use some kind of feature indication to signify > whether a 'reserved' state actually has a meaning. Also, maybe we also > should designate the states > 111b as 'reserved'. > > Does that make sense? It seems you have an opinion to restrict this particular error state to 110b rather than 11Xb, reserving 111b for some future error condition. That's fine and I think we agree that using the state with _RUNNING set to zero is more logical as we expect the device to be non-operational in this state. I'm also thinking more of these as states, but at the same time we're not doing away with the bit definitions. I think the states are much easier to decode and use if we think about the function of each bit, which leads to the logical incongruity that the 11Xb states are impossible and therefore must be error states. I took a look at drawing a state transitions diagram, but I think we're fully interconnected for the 6 states we're defining. The user can invoke transition to any of the 5 states Connie lists above from any of those states and the 6th error state is only reached via failed transition and only exited via device reset, returning the user to the running state. There are a couple transitions of questionable value, particularly 01Xb -> 100b (_SAVING -> _RESUMING), but I can't convince myself that it
Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
On Tue, 7 Jan 2020 11:56:02 -0700 Alex Williamson wrote: > On Tue, 7 Jan 2020 23:23:17 +0530 > Kirti Wankhede wrote: > > There are 3 invalid states: > > * 101b => Invalid state > > * 110b => Invalid state > > * 111b => Invalid state > > > > why only 110b should be used to report error from vendor driver to > > report error? Aren't we adding more confusions in the interface? > > I think the only chance of confusion is poor documentation. If we > define all of the above as invalid and then say any invalid state > indicates an error condition, then the burden is on the user to > enumerate all the invalid states. That's not a good idea. Instead we > could say 101b (_RESUMING|_RUNNING) is reserved, it's not currently > used but it might be useful some day. Therefore there are no valid > transitions into or out of this state. A vendor driver should fail a > write(2) attempting to enter this state. > > That leaves 11Xb, where we consider _RESUMING and _SAVING as mutually > exclusive, so neither are likely to ever be valid states. Logically, > if the device is in a failed state such that it needs to be reset to be > recovered, I would hope the device is not running, so !_RUNNING (110b) > seems appropriate. I'm not sure we need that level of detail yet > though, so I was actually just assuming both 11Xb states would indicate > an error state and the undefined _RUNNING bit might differentiate > something in the future. > > Therefore, I think we'd have: > > * 101b => Reserved > * 11Xb => Error > > Where the device can only self transition into the Error state on a > failed device_state transition and the only exit from the Error state > is via the reset ioctl. The Reserved state is unreachable. The vendor > driver must error on device_state writes to enter or exit the Error > state and must error on writes to enter Reserved states. Is that still > confusing? I think one thing we could do is start to tie the meaning more to the actual state (bit combination) and less to the individual bits. I.e. - bit 0 indicates 'running', - bit 1 indicates 'saving', - bit 2 indicates 'resuming', - bits 3-31 are reserved. [Aside: reserved-and-ignored or reserved-and-must-be-zero?] [Note that I don't specify what happens when a bit is set or unset.] States are then defined as: 000b => stopped state (not saving or resuming) 001b => running state (not saving or resuming) 010b => stop-and-copy state 011b => pre-copy state 100b => resuming state [Transitions between these states defined, as before.] 101b => reserved [for post-copy; no transitions defined] 111b => reserved [state does not make sense; no transitions defined] 110b => error state [state does not make sense per se, but it does not indicate running; transitions into this state *are* possible] To a 'reserved' state, we can later assign a different meaning (we could even re-use 111b for a different error state, if needed); while the error state must always stay the error state. We should probably use some kind of feature indication to signify whether a 'reserved' state actually has a meaning. Also, maybe we also should designate the states > 111b as 'reserved'. Does that make sense?
Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
On Tue, 7 Jan 2020 23:23:17 +0530 Kirti Wankhede wrote: > On 1/7/2020 10:39 PM, Alex Williamson wrote: > > On Tue, 7 Jan 2020 12:58:22 +0530 > > Kirti Wankhede wrote: > > > >> On 1/7/2020 4:48 AM, Alex Williamson wrote: > >>> On Thu, 2 Jan 2020 18:25:37 + > >>> "Dr. David Alan Gilbert" wrote: > >>> > * Alex Williamson (alex.william...@redhat.com) wrote: > > On Fri, 20 Dec 2019 01:40:35 +0530 > > Kirti Wankhede wrote: > > > >> On 12/19/2019 10:57 PM, Alex Williamson wrote: > >> > >> > >> > > > > >> > >> If device state it at pre-copy state (011b). > >> Transition, i.e., write to device state as stop-and-copy state (010b) > >> failed, then by previous state I meant device should return pre-copy > >> state(011b), i.e. previous state which was successfully set, or as you > >> said current state which was successfully set. > > > > Yes, the point I'm trying to make is that this version of the spec > > tries to tell the user what they should do upon error according to our > > current interpretation of the QEMU migration protocol. We're not > > defining the QEMU migration protocol, we're defining something that can > > be used in a way to support that protocol. So I think we should be > > concerned with defining our spec, for example my proposal would be: "If > > a state transition fails the user can read device_state to determine the > > current state of the device. This should be the previous state of the > > device unless the vendor driver has encountered an internal error, in > > which case the device may report the invalid device_state 110b. The > > user must use the device reset ioctl in order to recover the device > > from this state. If the device is indicated in a valid device state > > via reading device_state, the user may attempt to transition the device > > to any valid state reachable from the current state." > > We might want to be able to distinguish between: > a) The device has failed and needs a reset > b) The migration has failed > >>> > >>> I think the above provides this. For Kirti's example above of > >>> transitioning from pre-copy to stop-and-copy, the device could refuse > >>> to transition to stop-and-copy, generating an error on the write() of > >>> device_state. The user re-reading device_state would allow them to > >>> determine the current device state, still in pre-copy or failed. Only > >>> the latter would require a device reset. > >>> > If some part of the devices mechanics for migration fail, but the device > is otherwise operational then we should be able to decide to fail the > migration without taking the device down, which might be very bad for > the VM. > Losing a VM during migration due to a problem with migration really > annoys users; it's one thing the migration failing, but taking the VM > out as well really gets to them. > > Having the device automatically transition back to the 'running' state > seems a bad idea to me; much better to tell the hypervisor and provide > it with a way to clean up; for example, imagine a system with multiple > devices that are being migrated, most of them have happily transitioned > to stop-and-copy, but then the last device decides to fail - so now > someone is going to have to take all of them back to running. > >>> > >>> Right, unless I'm missing one, it seems invalid->running is the only > >>> self transition the device should make, though still by way of user > >>> interaction via the reset ioctl. Thanks, > >>> > >> > >> Instead of using invalid state by vendor driver on device failure, I > >> think better to reserve one bit in device state which vendor driver can > >> set on device failure. When error bit is set, other bits in device state > >> should be ignored. > > > > Why is a separate bit better? Saving and Restoring states are mutually > > exclusive, so we have an unused and invalid device state already > > without burning another bit. Thanks, > > > > There are 3 invalid states: > * 101b => Invalid state > * 110b => Invalid state > * 111b => Invalid state > > why only 110b should be used to report error from vendor driver to > report error? Aren't we adding more confusions in the interface? I think the only chance of confusion is poor documentation. If we define all of the above as invalid and then say any invalid state indicates an error condition, then the burden is on the user to enumerate all the invalid states. That's not a good idea. Instead we could say 101b (_RESUMING|_RUNNING) is reserved, it's not currently used but it might be useful some day. Therefore there are no valid transitions into or out of this state. A vendor driver should fail a write(2) attempting to enter this state. That leaves
Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
On 1/7/2020 10:39 PM, Alex Williamson wrote: On Tue, 7 Jan 2020 12:58:22 +0530 Kirti Wankhede wrote: On 1/7/2020 4:48 AM, Alex Williamson wrote: On Thu, 2 Jan 2020 18:25:37 + "Dr. David Alan Gilbert" wrote: * Alex Williamson (alex.william...@redhat.com) wrote: On Fri, 20 Dec 2019 01:40:35 +0530 Kirti Wankhede wrote: On 12/19/2019 10:57 PM, Alex Williamson wrote: If device state it at pre-copy state (011b). Transition, i.e., write to device state as stop-and-copy state (010b) failed, then by previous state I meant device should return pre-copy state(011b), i.e. previous state which was successfully set, or as you said current state which was successfully set. Yes, the point I'm trying to make is that this version of the spec tries to tell the user what they should do upon error according to our current interpretation of the QEMU migration protocol. We're not defining the QEMU migration protocol, we're defining something that can be used in a way to support that protocol. So I think we should be concerned with defining our spec, for example my proposal would be: "If a state transition fails the user can read device_state to determine the current state of the device. This should be the previous state of the device unless the vendor driver has encountered an internal error, in which case the device may report the invalid device_state 110b. The user must use the device reset ioctl in order to recover the device from this state. If the device is indicated in a valid device state via reading device_state, the user may attempt to transition the device to any valid state reachable from the current state." We might want to be able to distinguish between: a) The device has failed and needs a reset b) The migration has failed I think the above provides this. For Kirti's example above of transitioning from pre-copy to stop-and-copy, the device could refuse to transition to stop-and-copy, generating an error on the write() of device_state. The user re-reading device_state would allow them to determine the current device state, still in pre-copy or failed. Only the latter would require a device reset. If some part of the devices mechanics for migration fail, but the device is otherwise operational then we should be able to decide to fail the migration without taking the device down, which might be very bad for the VM. Losing a VM during migration due to a problem with migration really annoys users; it's one thing the migration failing, but taking the VM out as well really gets to them. Having the device automatically transition back to the 'running' state seems a bad idea to me; much better to tell the hypervisor and provide it with a way to clean up; for example, imagine a system with multiple devices that are being migrated, most of them have happily transitioned to stop-and-copy, but then the last device decides to fail - so now someone is going to have to take all of them back to running. Right, unless I'm missing one, it seems invalid->running is the only self transition the device should make, though still by way of user interaction via the reset ioctl. Thanks, Instead of using invalid state by vendor driver on device failure, I think better to reserve one bit in device state which vendor driver can set on device failure. When error bit is set, other bits in device state should be ignored. Why is a separate bit better? Saving and Restoring states are mutually exclusive, so we have an unused and invalid device state already without burning another bit. Thanks, There are 3 invalid states: * 101b => Invalid state * 110b => Invalid state * 111b => Invalid state why only 110b should be used to report error from vendor driver to report error? Aren't we adding more confusions in the interface? Only 3 bits from 32 bits are yet used, one bit can be spared to represent error state from vendor driver. Thanks, Kirti
Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
* Alex Williamson (alex.william...@redhat.com) wrote: > On Tue, 7 Jan 2020 09:57:40 + > "Dr. David Alan Gilbert" wrote: > > > * Alex Williamson (alex.william...@redhat.com) wrote: > > > On Thu, 2 Jan 2020 18:25:37 + > > > "Dr. David Alan Gilbert" wrote: > > > > > > > * Alex Williamson (alex.william...@redhat.com) wrote: > > > > > On Fri, 20 Dec 2019 01:40:35 +0530 > > > > > Kirti Wankhede wrote: > > > > > > > > > > > On 12/19/2019 10:57 PM, Alex Williamson wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > If device state it at pre-copy state (011b). > > > > > > Transition, i.e., write to device state as stop-and-copy state > > > > > > (010b) > > > > > > failed, then by previous state I meant device should return > > > > > > pre-copy > > > > > > state(011b), i.e. previous state which was successfully set, or as > > > > > > you > > > > > > said current state which was successfully set. > > > > > > > > > > Yes, the point I'm trying to make is that this version of the spec > > > > > tries to tell the user what they should do upon error according to our > > > > > current interpretation of the QEMU migration protocol. We're not > > > > > defining the QEMU migration protocol, we're defining something that > > > > > can > > > > > be used in a way to support that protocol. So I think we should be > > > > > concerned with defining our spec, for example my proposal would be: > > > > > "If > > > > > a state transition fails the user can read device_state to determine > > > > > the > > > > > current state of the device. This should be the previous state of the > > > > > device unless the vendor driver has encountered an internal error, in > > > > > which case the device may report the invalid device_state 110b. The > > > > > user must use the device reset ioctl in order to recover the device > > > > > from this state. If the device is indicated in a valid device state > > > > > via reading device_state, the user may attempt to transition the > > > > > device > > > > > to any valid state reachable from the current state." > > > > > > > > We might want to be able to distinguish between: > > > > a) The device has failed and needs a reset > > > > b) The migration has failed > > > > > > I think the above provides this. For Kirti's example above of > > > transitioning from pre-copy to stop-and-copy, the device could refuse > > > to transition to stop-and-copy, generating an error on the write() of > > > device_state. The user re-reading device_state would allow them to > > > determine the current device state, still in pre-copy or failed. Only > > > the latter would require a device reset. > > > > OK - but that doesn't give you any way to figure out 'why' it failed; > > I guess I was expecting you to then read an 'error' register to find > > out what happened. > > Assuming the write() to transition to stop-and-copy fails and you're > > still in pre-copy, what's the defined thing you're supposed to do next? > > Decide migration has failed and then do a write() to transition to running? > > Defining semantics for an error register seems like a project on its > own. We do have flags, we could use them to add an error register > later, but I think it's only going to rat hole this effort to try to > incorporate that now. OK, to be honest I didn't really mean for that thing to be used by code to decide on it's next action, rather to have something to report when it failed. > The state machine is fairly small, so in the > scenario you present, I think the user would assume a failure at > pre-copy to stop-and-copy transition would fail the migration and the > device could go back to running state. If the device then fails to > return to the running state, we might be stuck with a device with > reduced performance or overhead and the user could warn about that and > continue with the device as-is. The vendor drivers could make use of > -EAGAIN on transition failure to indicate a temporary issue, but > otherwise the user should probably consider it a persistent error until > either a device reset or start of a new migration sequence (ie. return > to running and start over). Thanks, OK as long as we define somewhere that the action on a failed transition is then try and transitino to running before restarting the VM and fail the migration. Dave > Alex -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
On Tue, 7 Jan 2020 12:58:22 +0530 Kirti Wankhede wrote: > On 1/7/2020 4:48 AM, Alex Williamson wrote: > > On Thu, 2 Jan 2020 18:25:37 + > > "Dr. David Alan Gilbert" wrote: > > > >> * Alex Williamson (alex.william...@redhat.com) wrote: > >>> On Fri, 20 Dec 2019 01:40:35 +0530 > >>> Kirti Wankhede wrote: > >>> > On 12/19/2019 10:57 PM, Alex Williamson wrote: > > > > >> > >> > >> > > If device state it at pre-copy state (011b). > Transition, i.e., write to device state as stop-and-copy state (010b) > failed, then by previous state I meant device should return pre-copy > state(011b), i.e. previous state which was successfully set, or as you > said current state which was successfully set. > >>> > >>> Yes, the point I'm trying to make is that this version of the spec > >>> tries to tell the user what they should do upon error according to our > >>> current interpretation of the QEMU migration protocol. We're not > >>> defining the QEMU migration protocol, we're defining something that can > >>> be used in a way to support that protocol. So I think we should be > >>> concerned with defining our spec, for example my proposal would be: "If > >>> a state transition fails the user can read device_state to determine the > >>> current state of the device. This should be the previous state of the > >>> device unless the vendor driver has encountered an internal error, in > >>> which case the device may report the invalid device_state 110b. The > >>> user must use the device reset ioctl in order to recover the device > >>> from this state. If the device is indicated in a valid device state > >>> via reading device_state, the user may attempt to transition the device > >>> to any valid state reachable from the current state." > >> > >> We might want to be able to distinguish between: > >>a) The device has failed and needs a reset > >>b) The migration has failed > > > > I think the above provides this. For Kirti's example above of > > transitioning from pre-copy to stop-and-copy, the device could refuse > > to transition to stop-and-copy, generating an error on the write() of > > device_state. The user re-reading device_state would allow them to > > determine the current device state, still in pre-copy or failed. Only > > the latter would require a device reset. > > > >> If some part of the devices mechanics for migration fail, but the device > >> is otherwise operational then we should be able to decide to fail the > >> migration without taking the device down, which might be very bad for > >> the VM. > >> Losing a VM during migration due to a problem with migration really > >> annoys users; it's one thing the migration failing, but taking the VM > >> out as well really gets to them. > >> > >> Having the device automatically transition back to the 'running' state > >> seems a bad idea to me; much better to tell the hypervisor and provide > >> it with a way to clean up; for example, imagine a system with multiple > >> devices that are being migrated, most of them have happily transitioned > >> to stop-and-copy, but then the last device decides to fail - so now > >> someone is going to have to take all of them back to running. > > > > Right, unless I'm missing one, it seems invalid->running is the only > > self transition the device should make, though still by way of user > > interaction via the reset ioctl. Thanks, > > > > Instead of using invalid state by vendor driver on device failure, I > think better to reserve one bit in device state which vendor driver can > set on device failure. When error bit is set, other bits in device state > should be ignored. Why is a separate bit better? Saving and Restoring states are mutually exclusive, so we have an unused and invalid device state already without burning another bit. Thanks, Alex
Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
On Tue, 7 Jan 2020 09:57:40 + "Dr. David Alan Gilbert" wrote: > * Alex Williamson (alex.william...@redhat.com) wrote: > > On Thu, 2 Jan 2020 18:25:37 + > > "Dr. David Alan Gilbert" wrote: > > > > > * Alex Williamson (alex.william...@redhat.com) wrote: > > > > On Fri, 20 Dec 2019 01:40:35 +0530 > > > > Kirti Wankhede wrote: > > > > > > > > > On 12/19/2019 10:57 PM, Alex Williamson wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > If device state it at pre-copy state (011b). > > > > > Transition, i.e., write to device state as stop-and-copy state (010b) > > > > > failed, then by previous state I meant device should return pre-copy > > > > > state(011b), i.e. previous state which was successfully set, or as > > > > > you > > > > > said current state which was successfully set. > > > > > > > > Yes, the point I'm trying to make is that this version of the spec > > > > tries to tell the user what they should do upon error according to our > > > > current interpretation of the QEMU migration protocol. We're not > > > > defining the QEMU migration protocol, we're defining something that can > > > > be used in a way to support that protocol. So I think we should be > > > > concerned with defining our spec, for example my proposal would be: "If > > > > a state transition fails the user can read device_state to determine the > > > > current state of the device. This should be the previous state of the > > > > device unless the vendor driver has encountered an internal error, in > > > > which case the device may report the invalid device_state 110b. The > > > > user must use the device reset ioctl in order to recover the device > > > > from this state. If the device is indicated in a valid device state > > > > via reading device_state, the user may attempt to transition the device > > > > to any valid state reachable from the current state." > > > > > > We might want to be able to distinguish between: > > > a) The device has failed and needs a reset > > > b) The migration has failed > > > > I think the above provides this. For Kirti's example above of > > transitioning from pre-copy to stop-and-copy, the device could refuse > > to transition to stop-and-copy, generating an error on the write() of > > device_state. The user re-reading device_state would allow them to > > determine the current device state, still in pre-copy or failed. Only > > the latter would require a device reset. > > OK - but that doesn't give you any way to figure out 'why' it failed; > I guess I was expecting you to then read an 'error' register to find > out what happened. > Assuming the write() to transition to stop-and-copy fails and you're > still in pre-copy, what's the defined thing you're supposed to do next? > Decide migration has failed and then do a write() to transition to running? Defining semantics for an error register seems like a project on its own. We do have flags, we could use them to add an error register later, but I think it's only going to rat hole this effort to try to incorporate that now. The state machine is fairly small, so in the scenario you present, I think the user would assume a failure at pre-copy to stop-and-copy transition would fail the migration and the device could go back to running state. If the device then fails to return to the running state, we might be stuck with a device with reduced performance or overhead and the user could warn about that and continue with the device as-is. The vendor drivers could make use of -EAGAIN on transition failure to indicate a temporary issue, but otherwise the user should probably consider it a persistent error until either a device reset or start of a new migration sequence (ie. return to running and start over). Thanks, Alex
Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
* Alex Williamson (alex.william...@redhat.com) wrote: > On Thu, 2 Jan 2020 18:25:37 + > "Dr. David Alan Gilbert" wrote: > > > * Alex Williamson (alex.william...@redhat.com) wrote: > > > On Fri, 20 Dec 2019 01:40:35 +0530 > > > Kirti Wankhede wrote: > > > > > > > On 12/19/2019 10:57 PM, Alex Williamson wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > If device state it at pre-copy state (011b). > > > > Transition, i.e., write to device state as stop-and-copy state (010b) > > > > failed, then by previous state I meant device should return pre-copy > > > > state(011b), i.e. previous state which was successfully set, or as you > > > > said current state which was successfully set. > > > > > > Yes, the point I'm trying to make is that this version of the spec > > > tries to tell the user what they should do upon error according to our > > > current interpretation of the QEMU migration protocol. We're not > > > defining the QEMU migration protocol, we're defining something that can > > > be used in a way to support that protocol. So I think we should be > > > concerned with defining our spec, for example my proposal would be: "If > > > a state transition fails the user can read device_state to determine the > > > current state of the device. This should be the previous state of the > > > device unless the vendor driver has encountered an internal error, in > > > which case the device may report the invalid device_state 110b. The > > > user must use the device reset ioctl in order to recover the device > > > from this state. If the device is indicated in a valid device state > > > via reading device_state, the user may attempt to transition the device > > > to any valid state reachable from the current state." > > > > We might want to be able to distinguish between: > > a) The device has failed and needs a reset > > b) The migration has failed > > I think the above provides this. For Kirti's example above of > transitioning from pre-copy to stop-and-copy, the device could refuse > to transition to stop-and-copy, generating an error on the write() of > device_state. The user re-reading device_state would allow them to > determine the current device state, still in pre-copy or failed. Only > the latter would require a device reset. OK - but that doesn't give you any way to figure out 'why' it failed; I guess I was expecting you to then read an 'error' register to find out what happened. Assuming the write() to transition to stop-and-copy fails and you're still in pre-copy, what's the defined thing you're supposed to do next? Decide migration has failed and then do a write() to transition to running? > > If some part of the devices mechanics for migration fail, but the device > > is otherwise operational then we should be able to decide to fail the > > migration without taking the device down, which might be very bad for > > the VM. > > Losing a VM during migration due to a problem with migration really > > annoys users; it's one thing the migration failing, but taking the VM > > out as well really gets to them. > > > > Having the device automatically transition back to the 'running' state > > seems a bad idea to me; much better to tell the hypervisor and provide > > it with a way to clean up; for example, imagine a system with multiple > > devices that are being migrated, most of them have happily transitioned > > to stop-and-copy, but then the last device decides to fail - so now > > someone is going to have to take all of them back to running. > > Right, unless I'm missing one, it seems invalid->running is the only > self transition the device should make, though still by way of user > interaction via the reset ioctl. Thanks, > o Dave > Alex -- Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
On 1/7/2020 4:48 AM, Alex Williamson wrote: On Thu, 2 Jan 2020 18:25:37 + "Dr. David Alan Gilbert" wrote: * Alex Williamson (alex.william...@redhat.com) wrote: On Fri, 20 Dec 2019 01:40:35 +0530 Kirti Wankhede wrote: On 12/19/2019 10:57 PM, Alex Williamson wrote: If device state it at pre-copy state (011b). Transition, i.e., write to device state as stop-and-copy state (010b) failed, then by previous state I meant device should return pre-copy state(011b), i.e. previous state which was successfully set, or as you said current state which was successfully set. Yes, the point I'm trying to make is that this version of the spec tries to tell the user what they should do upon error according to our current interpretation of the QEMU migration protocol. We're not defining the QEMU migration protocol, we're defining something that can be used in a way to support that protocol. So I think we should be concerned with defining our spec, for example my proposal would be: "If a state transition fails the user can read device_state to determine the current state of the device. This should be the previous state of the device unless the vendor driver has encountered an internal error, in which case the device may report the invalid device_state 110b. The user must use the device reset ioctl in order to recover the device from this state. If the device is indicated in a valid device state via reading device_state, the user may attempt to transition the device to any valid state reachable from the current state." We might want to be able to distinguish between: a) The device has failed and needs a reset b) The migration has failed I think the above provides this. For Kirti's example above of transitioning from pre-copy to stop-and-copy, the device could refuse to transition to stop-and-copy, generating an error on the write() of device_state. The user re-reading device_state would allow them to determine the current device state, still in pre-copy or failed. Only the latter would require a device reset. If some part of the devices mechanics for migration fail, but the device is otherwise operational then we should be able to decide to fail the migration without taking the device down, which might be very bad for the VM. Losing a VM during migration due to a problem with migration really annoys users; it's one thing the migration failing, but taking the VM out as well really gets to them. Having the device automatically transition back to the 'running' state seems a bad idea to me; much better to tell the hypervisor and provide it with a way to clean up; for example, imagine a system with multiple devices that are being migrated, most of them have happily transitioned to stop-and-copy, but then the last device decides to fail - so now someone is going to have to take all of them back to running. Right, unless I'm missing one, it seems invalid->running is the only self transition the device should make, though still by way of user interaction via the reset ioctl. Thanks, Instead of using invalid state by vendor driver on device failure, I think better to reserve one bit in device state which vendor driver can set on device failure. When error bit is set, other bits in device state should be ignored. Thanks, Kirti
Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
On Thu, 2 Jan 2020 18:25:37 + "Dr. David Alan Gilbert" wrote: > * Alex Williamson (alex.william...@redhat.com) wrote: > > On Fri, 20 Dec 2019 01:40:35 +0530 > > Kirti Wankhede wrote: > > > > > On 12/19/2019 10:57 PM, Alex Williamson wrote: > > > > > > > > > > > > > > > > > > If device state it at pre-copy state (011b). > > > Transition, i.e., write to device state as stop-and-copy state (010b) > > > failed, then by previous state I meant device should return pre-copy > > > state(011b), i.e. previous state which was successfully set, or as you > > > said current state which was successfully set. > > > > Yes, the point I'm trying to make is that this version of the spec > > tries to tell the user what they should do upon error according to our > > current interpretation of the QEMU migration protocol. We're not > > defining the QEMU migration protocol, we're defining something that can > > be used in a way to support that protocol. So I think we should be > > concerned with defining our spec, for example my proposal would be: "If > > a state transition fails the user can read device_state to determine the > > current state of the device. This should be the previous state of the > > device unless the vendor driver has encountered an internal error, in > > which case the device may report the invalid device_state 110b. The > > user must use the device reset ioctl in order to recover the device > > from this state. If the device is indicated in a valid device state > > via reading device_state, the user may attempt to transition the device > > to any valid state reachable from the current state." > > We might want to be able to distinguish between: > a) The device has failed and needs a reset > b) The migration has failed I think the above provides this. For Kirti's example above of transitioning from pre-copy to stop-and-copy, the device could refuse to transition to stop-and-copy, generating an error on the write() of device_state. The user re-reading device_state would allow them to determine the current device state, still in pre-copy or failed. Only the latter would require a device reset. > If some part of the devices mechanics for migration fail, but the device > is otherwise operational then we should be able to decide to fail the > migration without taking the device down, which might be very bad for > the VM. > Losing a VM during migration due to a problem with migration really > annoys users; it's one thing the migration failing, but taking the VM > out as well really gets to them. > > Having the device automatically transition back to the 'running' state > seems a bad idea to me; much better to tell the hypervisor and provide > it with a way to clean up; for example, imagine a system with multiple > devices that are being migrated, most of them have happily transitioned > to stop-and-copy, but then the last device decides to fail - so now > someone is going to have to take all of them back to running. Right, unless I'm missing one, it seems invalid->running is the only self transition the device should make, though still by way of user interaction via the reset ioctl. Thanks, Alex
Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
* Alex Williamson (alex.william...@redhat.com) wrote: > On Fri, 20 Dec 2019 01:40:35 +0530 > Kirti Wankhede wrote: > > > On 12/19/2019 10:57 PM, Alex Williamson wrote: > > > > > > > > > > If device state it at pre-copy state (011b). > > Transition, i.e., write to device state as stop-and-copy state (010b) > > failed, then by previous state I meant device should return pre-copy > > state(011b), i.e. previous state which was successfully set, or as you > > said current state which was successfully set. > > Yes, the point I'm trying to make is that this version of the spec > tries to tell the user what they should do upon error according to our > current interpretation of the QEMU migration protocol. We're not > defining the QEMU migration protocol, we're defining something that can > be used in a way to support that protocol. So I think we should be > concerned with defining our spec, for example my proposal would be: "If > a state transition fails the user can read device_state to determine the > current state of the device. This should be the previous state of the > device unless the vendor driver has encountered an internal error, in > which case the device may report the invalid device_state 110b. The > user must use the device reset ioctl in order to recover the device > from this state. If the device is indicated in a valid device state > via reading device_state, the user may attempt to transition the device > to any valid state reachable from the current state." We might want to be able to distinguish between: a) The device has failed and needs a reset b) The migration has failed If some part of the devices mechanics for migration fail, but the device is otherwise operational then we should be able to decide to fail the migration without taking the device down, which might be very bad for the VM. Losing a VM during migration due to a problem with migration really annoys users; it's one thing the migration failing, but taking the VM out as well really gets to them. Having the device automatically transition back to the 'running' state seems a bad idea to me; much better to tell the hypervisor and provide it with a way to clean up; for example, imagine a system with multiple devices that are being migrated, most of them have happily transitioned to stop-and-copy, but then the last device decides to fail - so now someone is going to have to take all of them back to running. Dave > > >>> and allowable state transitions independent > > >>> of the expected usage model. > > >> > > >> Do you mean to define array of ['from','to'], same as runstate > > >> transition array in QEMU? > > >>static const RunStateTransition runstate_transitions_def[] > > > > > > I'm thinking that independent of expected QEMU usage models, are there > > > any invalid transitions or is every state reachable from every other > > > state. I'm afraid this design is so focused on a specific usage model > > > that vendor drivers are going to fall over if the user invokes a > > > transition outside of those listed above. If there are invalid > > > transitions, those should be listed so they can be handled > > > consistently. If there are no invalid transitions, it should be noted > > > in the spec to encourage vendor drivers to expect this. > > > > > > > I think vendor driver can decide which state transitions it can support, > > rather than defining/prescribing that all. > > Suppose, if vendor driver doesn't want to support save-restore > > functionality, then vendor driver can return error -EINVAL for write() > > operation on device_state for transition from _RUNNING to > > stop-and-copy(010b) state. > > This is unsupportable. If the vendor driver doesn't want to support > save-restore then they simply do not implement the migration > extensions. If they expose this interface then the user (QEMU) will > rightfully assume that the device supports migration, only to find out > upon trying to use it that it's unsupported, or maybe broken. > > > >>> For example, I think a user should always > > >>> be allowed to transition a device to stopped regardless of the expected > > >>> migration flow. An error might have occurred elsewhere and we want to > > >>> stop everything for debugging. I think it's also allowable to switch > > >>> directly from running to stop-and-copy, for example to save and resume > > >>> a VM offline. > > >>> > > > Also, it seems like it's the vendor driver's discretion to actually > > > provide data during the pre-copy phase. As we've defined it, the > > > vendor driver needs to participate in the migration region regardless, > > > they might just always report no pending_bytes until we enter > > > stop-and-copy. > > > > > > > Yes. And if pending_bytes are reported as 0 in pre-copy by vendor > > driver > > then QEMU doesn't reiterate for that device. > > >>> > > >>> Maybe we can state that as the expected m
Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
On Fri, 20 Dec 2019 01:40:35 +0530 Kirti Wankhede wrote: > On 12/19/2019 10:57 PM, Alex Williamson wrote: > > > > > >> + * 2. When VFIO application save state or suspend application, VFIO > >> device > >> + *state transition is from _RUNNING to stop-and-copy state and > >> then to > >> + *_STOP. > >> + *On state transition from _RUNNING to stop-and-copy, driver must > >> + *stop device, save device state and send it to application > >> through > >> + *migration region. > >> + *On _RUNNING to stop-and-copy state transition failure, > >> application should > >> + *set VFIO device state to _RUNNING. > > > > A state transition failure means that the user's write to device_state > > failed, so is it the user's responsibility to set the next state? > > Right. > >>> > >>> If a transition failure occurs, ie. errno from write(2), what value is > >>> reported by a read(2) of device_state in the interim between the failure > >>> and a next state written by the user? > >> > >> Since state transition has failed, driver should return previous state. > >> > >>> If this is a valid state, > >>> wouldn't it be reasonable for the user to assume the device is already > >>> operating in that state? > > > > This ^^^ > > > >> If it's an invalid state, do we need to > >>> define the use cases for those invalid states? If the user needs to > >>> set the state back to _RUNNING, that suggests the device might be > >>> stopped, which has implications beyond the migration state. > >>> > >> > >> Not necessarily stopped. For example, during live migration: > >> > >> * _RESUMING _RUNNINGPre-copyStop-and-copy _STOP > >> *(100b) (001b) (011b)(010b) (000b) > >> * > >> * 3. Save state during live migration > >> * |--->|>|-->| > >> > >> on any state transition failure, user should set _RUNNING state. > >> pre-copy (011b) -> stop-and-copy(010b) => _SAVING flag is cleared > >> and device returned back to _RUNNING. > >> Stop-and-copy(010b) -> STOP (000b) > device is already stopped. > > > > IMO, the user may modify the state, but the vendor driver should report > > the current state of the device via device_state, which the user may > > read after a transition error occurs. The spec lacks a provision for > > indicating the device is in a non-functional state. > > > > Are you proposing to add a bit in device state to report error? > > #define VFIO_DEVICE_STATE_ERROR (1 << 3) > > which can be set by vendor driver and when its set, other bits set/clear > doesn't matter. We can represent an invalid state with the bits we've defined, for instance 110b (_SAVING|_RESTORING) is bogus. We could define that as a state the vendor driver can use to report the device is in an error condition. > >Why > > is it necessarily _RUNNING vs _STOP? > > > > While changing From pre-copy to stop-and-copy transition, device is > still running, only saving of device state started. Now if transition to > stop-and-copy fails, from user point of view application or VM is still > running, device state should be set to _RUNNING so that whatever the > application/VM is running should continue at source. > >>> > >>> Seems it's the users discretion whether to consider this continuable or > >>> fatal, the vfio interface specification should support a given usage > >>> model, not prescribe it. > >>> > >> > >> Updating comment. > >> > >> + * 3. In VFIO application live migration, state transition is from > >> _RUNNING > >> + *to pre-copy to stop-and-copy to _STOP. > >> + *On state transition from _RUNNING to pre-copy, driver should > >> start > >> + *gathering device state while application is still running and > >> send device > >> + *state data to application through migration region. > >> + *On state transition from pre-copy to stop-and-copy, driver must > >> stop > >> + *device, save device state and send it to application through > >> migration > >> + *region. > >> + *On any failure during any of these state transition, VFIO > >> device state > >> + *should be set to _RUNNING. > > > > Same comment as above regarding next state on failure. > > > > If application or VM migration fails, it should continue to run at > source. In case of VM, guest user isn't aware of migration, and from his > point VM should be running. > >>> > >>> vfio is not prescribing the migration semantics to userspace, it's > >>> presenting an interface that support the user semantics. Therefore, > >>> while it's useful to understand the expected usage model, I think we > >>> also need a mechanism that the user can always dete
Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
On 12/19/2019 10:57 PM, Alex Williamson wrote: + * 2. When VFIO application save state or suspend application, VFIO device + *state transition is from _RUNNING to stop-and-copy state and then to + *_STOP. + *On state transition from _RUNNING to stop-and-copy, driver must + *stop device, save device state and send it to application through + *migration region. + *On _RUNNING to stop-and-copy state transition failure, application should + *set VFIO device state to _RUNNING. A state transition failure means that the user's write to device_state failed, so is it the user's responsibility to set the next state? Right. If a transition failure occurs, ie. errno from write(2), what value is reported by a read(2) of device_state in the interim between the failure and a next state written by the user? Since state transition has failed, driver should return previous state. If this is a valid state, wouldn't it be reasonable for the user to assume the device is already operating in that state? This ^^^ If it's an invalid state, do we need to define the use cases for those invalid states? If the user needs to set the state back to _RUNNING, that suggests the device might be stopped, which has implications beyond the migration state. Not necessarily stopped. For example, during live migration: * _RESUMING _RUNNINGPre-copyStop-and-copy _STOP *(100b) (001b) (011b)(010b) (000b) * * 3. Save state during live migration * |--->|>|-->| on any state transition failure, user should set _RUNNING state. pre-copy (011b) -> stop-and-copy(010b) => _SAVING flag is cleared and device returned back to _RUNNING. Stop-and-copy(010b) -> STOP (000b) > device is already stopped. IMO, the user may modify the state, but the vendor driver should report the current state of the device via device_state, which the user may read after a transition error occurs. The spec lacks a provision for indicating the device is in a non-functional state. Are you proposing to add a bit in device state to report error? #define VFIO_DEVICE_STATE_ERROR (1 << 3) which can be set by vendor driver and when its set, other bits set/clear doesn't matter. Why is it necessarily _RUNNING vs _STOP? While changing From pre-copy to stop-and-copy transition, device is still running, only saving of device state started. Now if transition to stop-and-copy fails, from user point of view application or VM is still running, device state should be set to _RUNNING so that whatever the application/VM is running should continue at source. Seems it's the users discretion whether to consider this continuable or fatal, the vfio interface specification should support a given usage model, not prescribe it. Updating comment. + * 3. In VFIO application live migration, state transition is from _RUNNING + *to pre-copy to stop-and-copy to _STOP. + *On state transition from _RUNNING to pre-copy, driver should start + *gathering device state while application is still running and send device + *state data to application through migration region. + *On state transition from pre-copy to stop-and-copy, driver must stop + *device, save device state and send it to application through migration + *region. + *On any failure during any of these state transition, VFIO device state + *should be set to _RUNNING. Same comment as above regarding next state on failure. If application or VM migration fails, it should continue to run at source. In case of VM, guest user isn't aware of migration, and from his point VM should be running. vfio is not prescribing the migration semantics to userspace, it's presenting an interface that support the user semantics. Therefore, while it's useful to understand the expected usage model, I think we also need a mechanism that the user can always determine the device_state after a fault If state transition fails, device is in previous state and driver should return previous state Then why is it stated that the user needs to set the _RUNNING state? It's the user's choice. But I do think we're lacking a state to indicate an internal fault. If device state it at pre-copy state (011b). Transition, i.e., write to device state as stop-and-copy state (010b) failed, then by previous state I meant device should return pre-copy state(011b), i.e. previous state which was successfully set, or as you said current state which was successfully set. and allowable state transitions independent of the expected usage model. Do you mean to define array of ['from','to'], same as runstate transition array in QEMU? static const RunStateTransition runstate_transitions_def[] I'm thinking that independent of expected QEMU usage models, are there any invalid transitions or is every state reachable from every oth
Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
On Thu, 19 Dec 2019 21:38:44 +0530 Kirti Wankhede wrote: > On 12/18/2019 12:13 AM, Alex Williamson wrote: > > On Tue, 17 Dec 2019 11:58:44 +0530 > > Kirti Wankhede wrote: > > > >> On 12/17/2019 4:14 AM, Alex Williamson wrote: > >>> On Tue, 17 Dec 2019 01:51:36 +0530 > >>> Kirti Wankhede wrote: > >>> > - Defined MIGRATION region type and sub-type. > > - Defined vfio_device_migration_info structure which will be placed at > 0th > offset of migration region to get/set VFIO device related > information. > Defined members of structure and usage on read/write access. > > - Defined device states and added state transition details in the > comment. > > - Added sequence to be followed while saving and resuming VFIO device > state > > Signed-off-by: Kirti Wankhede > Reviewed-by: Neo Jia > --- > include/uapi/linux/vfio.h | 180 > ++ > 1 file changed, 180 insertions(+) > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > index 9e843a147ead..a0817ba267c1 100644 > --- a/include/uapi/linux/vfio.h > +++ b/include/uapi/linux/vfio.h > @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type { > #define VFIO_REGION_TYPE_PCI_VENDOR_MASK (0x) > #define VFIO_REGION_TYPE_GFX(1) > #define VFIO_REGION_TYPE_CCW (2) > +#define VFIO_REGION_TYPE_MIGRATION (3) > > /* sub-types for VFIO_REGION_TYPE_PCI_* */ > > @@ -379,6 +380,185 @@ struct vfio_region_gfx_edid { > /* sub-types for VFIO_REGION_TYPE_CCW */ > #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD (1) > > +/* sub-types for VFIO_REGION_TYPE_MIGRATION */ > +#define VFIO_REGION_SUBTYPE_MIGRATION (1) > + > +/* > + * Structure vfio_device_migration_info is placed at 0th offset of > + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related > migration > + * information. Field accesses from this structure are only supported > at their > + * native width and alignment, otherwise the result is undefined and > vendor > + * drivers should return an error. > + * > + * device_state: (read/write) > + * To indicate vendor driver the state VFIO device should be > transitioned > + * to. If device state transition fails, write on this field > return error. > + * It consists of 3 bits: > + * - If bit 0 set, indicates _RUNNING state. When its clear, that > indicates > >>> > >>> s/its/it's/ > >>> > + *_STOP state. When device is changed to _STOP, driver should > stop > + *device before write() returns. > + * - If bit 1 set, indicates _SAVING state. When set, that > indicates driver > + *should start gathering device state information which will be > provided > + *to VFIO user space application to save device's state. > + * - If bit 2 set, indicates _RESUMING state. When set, that > indicates > + *prepare to resume device, data provided through migration > region > + *should be used to resume device. > + * Bits 3 - 31 are reserved for future use. User should perform > + * read-modify-write operation on this field. > + * > + * +--- _RESUMING > + * |+-- _SAVING > + * ||+- _RUNNING > + * ||| > + * 000b => Device Stopped, not saving or resuming > + * 001b => Device running state, default state > + * 010b => Stop Device & save device state, stop-and-copy state > + * 011b => Device running and save device state, pre-copy state > + * 100b => Device stopped and device state is resuming > + * 101b => Invalid state > >>> > >>> Eventually this would be intended for post-copy, if supported by the > >>> device, right? > >>> > >> > >> No, as per Yan mentioned in earlier version, _RESUMING + _RUNNING can't > >> be used for post-copy. New flag will be required for post-copy. > >> > >> https://www.mail-archive.com/qemu-devel@nongnu.org/msg658768.html > >> > + * 110b => Invalid state > + * 111b => Invalid state > + * > + * State transitions: > + * > + * _RESUMING _RUNNINGPre-copyStop-and-copy _STOP > + *(100b) (001b) (011b)(010b) > (000b) > + * 0. Running or Default state > + * | > + * > + * 1. Normal Shutdown > >>> > >>> Optional, userspace is under no obligation. > >>> > + * |->| > + * > + * 2. Save state or Suspend > + *
Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
On 12/18/2019 12:13 AM, Alex Williamson wrote: On Tue, 17 Dec 2019 11:58:44 +0530 Kirti Wankhede wrote: On 12/17/2019 4:14 AM, Alex Williamson wrote: On Tue, 17 Dec 2019 01:51:36 +0530 Kirti Wankhede wrote: - Defined MIGRATION region type and sub-type. - Defined vfio_device_migration_info structure which will be placed at 0th offset of migration region to get/set VFIO device related information. Defined members of structure and usage on read/write access. - Defined device states and added state transition details in the comment. - Added sequence to be followed while saving and resuming VFIO device state Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia --- include/uapi/linux/vfio.h | 180 ++ 1 file changed, 180 insertions(+) diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 9e843a147ead..a0817ba267c1 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type { #define VFIO_REGION_TYPE_PCI_VENDOR_MASK (0x) #define VFIO_REGION_TYPE_GFX(1) #define VFIO_REGION_TYPE_CCW (2) +#define VFIO_REGION_TYPE_MIGRATION (3) /* sub-types for VFIO_REGION_TYPE_PCI_* */ @@ -379,6 +380,185 @@ struct vfio_region_gfx_edid { /* sub-types for VFIO_REGION_TYPE_CCW */ #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD(1) +/* sub-types for VFIO_REGION_TYPE_MIGRATION */ +#define VFIO_REGION_SUBTYPE_MIGRATION (1) + +/* + * Structure vfio_device_migration_info is placed at 0th offset of + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration + * information. Field accesses from this structure are only supported at their + * native width and alignment, otherwise the result is undefined and vendor + * drivers should return an error. + * + * device_state: (read/write) + * To indicate vendor driver the state VFIO device should be transitioned + * to. If device state transition fails, write on this field return error. + * It consists of 3 bits: + * - If bit 0 set, indicates _RUNNING state. When its clear, that indicates s/its/it's/ + *_STOP state. When device is changed to _STOP, driver should stop + *device before write() returns. + * - If bit 1 set, indicates _SAVING state. When set, that indicates driver + *should start gathering device state information which will be provided + *to VFIO user space application to save device's state. + * - If bit 2 set, indicates _RESUMING state. When set, that indicates + *prepare to resume device, data provided through migration region + *should be used to resume device. + * Bits 3 - 31 are reserved for future use. User should perform + * read-modify-write operation on this field. + * + * +--- _RESUMING + * |+-- _SAVING + * ||+- _RUNNING + * ||| + * 000b => Device Stopped, not saving or resuming + * 001b => Device running state, default state + * 010b => Stop Device & save device state, stop-and-copy state + * 011b => Device running and save device state, pre-copy state + * 100b => Device stopped and device state is resuming + * 101b => Invalid state Eventually this would be intended for post-copy, if supported by the device, right? No, as per Yan mentioned in earlier version, _RESUMING + _RUNNING can't be used for post-copy. New flag will be required for post-copy. https://www.mail-archive.com/qemu-devel@nongnu.org/msg658768.html + * 110b => Invalid state + * 111b => Invalid state + * + * State transitions: + * + * _RESUMING _RUNNINGPre-copyStop-and-copy _STOP + *(100b) (001b) (011b)(010b) (000b) + * 0. Running or Default state + * | + * + * 1. Normal Shutdown Optional, userspace is under no obligation. + * |->| + * + * 2. Save state or Suspend + * |->|-->| + * + * 3. Save state during live migration + * |--->|>|-->| + * + * 4. Resuming + * |<-| + * + * 5. Resumed + * |->| + * + * 0. Default state of VFIO device is _RUNNNG when VFIO application starts. + * 1. During normal VFIO application shutdown, vfio device state changes + *from _RUNNING to _STOP. We cannot impose this requirement on existing userspace. Userspace may perform this action, but they are not required to and the vendor driver must not require it. Updated comment. + * 2. When VFIO application save state or suspend application, VFIO device + *state transition is from _RUNNING to stop-and-copy state and then to + *_STOP. + *On state transition from _RUNNING to stop-and-copy, dr
Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
On Tue, 17 Dec 2019 11:58:44 +0530 Kirti Wankhede wrote: > On 12/17/2019 4:14 AM, Alex Williamson wrote: > > On Tue, 17 Dec 2019 01:51:36 +0530 > > Kirti Wankhede wrote: > > > >> - Defined MIGRATION region type and sub-type. > >> > >> - Defined vfio_device_migration_info structure which will be placed at 0th > >>offset of migration region to get/set VFIO device related information. > >>Defined members of structure and usage on read/write access. > >> > >> - Defined device states and added state transition details in the comment. > >> > >> - Added sequence to be followed while saving and resuming VFIO device state > >> > >> Signed-off-by: Kirti Wankhede > >> Reviewed-by: Neo Jia > >> --- > >> include/uapi/linux/vfio.h | 180 > >> ++ > >> 1 file changed, 180 insertions(+) > >> > >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > >> index 9e843a147ead..a0817ba267c1 100644 > >> --- a/include/uapi/linux/vfio.h > >> +++ b/include/uapi/linux/vfio.h > >> @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type { > >> #define VFIO_REGION_TYPE_PCI_VENDOR_MASK (0x) > >> #define VFIO_REGION_TYPE_GFX(1) > >> #define VFIO_REGION_TYPE_CCW (2) > >> +#define VFIO_REGION_TYPE_MIGRATION (3) > >> > >> /* sub-types for VFIO_REGION_TYPE_PCI_* */ > >> > >> @@ -379,6 +380,185 @@ struct vfio_region_gfx_edid { > >> /* sub-types for VFIO_REGION_TYPE_CCW */ > >> #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD(1) > >> > >> +/* sub-types for VFIO_REGION_TYPE_MIGRATION */ > >> +#define VFIO_REGION_SUBTYPE_MIGRATION (1) > >> + > >> +/* > >> + * Structure vfio_device_migration_info is placed at 0th offset of > >> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related > >> migration > >> + * information. Field accesses from this structure are only supported at > >> their > >> + * native width and alignment, otherwise the result is undefined and > >> vendor > >> + * drivers should return an error. > >> + * > >> + * device_state: (read/write) > >> + * To indicate vendor driver the state VFIO device should be > >> transitioned > >> + * to. If device state transition fails, write on this field return > >> error. > >> + * It consists of 3 bits: > >> + * - If bit 0 set, indicates _RUNNING state. When its clear, that > >> indicates > > > > s/its/it's/ > > > >> + *_STOP state. When device is changed to _STOP, driver should stop > >> + *device before write() returns. > >> + * - If bit 1 set, indicates _SAVING state. When set, that indicates > >> driver > >> + *should start gathering device state information which will be > >> provided > >> + *to VFIO user space application to save device's state. > >> + * - If bit 2 set, indicates _RESUMING state. When set, that > >> indicates > >> + *prepare to resume device, data provided through migration region > >> + *should be used to resume device. > >> + * Bits 3 - 31 are reserved for future use. User should perform > >> + * read-modify-write operation on this field. > >> + * > >> + * +--- _RESUMING > >> + * |+-- _SAVING > >> + * ||+- _RUNNING > >> + * ||| > >> + * 000b => Device Stopped, not saving or resuming > >> + * 001b => Device running state, default state > >> + * 010b => Stop Device & save device state, stop-and-copy state > >> + * 011b => Device running and save device state, pre-copy state > >> + * 100b => Device stopped and device state is resuming > >> + * 101b => Invalid state > > > > Eventually this would be intended for post-copy, if supported by the > > device, right? > > > > No, as per Yan mentioned in earlier version, _RESUMING + _RUNNING can't > be used for post-copy. New flag will be required for post-copy. > > https://www.mail-archive.com/qemu-devel@nongnu.org/msg658768.html > > >> + * 110b => Invalid state > >> + * 111b => Invalid state > >> + * > >> + * State transitions: > >> + * > >> + * _RESUMING _RUNNINGPre-copyStop-and-copy _STOP > >> + *(100b) (001b) (011b)(010b) (000b) > >> + * 0. Running or Default state > >> + * | > >> + * > >> + * 1. Normal Shutdown > > > > Optional, userspace is under no obligation. > > > >> + * |->| > >> + * > >> + * 2. Save state or Suspend > >> + * |->|-->| > >> + * > >> + * 3. Save state during live migration > >> + * |--->|>|-->| > >> + * > >> + * 4. Resuming > >> + * |<-| > >> + * > >> + * 5. Resumed > >> + * |->| > >> + * > >> + * 0. Default state of VFIO device is _RUNNNG when VFIO application > >> starts. >
Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
On Tue, Dec 17, 2019 at 02:28:44PM +0800, Kirti Wankhede wrote: > > > On 12/17/2019 4:14 AM, Alex Williamson wrote: > > On Tue, 17 Dec 2019 01:51:36 +0530 > > Kirti Wankhede wrote: > > > >> - Defined MIGRATION region type and sub-type. > >> > >> - Defined vfio_device_migration_info structure which will be placed at 0th > >>offset of migration region to get/set VFIO device related information. > >>Defined members of structure and usage on read/write access. > >> > >> - Defined device states and added state transition details in the comment. > >> > >> - Added sequence to be followed while saving and resuming VFIO device state > >> > >> Signed-off-by: Kirti Wankhede > >> Reviewed-by: Neo Jia > >> --- > >> include/uapi/linux/vfio.h | 180 > >> ++ > >> 1 file changed, 180 insertions(+) > >> > >> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > >> index 9e843a147ead..a0817ba267c1 100644 > >> --- a/include/uapi/linux/vfio.h > >> +++ b/include/uapi/linux/vfio.h > >> @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type { > >> #define VFIO_REGION_TYPE_PCI_VENDOR_MASK (0x) > >> #define VFIO_REGION_TYPE_GFX(1) > >> #define VFIO_REGION_TYPE_CCW (2) > >> +#define VFIO_REGION_TYPE_MIGRATION (3) > >> > >> /* sub-types for VFIO_REGION_TYPE_PCI_* */ > >> > >> @@ -379,6 +380,185 @@ struct vfio_region_gfx_edid { > >> /* sub-types for VFIO_REGION_TYPE_CCW */ > >> #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD(1) > >> > >> +/* sub-types for VFIO_REGION_TYPE_MIGRATION */ > >> +#define VFIO_REGION_SUBTYPE_MIGRATION (1) > >> + > >> +/* > >> + * Structure vfio_device_migration_info is placed at 0th offset of > >> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related > >> migration > >> + * information. Field accesses from this structure are only supported at > >> their > >> + * native width and alignment, otherwise the result is undefined and > >> vendor > >> + * drivers should return an error. > >> + * > >> + * device_state: (read/write) > >> + * To indicate vendor driver the state VFIO device should be > >> transitioned > >> + * to. If device state transition fails, write on this field return > >> error. > >> + * It consists of 3 bits: > >> + * - If bit 0 set, indicates _RUNNING state. When its clear, that > >> indicates > > > > s/its/it's/ > > > >> + *_STOP state. When device is changed to _STOP, driver should stop > >> + *device before write() returns. > >> + * - If bit 1 set, indicates _SAVING state. When set, that indicates > >> driver > >> + *should start gathering device state information which will be > >> provided > >> + *to VFIO user space application to save device's state. > >> + * - If bit 2 set, indicates _RESUMING state. When set, that > >> indicates > >> + *prepare to resume device, data provided through migration region > >> + *should be used to resume device. > >> + * Bits 3 - 31 are reserved for future use. User should perform > >> + * read-modify-write operation on this field. > >> + * > >> + * +--- _RESUMING > >> + * |+-- _SAVING > >> + * ||+- _RUNNING > >> + * ||| > >> + * 000b => Device Stopped, not saving or resuming > >> + * 001b => Device running state, default state > >> + * 010b => Stop Device & save device state, stop-and-copy state > >> + * 011b => Device running and save device state, pre-copy state > >> + * 100b => Device stopped and device state is resuming > >> + * 101b => Invalid state > > > > Eventually this would be intended for post-copy, if supported by the > > device, right? > > > > No, as per Yan mentioned in earlier version, _RESUMING + _RUNNING can't > be used for post-copy. New flag will be required for post-copy. > > https://www.mail-archive.com/qemu-devel@nongnu.org/msg658768.html sorry, I didn't mean _RESUMING + _RUNNING can't be used for post-copy. I just mean another POSCOPY state needs to be introduced. But I'm not sure what _RESUMING state is for. actually, we do nothing in response to _RESUMING state, no matter precopy or poscopy. If in your side_RESUMING state means it is allowed to restore device data, then I think _RESUMING + _RUNNING is a valid state for postcopy. Thanks Yan
Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
On 12/17/2019 4:14 AM, Alex Williamson wrote: On Tue, 17 Dec 2019 01:51:36 +0530 Kirti Wankhede wrote: - Defined MIGRATION region type and sub-type. - Defined vfio_device_migration_info structure which will be placed at 0th offset of migration region to get/set VFIO device related information. Defined members of structure and usage on read/write access. - Defined device states and added state transition details in the comment. - Added sequence to be followed while saving and resuming VFIO device state Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia --- include/uapi/linux/vfio.h | 180 ++ 1 file changed, 180 insertions(+) diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 9e843a147ead..a0817ba267c1 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type { #define VFIO_REGION_TYPE_PCI_VENDOR_MASK (0x) #define VFIO_REGION_TYPE_GFX(1) #define VFIO_REGION_TYPE_CCW (2) +#define VFIO_REGION_TYPE_MIGRATION (3) /* sub-types for VFIO_REGION_TYPE_PCI_* */ @@ -379,6 +380,185 @@ struct vfio_region_gfx_edid { /* sub-types for VFIO_REGION_TYPE_CCW */ #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD (1) +/* sub-types for VFIO_REGION_TYPE_MIGRATION */ +#define VFIO_REGION_SUBTYPE_MIGRATION (1) + +/* + * Structure vfio_device_migration_info is placed at 0th offset of + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration + * information. Field accesses from this structure are only supported at their + * native width and alignment, otherwise the result is undefined and vendor + * drivers should return an error. + * + * device_state: (read/write) + * To indicate vendor driver the state VFIO device should be transitioned + * to. If device state transition fails, write on this field return error. + * It consists of 3 bits: + * - If bit 0 set, indicates _RUNNING state. When its clear, that indicates s/its/it's/ + *_STOP state. When device is changed to _STOP, driver should stop + *device before write() returns. + * - If bit 1 set, indicates _SAVING state. When set, that indicates driver + *should start gathering device state information which will be provided + *to VFIO user space application to save device's state. + * - If bit 2 set, indicates _RESUMING state. When set, that indicates + *prepare to resume device, data provided through migration region + *should be used to resume device. + * Bits 3 - 31 are reserved for future use. User should perform + * read-modify-write operation on this field. + * + * +--- _RESUMING + * |+-- _SAVING + * ||+- _RUNNING + * ||| + * 000b => Device Stopped, not saving or resuming + * 001b => Device running state, default state + * 010b => Stop Device & save device state, stop-and-copy state + * 011b => Device running and save device state, pre-copy state + * 100b => Device stopped and device state is resuming + * 101b => Invalid state Eventually this would be intended for post-copy, if supported by the device, right? No, as per Yan mentioned in earlier version, _RESUMING + _RUNNING can't be used for post-copy. New flag will be required for post-copy. https://www.mail-archive.com/qemu-devel@nongnu.org/msg658768.html + * 110b => Invalid state + * 111b => Invalid state + * + * State transitions: + * + * _RESUMING _RUNNINGPre-copyStop-and-copy _STOP + *(100b) (001b) (011b)(010b) (000b) + * 0. Running or Default state + * | + * + * 1. Normal Shutdown Optional, userspace is under no obligation. + * |->| + * + * 2. Save state or Suspend + * |->|-->| + * + * 3. Save state during live migration + * |--->|>|-->| + * + * 4. Resuming + * |<-| + * + * 5. Resumed + * |->| + * + * 0. Default state of VFIO device is _RUNNNG when VFIO application starts. + * 1. During normal VFIO application shutdown, vfio device state changes + *from _RUNNING to _STOP. We cannot impose this requirement on existing userspace. Userspace may perform this action, but they are not required to and the vendor driver must not require it. Updated comment. + * 2. When VFIO application save state or suspend application, VFIO device + *state transition is from _RUNNING to stop-and-copy state and then to + *_STOP. + *On state transition from _RUNNING to stop-and-copy, driver must + *stop device, save device state and send it to application through + *migration region. + *On _RUNNING to s
Re: [PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
On Tue, 17 Dec 2019 01:51:36 +0530 Kirti Wankhede wrote: > - Defined MIGRATION region type and sub-type. > > - Defined vfio_device_migration_info structure which will be placed at 0th > offset of migration region to get/set VFIO device related information. > Defined members of structure and usage on read/write access. > > - Defined device states and added state transition details in the comment. > > - Added sequence to be followed while saving and resuming VFIO device state > > Signed-off-by: Kirti Wankhede > Reviewed-by: Neo Jia > --- > include/uapi/linux/vfio.h | 180 > ++ > 1 file changed, 180 insertions(+) > > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h > index 9e843a147ead..a0817ba267c1 100644 > --- a/include/uapi/linux/vfio.h > +++ b/include/uapi/linux/vfio.h > @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type { > #define VFIO_REGION_TYPE_PCI_VENDOR_MASK (0x) > #define VFIO_REGION_TYPE_GFX(1) > #define VFIO_REGION_TYPE_CCW (2) > +#define VFIO_REGION_TYPE_MIGRATION (3) > > /* sub-types for VFIO_REGION_TYPE_PCI_* */ > > @@ -379,6 +380,185 @@ struct vfio_region_gfx_edid { > /* sub-types for VFIO_REGION_TYPE_CCW */ > #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD(1) > > +/* sub-types for VFIO_REGION_TYPE_MIGRATION */ > +#define VFIO_REGION_SUBTYPE_MIGRATION (1) > + > +/* > + * Structure vfio_device_migration_info is placed at 0th offset of > + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related > migration > + * information. Field accesses from this structure are only supported at > their > + * native width and alignment, otherwise the result is undefined and vendor > + * drivers should return an error. > + * > + * device_state: (read/write) > + * To indicate vendor driver the state VFIO device should be > transitioned > + * to. If device state transition fails, write on this field return > error. > + * It consists of 3 bits: > + * - If bit 0 set, indicates _RUNNING state. When its clear, that > indicates s/its/it's/ > + *_STOP state. When device is changed to _STOP, driver should stop > + *device before write() returns. > + * - If bit 1 set, indicates _SAVING state. When set, that indicates > driver > + *should start gathering device state information which will be > provided > + *to VFIO user space application to save device's state. > + * - If bit 2 set, indicates _RESUMING state. When set, that indicates > + *prepare to resume device, data provided through migration region > + *should be used to resume device. > + * Bits 3 - 31 are reserved for future use. User should perform > + * read-modify-write operation on this field. > + * > + * +--- _RESUMING > + * |+-- _SAVING > + * ||+- _RUNNING > + * ||| > + * 000b => Device Stopped, not saving or resuming > + * 001b => Device running state, default state > + * 010b => Stop Device & save device state, stop-and-copy state > + * 011b => Device running and save device state, pre-copy state > + * 100b => Device stopped and device state is resuming > + * 101b => Invalid state Eventually this would be intended for post-copy, if supported by the device, right? > + * 110b => Invalid state > + * 111b => Invalid state > + * > + * State transitions: > + * > + * _RESUMING _RUNNINGPre-copyStop-and-copy _STOP > + *(100b) (001b) (011b)(010b) (000b) > + * 0. Running or Default state > + * | > + * > + * 1. Normal Shutdown Optional, userspace is under no obligation. > + * |->| > + * > + * 2. Save state or Suspend > + * |->|-->| > + * > + * 3. Save state during live migration > + * |--->|>|-->| > + * > + * 4. Resuming > + * |<-| > + * > + * 5. Resumed > + * |->| > + * > + * 0. Default state of VFIO device is _RUNNNG when VFIO application starts. > + * 1. During normal VFIO application shutdown, vfio device state changes > + *from _RUNNING to _STOP. We cannot impose this requirement on existing userspace. Userspace may perform this action, but they are not required to and the vendor driver must not require it. > + * 2. When VFIO application save state or suspend application, VFIO device > + *state transition is from _RUNNING to stop-and-copy state and then to > + *_STOP. > + *On state transition from _RUNNING to stop-and-copy, driver must > + *stop device, save device state and send it to application through > + *migration region. > + *On _RUNNING to stop-and-copy state transition failure, application > should > + *set VFIO de
[PATCH v10 Kernel 1/5] vfio: KABI for migration interface for device state
- Defined MIGRATION region type and sub-type. - Defined vfio_device_migration_info structure which will be placed at 0th offset of migration region to get/set VFIO device related information. Defined members of structure and usage on read/write access. - Defined device states and added state transition details in the comment. - Added sequence to be followed while saving and resuming VFIO device state Signed-off-by: Kirti Wankhede Reviewed-by: Neo Jia --- include/uapi/linux/vfio.h | 180 ++ 1 file changed, 180 insertions(+) diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h index 9e843a147ead..a0817ba267c1 100644 --- a/include/uapi/linux/vfio.h +++ b/include/uapi/linux/vfio.h @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type { #define VFIO_REGION_TYPE_PCI_VENDOR_MASK (0x) #define VFIO_REGION_TYPE_GFX(1) #define VFIO_REGION_TYPE_CCW (2) +#define VFIO_REGION_TYPE_MIGRATION (3) /* sub-types for VFIO_REGION_TYPE_PCI_* */ @@ -379,6 +380,185 @@ struct vfio_region_gfx_edid { /* sub-types for VFIO_REGION_TYPE_CCW */ #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD (1) +/* sub-types for VFIO_REGION_TYPE_MIGRATION */ +#define VFIO_REGION_SUBTYPE_MIGRATION (1) + +/* + * Structure vfio_device_migration_info is placed at 0th offset of + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration + * information. Field accesses from this structure are only supported at their + * native width and alignment, otherwise the result is undefined and vendor + * drivers should return an error. + * + * device_state: (read/write) + * To indicate vendor driver the state VFIO device should be transitioned + * to. If device state transition fails, write on this field return error. + * It consists of 3 bits: + * - If bit 0 set, indicates _RUNNING state. When its clear, that indicates + *_STOP state. When device is changed to _STOP, driver should stop + *device before write() returns. + * - If bit 1 set, indicates _SAVING state. When set, that indicates driver + *should start gathering device state information which will be provided + *to VFIO user space application to save device's state. + * - If bit 2 set, indicates _RESUMING state. When set, that indicates + *prepare to resume device, data provided through migration region + *should be used to resume device. + * Bits 3 - 31 are reserved for future use. User should perform + * read-modify-write operation on this field. + * + * +--- _RESUMING + * |+-- _SAVING + * ||+- _RUNNING + * ||| + * 000b => Device Stopped, not saving or resuming + * 001b => Device running state, default state + * 010b => Stop Device & save device state, stop-and-copy state + * 011b => Device running and save device state, pre-copy state + * 100b => Device stopped and device state is resuming + * 101b => Invalid state + * 110b => Invalid state + * 111b => Invalid state + * + * State transitions: + * + * _RESUMING _RUNNINGPre-copyStop-and-copy _STOP + *(100b) (001b) (011b)(010b) (000b) + * 0. Running or Default state + * | + * + * 1. Normal Shutdown + * |->| + * + * 2. Save state or Suspend + * |->|-->| + * + * 3. Save state during live migration + * |--->|>|-->| + * + * 4. Resuming + * |<-| + * + * 5. Resumed + * |->| + * + * 0. Default state of VFIO device is _RUNNNG when VFIO application starts. + * 1. During normal VFIO application shutdown, vfio device state changes + *from _RUNNING to _STOP. + * 2. When VFIO application save state or suspend application, VFIO device + *state transition is from _RUNNING to stop-and-copy state and then to + *_STOP. + *On state transition from _RUNNING to stop-and-copy, driver must + *stop device, save device state and send it to application through + *migration region. + *On _RUNNING to stop-and-copy state transition failure, application should + *set VFIO device state to _RUNNING. + * 3. In VFIO application live migration, state transition is from _RUNNING + *to pre-copy to stop-and-copy to _STOP. + *On state transition from _RUNNING to pre-copy, driver should start + *gathering device state while application is still running and send device + *state data to application through migration region. + *On state transition from pre-copy to stop-and-copy, driver must stop + *device, save device state and send it to application through migration + *region. + *On any failure during any of these state transition, VFIO de