This is my attempt to make Xen migrations more robust. I'm not entirely sure
the approach is acceptable, but I couldn't come up with a nicer solution within
the current design and its limitations.

- FinalizeMigrationSrc/Dst should not depend on each other; they could be
treated as a single operation consisting of two idempotent steps (running in
parallel) that can be retried. In particular, updating the state of the record
between them is a bad idea, as it makes recovery more complicated than needed.

- Fixed a scenario on Xen where the instance keeps running on the source in the
renamed domain 'migrating-$name' while an empty paused domain is constructed on
the target. This happens when Xen fails to freeze the kernel on the source
node. Now migrate --cleanup should be able to recover from this failure mode.
The fix unfortunately contains a leaky abstraction, as it explicitly looks
for the renamed version of the instance.

- Fixed a scenario (on Xen) where the instance is running on a node that is not
recorded in config.data. Ganeti can now 'adopt' this instance. This previously
required editing config.data manually.

- Fixed a scenario (on Xen) where the instance migrates successfully, but the
Xen config doesn't get copied. Unfortunately FinalizeMigrationSrc is called
before Dst, and Src deletes the original file while Dst creates a new one. If
the operation is interrupted in between, the contents are lost. Recreating that
file manually is painful ops work: to regenerate the contents, we need the list
of associated block devices, which can't be queried from the hypervisor
backend. My proposed workaround is to add a side effect to StartInstance: if
the instance is already running, it either remains a no-op as before, or calls
RestoreInstance, which regenerates the missing file.
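
To illustrate the intended StartInstance behaviour from the last item, here is
a rough sketch against an in-memory stand-in for the hypervisor. The class, its
attributes and the method names (FakeXenHypervisor, start_instance,
restore_instance) are made up for illustration and are not the actual code in
lib/hypervisor/hv_xen.py; the config format is a placeholder too.

```python
# Hypothetical sketch: names and config format are assumptions for
# illustration only, not Ganeti's real hypervisor API.


class FakeXenHypervisor:
    """In-memory stand-in for a hypervisor backend."""

    def __init__(self):
        self.running = set()     # names of domains currently running
        self.config_files = {}   # instance name -> config file contents
        self.calls = []          # actions taken, for inspection

    def _write_config(self, name, disks):
        # The disk list has to come from the caller, since it cannot be
        # queried back from the hypervisor.
        self.config_files[name] = "name = %r\ndisk = %r\n" % (name, disks)

    def restore_instance(self, name, disks):
        """Regenerate a missing config file for a running instance."""
        self._write_config(name, disks)
        self.calls.append("restore")

    def start_instance(self, name, disks):
        """Start the instance, or repair its config if already running."""
        if name not in self.running:
            # Normal path: write the config and start the domain.
            self._write_config(name, disks)
            self.running.add(name)
            self.calls.append("start")
        elif name not in self.config_files:
            # Running, but the config was lost (e.g. an interrupted
            # finalize step): regenerate it instead of doing nothing.
            self.restore_instance(name, disks)
        else:
            # Running and config present: no-op, as before.
            self.calls.append("noop")


hv = FakeXenHypervisor()
disks = ["/dev/xenvg/disk0"]
hv.start_instance("inst1", disks)   # regular start
del hv.config_files["inst1"]        # simulate the lost config file
hv.start_instance("inst1", disks)   # regenerates the file
hv.start_instance("inst1", disks)   # nothing left to do
```

Because every call converges to the same end state (domain running, config
present), the operation can be retried safely, which is the same property
wanted from the two finalize steps.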

Viktor Bachraty (7):
  Code cleanup in hypervisor backend
  Fix instance state detection in _Shutdowninstance
  Make FinalizeMigration{Src,Dst} more robust
  Make finalize_migration_{src,dst} 'atomic'
  Make migrate --cleanup more robust
  Allow migrate --cleanup to adopt an instance
  StartInstance restores instance state if running

 lib/backend.py                    |  17 ++--
 lib/cmdlib/instance_migration.py  | 158 ++++++++++++++++++++++++++------------
 lib/hypervisor/hv_base.py         |  30 +++++++-
 lib/hypervisor/hv_fake.py         |  11 ---
 lib/hypervisor/hv_kvm/__init__.py |  74 ++++++++++--------
 lib/hypervisor/hv_xen.py          | 134 ++++++++++++++++++++++++++------
 6 files changed, 305 insertions(+), 119 deletions(-)

-- 
2.8.0.rc3.226.g39d4020
