After upgrading Ganeti from 2.6.2 to 2.8.1, which removed my "sleep"
patches, the very first live migration I ran hit the same problem:
root@wrn-vm1:~# gnt-instance migrate testvm.int.example.com
Instance testvm.int.example.com will be migrated. Note that
migration might impact the instance if anything goes wrong (e.g. due
to bugs in the hypervisor). Continue?
y/[n]/?: y
Tue Nov 5 11:08:30 2013 Migrating instance testvm.int.example.com
Tue Nov 5 11:08:30 2013 * checking disk consistency between source and target
Tue Nov 5 11:08:31 2013 * switching node wrn-vm2.int.example.com to secondary mode
Tue Nov 5 11:08:31 2013 * changing into standalone mode
Tue Nov 5 11:08:31 2013 * changing disks into dual-master mode
Tue Nov 5 11:08:32 2013 * wait until resync is done
Failure: command execution error:
Cannot resync disks on node wrn-vm2.int.example.com: DRBD device <<class
'ganeti.bdev.DRBD8'>: unique_id: ('192.168.8.102', 11001, '192.168.8.101',
11001, 0, 'c3e57192c295a443428f68ec8f9ff141f692f78f'), children: [<<class
'ganeti.bdev.LogicalVolume'>: unique_id: ('xenvg',
'10e24636-0481-4ded-b1d7-b9f97b2919d2.disk0_data'), children: [], 253:7,
/dev/xenvg/10e24636-0481-4ded-b1d7-b9f97b2919d2.disk0_data>, <<class
'ganeti.bdev.LogicalVolume'>: unique_id: ('xenvg',
'10e24636-0481-4ded-b1d7-b9f97b2919d2.disk0_meta'), children: [], 253:8,
/dev/xenvg/10e24636-0481-4ded-b1d7-b9f97b2919d2.disk0_meta>], 147:0,
/dev/drbd0> is not in sync: stats=<ganeti.bdev.DRBD8Status object at
0x23d6c10>
As a reminder, the underlying platform is Debian Wheezy on Dell R210-II
servers, with a back-to-back 10G link for the replication network.
I would say this is a very high priority issue, given that Ganeti is
apparently driving DRBD wrongly and this can cause the same failures on
more installations than just mine. The post at
http://lists.linbit.com/pipermail/drbd-user/2013-July/020173.html has all
the details. Personally I'd say that Ganeti should be using the DRBD "dual
master" mode (which means both sides *may* be promoted to master, but don't
*have* to be); that post also describes how to fix the current
disconnect/reconnect approach instead.
Anyway, I have put back my bandaid and it's OK for now:
--- /usr/local/lib/python2.7/dist-packages/ganeti/backend.py.orig 2013-11-05 10:58:42.161871214 +0000
+++ /usr/local/lib/python2.7/dist-packages/ganeti/backend.py 2013-11-05 11:18:37.614989450 +0000
@@ -3607,6 +3607,7 @@
   for rd in bdevs:
     try:
       rd.AttachNet(multimaster)
+      time.sleep(0.5)
     except errors.BlockDeviceError, err:
       _Fail("Can't change network configuration: %s", err)
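
A fixed half-second sleep only papers over the race, and the right delay
will differ between machines. A less timing-dependent variant of the same
bandaid would poll the DRBD connection state until the peers are talking
again, with a timeout. The following is only a sketch of that idea, not
Ganeti code: the helper names (drbd_cstate, wait_for, wait_connected) are
made up for illustration, the set of "connected" states is an assumption,
and it assumes the DRBD 8.x /proc/drbd layout where each device line looks
like " 0: cs:Connected ro:Primary/Secondary ...".

```python
import re
import time

# Assumption: DRBD 8.x connection states that mean the peers are talking.
CONNECTED_STATES = frozenset(["Connected", "SyncSource", "SyncTarget"])


def drbd_cstate(proc_drbd_text, minor):
    """Return the 'cs:' connection state for a DRBD minor, or None if absent."""
    pattern = re.compile(r"^\s*%d:\s+cs:(\S+)" % minor, re.MULTILINE)
    match = pattern.search(proc_drbd_text)
    return match.group(1) if match else None


def wait_for(predicate, timeout=10.0, interval=0.1):
    """Poll predicate() until it returns true or 'timeout' seconds elapse."""
    deadline = time.time() + timeout
    while True:
        if predicate():
            return True
        if time.time() >= deadline:
            return False
        time.sleep(interval)


def read_proc_drbd(path="/proc/drbd"):
    """Read the kernel's DRBD status text (hypothetical helper)."""
    with open(path) as handle:
        return handle.read()


def wait_connected(minor, timeout=10.0, reader=read_proc_drbd):
    """Wait until the given minor reports a connected state, or time out."""
    return wait_for(
        lambda: drbd_cstate(reader(), minor) in CONNECTED_STATES,
        timeout=timeout,
    )
```

In the patch above, the time.sleep(0.5) line would become something like a
call to wait_connected() for the device's minor number, failing the
operation if it returns False. How the minor is obtained from rd is left
open here.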