After upgrading Ganeti from 2.6.2 to 2.8.1, which removed my "sleep"
patches, the very first live migration I attempted failed with the same
problem:

root@wrn-vm1:~# gnt-instance migrate testvm.int.example.com
Instance testvm.int.example.com will be migrated. Note that
migration might impact the instance if anything goes wrong (e.g. due
to bugs in the hypervisor). Continue?
y/[n]/?: y
Tue Nov  5 11:08:30 2013 Migrating instance testvm.int.example.com
Tue Nov  5 11:08:30 2013 * checking disk consistency between source and target
Tue Nov  5 11:08:31 2013 * switching node wrn-vm2.int.example.com to secondary mode
Tue Nov  5 11:08:31 2013 * changing into standalone mode
Tue Nov  5 11:08:31 2013 * changing disks into dual-master mode
Tue Nov  5 11:08:32 2013 * wait until resync is done
Failure: command execution error:
Cannot resync disks on node wrn-vm2.int.example.com: DRBD device <<class 
'ganeti.bdev.DRBD8'>: unique_id: ('192.168.8.102', 11001, '192.168.8.101', 
11001, 0, 'c3e57192c295a443428f68ec8f9ff141f692f78f'), children: [<<class 
'ganeti.bdev.LogicalVolume'>: unique_id: ('xenvg', 
'10e24636-0481-4ded-b1d7-b9f97b2919d2.disk0_data'), children: [], 253:7, 
/dev/xenvg/10e24636-0481-4ded-b1d7-b9f97b2919d2.disk0_data>, <<class 
'ganeti.bdev.LogicalVolume'>: unique_id: ('xenvg', 
'10e24636-0481-4ded-b1d7-b9f97b2919d2.disk0_meta'), children: [], 253:8, 
/dev/xenvg/10e24636-0481-4ded-b1d7-b9f97b2919d2.disk0_meta>], 147:0, 
/dev/drbd0> is not in sync: stats=<ganeti.bdev.DRBD8Status object at 
0x23d6c10>

As a reminder, the underlying platform is Debian Wheezy on Dell R210-II 
with 10G back-to-back for the replication network.

I would say this is a very high priority issue, given that Ganeti is
apparently driving DRBD incorrectly, and this can cause the same failures
on more installations than just mine. The post at
http://lists.linbit.com/pipermail/drbd-user/2013-July/020173.html has all
the details. Personally I'd say that Ganeti should be using the DRBD "dual
master" mode (which means both sides *may* be promoted to master, but don't
*have* to be), but that post also describes how to fix the current
disconnect/reconnect approach.

Anyway, I have put back my band-aid and it's OK for now:

--- /usr/local/lib/python2.7/dist-packages/ganeti/backend.py.orig 2013-11-05 10:58:42.161871214 +0000
+++ /usr/local/lib/python2.7/dist-packages/ganeti/backend.py 2013-11-05 11:18:37.614989450 +0000
@@ -3607,6 +3607,7 @@
   for rd in bdevs:
     try:
       rd.AttachNet(multimaster)
+      time.sleep(0.5)
     except errors.BlockDeviceError, err:
       _Fail("Can't change network configuration: %s", err)
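
A fixed sleep is obviously a blunt instrument. A less arbitrary band-aid
would be to poll the DRBD connection state until the peers have actually
reconnected. The following is only a rough sketch of that idea, not Ganeti
code and not something I have run in production: the helper names
(connection_states, wait_until_connected, read_proc_drbd), the timeout
values, and the choice of which states count as "connected" are all my own
assumptions. It relies only on the "N: cs:<state>" lines that DRBD 8 prints
in /proc/drbd.

```python
import re
import time

# Hypothetical helpers for illustration only; none of this is Ganeti API.

# Matches per-device lines in /proc/drbd such as " 0: cs:Connected ro:..."
_CS_RE = re.compile(r"^[ \t]*(\d+):[ \t]+cs:(\S+)", re.MULTILINE)

# Assumption: these connection states mean the network link is up.
CONNECTED_STATES = ("Connected", "SyncSource", "SyncTarget")


def connection_states(proc_drbd_text):
    """Return a dict mapping DRBD minor number to its cs: field."""
    return dict((int(m.group(1)), m.group(2))
                for m in _CS_RE.finditer(proc_drbd_text))


def read_proc_drbd():
    """Read the live status text from the kernel."""
    with open("/proc/drbd") as status_file:
        return status_file.read()


def wait_until_connected(minor, read_status=read_proc_drbd,
                         timeout=10.0, interval=0.1, sleep=time.sleep):
    """Poll until the given minor is connected; False if timeout expires."""
    deadline = time.time() + timeout
    while True:
        state = connection_states(read_status()).get(minor)
        if state in CONNECTED_STATES:
            return True
        if time.time() >= deadline:
            return False
        sleep(interval)
```

In the patched loop this would take the place of time.sleep(0.5), called
with the minor of the device that was just attached (0 for /dev/drbd0 in
the failure above). It still only papers over the ordering problem described
in the linbit post.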

