Status: New
Owner: ----

New issue 613 by [email protected]: Incorrectly offlined node leads to instances running on secondary node
http://code.google.com/p/ganeti/issues/detail?id=613

What software version are you running? Please provide the output of "gnt-cluster --version", "gnt-cluster version", and "hspace --version".
# gnt-cluster --version
gnt-cluster (ganeti v2.6.2) 2.6.2
# gnt-cluster version
Software version: 2.6.2
Internode protocol: 2060000
Configuration format: 2060000
OS api version: 20
Export interface: 0
# hspace --version
hspace (ganeti-htools) version v2.6.2
compiled with ghc 6.12
running on linux x86_64

What distribution are you using?
Debian squeeze

What steps will reproduce the problem?
Let $NODE be a node that is primary for some instances, and is functioning correctly.
1. gnt-node modify -O yes $NODE
2. gnt-node failover $NODE

What is the expected output? What do you see instead?
I expect one of the following to occur:
1. All instances that were running on $NODE should be running somewhere else instead, and not on $NODE.
2. The failover job should fail, the instances should remain running on $NODE, and the cluster should consider $NODE to be the primary for these instances.

Instead, the failover fails due to DRBD errors[1], the VM continues running on $NODE, and ganeti updates its record of the primary node, so that $NODE is now the secondary. DRBD is now disconnected, and the instance is running on what ganeti considers its secondary node. Activate-disks fails, because DRBD won't allow two primaries, and most other operations fail because the "secondary" node of this instance is offline. To make things more confusing, I saw one case where a subsequent watcher run successfully brought up the instance on its new primary node as well, so two VMs were running for that instance.

If the node is marked online again, gnt-instance migrate --cleanup is able to switch the primary back to the right node, though it doesn't get the DRBD state quite right.[2]
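For what it's worth, the bad state is easy to spot mechanically: on the node that still runs the VM, /proc/drbd shows the device Primary with its peer unreachable (cs:WFConnection ro:Primary/Unknown, as in the dump under [1] below), while the other node shows cs:Unconfigured. A rough sketch of a checker, assuming only the /proc/drbd format shown in this report (the parse_drbd_status and looks_split helpers are hypothetical, not part of ganeti):

```python
import re

def parse_drbd_status(text):
    """Parse /proc/drbd text into {minor: {"cs": ..., "ro": ..., "ds": ...}}.

    Unconfigured devices have no ro:/ds: fields, so those keys are None.
    """
    devices = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\d+):\s+cs:(\S+)(?:\s+ro:(\S+)\s+ds:(\S+))?", line)
        if m:
            devices[int(m.group(1))] = {
                "cs": m.group(2),
                "ro": m.group(3),
                "ds": m.group(4),
            }
    return devices

def looks_split(dev):
    """True if this device is Primary but has lost contact with its peer,
    i.e. the state that makes a later 'make primary' on the other node fail
    with 'Multiple primaries not allowed by config'."""
    ro = dev.get("ro") or ""
    return dev["cs"] in ("WFConnection", "StandAlone") and ro.startswith("Primary/")
```

Running this against the two dumps under [1] flags the device on <primary> and not the one on <secondary>.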

Please provide any additional information below.
[1]
# gnt-node failover <primary>
Fail over instance(s) '<instance>'?
y/[n]/?: y
Submitted jobs 8552
Waiting for job 8552 for <instance> ...
Thu Nov 14 17:34:27 2013 Failover instance <instance>
Thu Nov 14 17:34:27 2013 * checking disk consistency between source and target
Thu Nov 14 17:34:27 2013 * shutting down instance on source node
Thu Nov 14 17:34:27 2013 - WARNING: Could not shutdown instance <instance> on node <primary>, proceeding anyway; please make sure node <primary> is down; error details: Node is marked offline
Thu Nov 14 17:34:27 2013 * deactivating the instance's disks on source node
Thu Nov 14 17:34:27 2013 - WARNING: Could not shutdown block device disk/0 on node <primary>: Node is marked offline
Thu Nov 14 17:34:28 2013 * activating the instance's disks on target node <secondary>
Thu Nov 14 17:34:28 2013 - WARNING: Could not prepare block device disk/0 on node <primary> (is_primary=False, pass=1): Node is marked offline
Thu Nov 14 17:34:30 2013 - WARNING: Could not prepare block device disk/0 on node <secondary> (is_primary=True, pass=2): Error while assembling disk: drbd0: can't make drbd device primary: /dev/drbd0: State change failed:
(-1) Multiple primaries not allowed by config\n
Thu Nov 14 17:34:30 2013 - WARNING: Could not shutdown block device disk/0 on node <primary>: Node is marked offline
Job 8552 for <instance> has failed: Failure: command execution error:
Can't activate the instance's disks
There were errors during the failover:
1 error(s) out of 1 instance(s).

/proc/drbd on <primary>:
# cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
srcversion: 2D876214BAAD53B31ADC1D6
 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
ns:155885796 nr:0 dw:155905748 dr:14985457 al:21624 bm:122 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:364

/proc/drbd on <secondary>:
# cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
srcversion: 2D876214BAAD53B31ADC1D6
 0: cs:Unconfigured

[2]
# gnt-instance migrate --cleanup <instance>
Instance <instance> will be recovered
from a failed migration. Note that the migration procedure (including
cleanup) might impact the instance if anything goes wrong (e.g. due to
bugs in the hypervisor). Continue?
y/[n]/?: y
Thu Nov 14 17:56:42 2013 - INFO: Not checking memory on the secondary node as instance will not be started
Failure: prerequisites not met for this operation:
error type: environment_error, error details:
Error checking bridges on destination node '<primary>': Node is marked offline
# gnt-node modify -O no <primary>
Thu Nov 14 17:56:55 2013  - INFO: Auto-promoting node to master candidate
Thu Nov 14 17:56:55 2013 - WARNING: Transitioning node from offline to online state without using re-add. Please make sure the node is healthy!
Modified node <primary>
 - master_candidate -> True
 - offline -> False
# gnt-instance migrate --cleanup <instance>
Instance <instance> will be recovered
from a failed migration. Note that the migration procedure (including
cleanup) might impact the instance if anything goes wrong (e.g. due to
bugs in the hypervisor). Continue?
y/[n]/?: y
Thu Nov 14 17:57:02 2013 - INFO: Not checking memory on the secondary node as instance will not be started
Thu Nov 14 17:57:02 2013 Migrating instance <instance>
Thu Nov 14 17:57:02 2013 * checking where the instance actually runs (if this hangs, the hypervisor might be in a bad state)
Thu Nov 14 17:57:03 2013 * instance running on secondary node (<primary>), updating config
Thu Nov 14 17:57:03 2013 * switching node <secondary> to secondary mode
Failure: command execution error:
Cannot change disk to secondary on node <secondary>: Can't find device <DRBD8(hosts=<primary>/0-<secondary>/0, port=11000, configured as 172.16.241.210:11000 172.16.241.209:11000, backend=<LogicalVolume(/dev/xenvg/c740da43-31db-42c4-b94b-f3b51c086c0a.disk0_data, not visible, size=256000m)>, metadev=<LogicalVolume(/dev/xenvg/c740da43-31db-42c4-b94b-f3b51c086c0a.disk0_meta, not visible, size=128m)>, visible as /dev/disk/0, size=256000m)>
