Status: New
Owner: ----
New issue 613 by [email protected]: Incorrectly offlined node leads to
instances running on secondary node
http://code.google.com/p/ganeti/issues/detail?id=613
What software version are you running? Please provide the output
of "gnt-cluster --version", "gnt-cluster version", and "hspace --version".
# gnt-cluster --version
gnt-cluster (ganeti v2.6.2) 2.6.2
# gnt-cluster version
Software version: 2.6.2
Internode protocol: 2060000
Configuration format: 2060000
OS api version: 20
Export interface: 0
# hspace --version
hspace (ganeti-htools) version v2.6.2
compiled with ghc 6.12
running on linux x86_64
What distribution are you using?
Debian squeeze
What steps will reproduce the problem?
Let $NODE be a node that is primary for some instances, and is functioning
correctly.
1. gnt-node modify -O yes $NODE
2. gnt-node failover $NODE
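The precondition is a DRBD-backed instance whose primary node is $NODE.
As a sketch, such an instance could be created with something like the
following (instance name, OS, second node and disk size are illustrative
placeholders, not what I actually used):
# gnt-instance add -t drbd -n $NODE:$OTHER_NODE -o <os> -s 10G <instance>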
What is the expected output? What do you see instead?
I expect one of the following to occur:
1. All instances that were running on $NODE should be running somewhere
else instead, and not on $NODE.
2. The failover job should fail, the instances should remain running on
$NODE, and the cluster should consider $NODE to be the primary for these
instances.
Instead, the failover fails due to DRBD errors [1], the VM keeps running
on $NODE, and Ganeti updates its record of the primary node, so that
$NODE is now the secondary. At this point DRBD is disconnected and the
instance is running on its secondary node. Activate-disks fails, because
DRBD won't allow two primaries, and most other operations fail because
the "secondary" node of this instance is offline. To make things more
confusing, I saw one case where a subsequent watcher run succeeded in
bringing the instance up on its new primary node as well, so there were
two VMs running for that instance.
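A quick way to see the mismatch ($NODE and <instance> are placeholders as
before; "xm list" assumes the Xen hypervisor, so substitute your
hypervisor's equivalent if needed):
# gnt-instance list -o name,pnode,snodes,status <instance>
# ssh $NODE xm list
# ssh $NODE cat /proc/drbd
The first command shows what the cluster now records ($NODE demoted to
secondary), while the other two show the VM and the DRBD resource still
active on $NODE.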
If the node is marked online again, gnt-instance migrate --cleanup is able
to switch the primary back to the right node, though it doesn't get the
DRBD state quite right [2].
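In other words, the recovery sequence is roughly the one below; the last
step is only my assumption about how to get DRBD reconnected once both
nodes agree on the roles again, and I haven't verified that it repairs
the state shown in [2]:
# gnt-node modify -O no $NODE
# gnt-instance migrate --cleanup <instance>
# gnt-instance activate-disks <instance>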
Please provide any additional information below.
[1]
# gnt-node failover <primary>
Fail over instance(s) '<instance>'?
y/[n]/?: y
Submitted jobs 8552
Waiting for job 8552 for <instance> ...
Thu Nov 14 17:34:27 2013 Failover instance <instance>
Thu Nov 14 17:34:27 2013 * checking disk consistency between source and
target
Thu Nov 14 17:34:27 2013 * shutting down instance on source node
Thu Nov 14 17:34:27 2013 - WARNING: Could not shutdown instance <instance>
on node <primary>, proceeding anyway; please make sure node <primary> is
down; error details: Node is marked offline
Thu Nov 14 17:34:27 2013 * deactivating the instance's disks on source node
Thu Nov 14 17:34:27 2013 - WARNING: Could not shutdown block device disk/0
on node <primary>: Node is marked offline
Thu Nov 14 17:34:28 2013 * activating the instance's disks on target node
<secondary>
Thu Nov 14 17:34:28 2013 - WARNING: Could not prepare block device disk/0
on node <primary> (is_primary=False, pass=1): Node is marked offline
Thu Nov 14 17:34:30 2013 - WARNING: Could not prepare block device disk/0
on node <secondary> (is_primary=True, pass=2): Error while assembling disk:
drbd0: can't make drbd device primary: /dev/drbd0: State change failed:
(-1) Multiple primaries not allowed by config\n
Thu Nov 14 17:34:30 2013 - WARNING: Could not shutdown block device disk/0
on node <primary>: Node is marked offline
Job 8552 for <instance> has failed: Failure: command execution error:
Can't activate the instance's disks
There were errors during the failover:
1 error(s) out of 1 instance(s).
/proc/drbd on <primary>:
# cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
srcversion: 2D876214BAAD53B31ADC1D6
0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
ns:155885796 nr:0 dw:155905748 dr:14985457 al:21624 bm:122 lo:0 pe:0
ua:0 ap:0 ep:1 wo:d oos:364
/proc/drbd on <secondary>:
# cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
srcversion: 2D876214BAAD53B31ADC1D6
0: cs:Unconfigured
[2]
# gnt-instance migrate --cleanup <instance>
Instance <instance> will be recovered
from a failed migration. Note that the migration procedure (including
cleanup) might impact the instance if anything goes wrong (e.g. due to
bugs in the hypervisor). Continue?
y/[n]/?: y
Thu Nov 14 17:56:42 2013 - INFO: Not checking memory on the secondary node
as instance will not be started
Failure: prerequisites not met for this operation:
error type: environment_error, error details:
Error checking bridges on destination node '<primary>': Node is marked
offline
# gnt-node modify -O no <primary>
Thu Nov 14 17:56:55 2013 - INFO: Auto-promoting node to master candidate
Thu Nov 14 17:56:55 2013 - WARNING: Transitioning node from offline to
online state without using re-add. Please make sure the node is healthy!
Modified node <primary>
- master_candidate -> True
- offline -> False
# gnt-instance migrate --cleanup <instance>
Instance <instance> will be recovered
from a failed migration. Note that the migration procedure (including
cleanup) might impact the instance if anything goes wrong (e.g. due to
bugs in the hypervisor). Continue?
y/[n]/?: y
Thu Nov 14 17:57:02 2013 - INFO: Not checking memory on the secondary node
as instance will not be started
Thu Nov 14 17:57:02 2013 Migrating instance <instance>
Thu Nov 14 17:57:02 2013 * checking where the instance actually runs (if
this hangs, the hypervisor might be in a bad state)
Thu Nov 14 17:57:03 2013 * instance running on secondary node (<primary>),
updating config
Thu Nov 14 17:57:03 2013 * switching node <secondary> to secondary mode
Failure: command execution error:
Cannot change disk to secondary on node <secondary>: Can't find device
<DRBD8(hosts=<primary>/0-<secondary>/0, port=11000, configured as
172.16.241.210:11000 172.16.241.209:11000,
backend=<LogicalVolume(/dev/xenvg/c740da43-31db-42c4-b94b-f3b51c086c0a.disk0_data,
not visible, size=256000m)>,
metadev=<LogicalVolume(/dev/xenvg/c740da43-31db-42c4-b94b-f3b51c086c0a.disk0_meta,
not visible, size=128m)>, visible as /dev/disk/0, size=256000m)>