Status: New
Owner: ----

New issue 613 by [email protected]: Incorrectly offlined node leads to instances running on secondary node
http://code.google.com/p/ganeti/issues/detail?id=613

What software version are you running? Please provide the output of "gnt-cluster --version", "gnt-cluster version", and "hspace --version".
# gnt-cluster --version
gnt-cluster (ganeti v2.6.2) 2.6.2
# gnt-cluster version
Software version: 2.6.2
Internode protocol: 2060000
Configuration format: 2060000
OS api version: 20
Export interface: 0
# hspace --version
hspace (ganeti-htools) version v2.6.2
compiled with ghc 6.12
running on linux x86_64

What distribution are you using?
Debian squeeze

What steps will reproduce the problem?
Let $NODE be a node that is primary for some instances, and is functioning correctly.
1. gnt-node modify -O yes $NODE
2. gnt-node failover $NODE

What is the expected output? What do you see instead?
I expect one of the following to occur:
1. All instances that were running on $NODE should be running somewhere else instead, and not on $NODE.
2. The failover job should fail, the instances should remain running on $NODE, and the cluster should consider $NODE to be the primary for these instances.

Instead, the failover fails due to DRBD errors[1], the VM continues running on $NODE, and ganeti updates its record of the primary node, so that $NODE is now the secondary. DRBD is now disconnected, and the instance is running on what ganeti considers its secondary node. Activate-disks fails, because DRBD won't allow two primaries, and most other operations fail because the "secondary" node of this instance is offline. To make things more confusing, I saw one case where a subsequent watcher run successfully brought up the instance on its new primary node as well, so two VMs were running for that instance.

If the node is marked online again, gnt-instance migrate --cleanup is able to switch the primary back to the right node, though it doesn't get the DRBD state quite right.[2]
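For what it's worth, the bad state is easy to spot mechanically: on the node that still runs the VM, /proc/drbd shows the device Primary with its peer unreachable (cs:WFConnection ro:Primary/Unknown, as in the dump under [1] below), while the other node shows cs:Unconfigured. A rough sketch of a checker, assuming only the /proc/drbd format shown in this report (the parse_drbd_status and looks_split helpers are hypothetical, not part of ganeti):

```python
import re

def parse_drbd_status(text):
    """Parse /proc/drbd text into {minor: {"cs": ..., "ro": ..., "ds": ...}}.

    Unconfigured devices have no ro:/ds: fields, so those keys are None.
    """
    devices = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\d+):\s+cs:(\S+)(?:\s+ro:(\S+)\s+ds:(\S+))?", line)
        if m:
            devices[int(m.group(1))] = {
                "cs": m.group(2),
                "ro": m.group(3),
                "ds": m.group(4),
            }
    return devices

def looks_split(dev):
    """True if this device is Primary but has lost contact with its peer,
    i.e. the state that makes a later 'make primary' on the other node fail
    with 'Multiple primaries not allowed by config'."""
    ro = dev.get("ro") or ""
    return dev["cs"] in ("WFConnection", "StandAlone") and ro.startswith("Primary/")
```

Running this against the two dumps under [1] flags the device on <primary> and not the one on <secondary>.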

Please provide any additional information below.
[1]
# gnt-node failover <primary>
Fail over instance(s) '<instance>'?
y/[n]/?: y
Submitted jobs 8552
Waiting for job 8552 for <instance> ...
Thu Nov 14 17:34:27 2013 Failover instance <instance>
Thu Nov 14 17:34:27 2013 * checking disk consistency between source and target
Thu Nov 14 17:34:27 2013 * shutting down instance on source node
Thu Nov 14 17:34:27 2013 - WARNING: Could not shutdown instance <instance> on node <primary>, proceeding anyway; please make sure node <primary> is down; error details: Node is marked offline
Thu Nov 14 17:34:27 2013 * deactivating the instance's disks on source node
Thu Nov 14 17:34:27 2013 - WARNING: Could not shutdown block device disk/0 on node <primary>: Node is marked offline
Thu Nov 14 17:34:28 2013 * activating the instance's disks on target node <secondary>
Thu Nov 14 17:34:28 2013 - WARNING: Could not prepare block device disk/0 on node <primary> (is_primary=False, pass=1): Node is marked offline
Thu Nov 14 17:34:30 2013 - WARNING: Could not prepare block device disk/0 on node <secondary> (is_primary=True, pass=2): Error while assembling disk: drbd0: can't make drbd device primary: /dev/drbd0: State change failed:
(-1) Multiple primaries not allowed by config\n
Thu Nov 14 17:34:30 2013 - WARNING: Could not shutdown block device disk/0 on node <primary>: Node is marked offline
Job 8552 for <instance> has failed: Failure: command execution error:
Can't activate the instance's disks
There were errors during the failover:
1 error(s) out of 1 instance(s).

/proc/drbd on <primary>:
# cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
srcversion: 2D876214BAAD53B31ADC1D6
 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
ns:155885796 nr:0 dw:155905748 dr:14985457 al:21624 bm:122 lo:0 pe:0 ua:0 ap:0 ep:1 wo:d oos:364

/proc/drbd on <secondary>:
# cat /proc/drbd
version: 8.3.11 (api:88/proto:86-96)
srcversion: 2D876214BAAD53B31ADC1D6
 0: cs:Unconfigured

[2]
# gnt-instance migrate --cleanup <instance>
Instance <instance> will be recovered
from a failed migration. Note that the migration procedure (including
cleanup) might impact the instance if anything goes wrong (e.g. due to
bugs in the hypervisor). Continue?
y/[n]/?: y
Thu Nov 14 17:56:42 2013 - INFO: Not checking memory on the secondary node as instance will not be started
Failure: prerequisites not met for this operation:
error type: environment_error, error details:
Error checking bridges on destination node '<primary>': Node is marked offline
# gnt-node modify -O no <primary>
Thu Nov 14 17:56:55 2013  - INFO: Auto-promoting node to master candidate
Thu Nov 14 17:56:55 2013 - WARNING: Transitioning node from offline to online state without using re-add. Please make sure the node is healthy!
Modified node <primary>
 - master_candidate -> True
 - offline -> False
# gnt-instance migrate --cleanup <instance>
Instance <instance> will be recovered
from a failed migration. Note that the migration procedure (including
cleanup) might impact the instance if anything goes wrong (e.g. due to
bugs in the hypervisor). Continue?
y/[n]/?: y
Thu Nov 14 17:57:02 2013 - INFO: Not checking memory on the secondary node as instance will not be started
Thu Nov 14 17:57:02 2013 Migrating instance <instance>
Thu Nov 14 17:57:02 2013 * checking where the instance actually runs (if this hangs, the hypervisor might be in a bad state)
Thu Nov 14 17:57:03 2013 * instance running on secondary node (<primary>), updating config
Thu Nov 14 17:57:03 2013 * switching node <secondary> to secondary mode
Failure: command execution error:
Cannot change disk to secondary on node <secondary>: Can't find device <DRBD8(hosts=<primary>/0-<secondary>/0, port=11000, configured as 172.16.241.210:11000 172.16.241.209:11000, backend=<LogicalVolume(/dev/xenvg/c740da43-31db-42c4-b94b-f3b51c086c0a.disk0_data, not visible, size=256000m)>, metadev=<LogicalVolume(/dev/xenvg/c740da43-31db-42c4-b94b-f3b51c086c0a.disk0_meta, not visible, size=128m)>, visible as /dev/disk/0, size=256000m)>
