Status: Accepted
Owner: ----
Labels: Type-Defect Priority-High Milestone-Release2.16
New issue 1189 by [email protected]: master-failover does not check
serial_no of config.data
https://code.google.com/p/ganeti/issues/detail?id=1189
What steps will reproduce the problem?
3 nodes: node1 (master), node2 (mc), node3 (mc)
node2: service ganeti stop (to simulate a network failure)
node1: gnt-network add --network 10.0.0.0/24 test-net
NOTE: /var/log/ganeti/wconf-daemon.log
2016-09-25 13:40:02,171654000000 EEST: ganeti-wconfd pid=26686/ThreadId 9
ERROR Can't distribute Ssconf: GenericError "At least one of the RPC calls
failed"
node2: service ganeti start
node2: gnt-cluster master-failover
NOTE: test-net does not exist any more!
master-failover on stable-2.16
------------------------------
1. get node list from ssconf
2. For each of these nodes, ask which node it believes is the current
master:
bootstrap._GatherMasterVotes(["node1", "node2", "node3"])
[('node1', 3)]
NOTE: If some of them do not respond, it continues with a warning:
ERROR:root:RPC error in master_node_name on node node1: Error 7: Failed
to connect to 10.97.2.100 port 1811: Connection refused
WARNING:root:Error contacting node node2: Error 7: Failed to connect to
10.97.2.100 port 1811: Connection refused
NOTE: If the master candidate, based on its own ssconf, finds that
voted_master != old_master, it fails with the following message:
"I have a wrong configuration, I believe the master is old_master but the
other nodes voted voted_master. Please resync the configuration of this node."
*** Nothing here checks the serial_no of config.data at all!!! ***
3. "Setting master to node2, old master: node1"
4. "Forcefully start WConfd so that we can access the configuration":
ganeti-wconfd --force-node --no-voting --yes-do-it
5. Update config.data with new master (this generates ssconf files too)
6. If cfg.Update worked, the old master daemon can no longer write its own
config file (we rely on locking in both backend.UploadFile() and
ConfigWriter._Write()); hence the next step is to kill the old master.
7. "Stopping the master daemon on node node1"
RPC call_node_deactivate_master_ip()
RPC node_stop_master()
NOTE: If any of the above fails, we just print a warning.
8. Stop the wconfd that was started forcefully
9. Check that the master IP is not ping-able
10. "Starting the master daemons on the new master"
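
To illustrate why step 2 cannot catch a stale config.data, here is a
simplified model of the vote tally. It is only a sketch and not the actual
bootstrap._GatherMasterVotes code: the function name tally_master_votes and
the shape of its input are assumptions made for illustration. The point is
that each answer carries only a master name, so a node with an outdated
configuration still sees a unanimous vote.

```python
from collections import Counter


def tally_master_votes(answers):
    """Count how many nodes voted for each master name.

    `answers` is a hypothetical dict mapping node name to the master that
    node reported, or None if the node could not be contacted.
    Returns (master, count) pairs, most-voted first.
    """
    counts = Counter(answers.values())
    # Sort by vote count only; the config serial_no plays no role here.
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


answers = {"node1": "node1", "node2": "node1", "node3": "node1"}
print(tally_master_votes(answers))
```

With all three nodes naming node1, the tally has the same shape as the
[('node1', 3)] result shown in step 2, regardless of how old the local
config.data is.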
Possible fix
------------
When asking all nodes about the current master, also ask them for the
serial_no of their cluster configuration.
Deny a master-failover to a node holding an older version of the cluster
configuration (i.e. when voted_serial_no > old_serial_no).
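
A minimal sketch of the proposed check, assuming each node's answer could be
extended to carry its serial_no alongside the master name. The names Vote,
FailoverDenied and check_failover_allowed are hypothetical and do not exist
in Ganeti; this only illustrates the comparison, not how the RPC would be
changed.

```python
from collections import namedtuple

# Hypothetical per-node answer: the master it believes in plus the
# serial_no of its copy of config.data.
Vote = namedtuple("Vote", ["master", "serial_no"])


class FailoverDenied(Exception):
    """Raised when the local configuration is older than a peer's."""


def check_failover_allowed(local_serial_no, votes):
    """Refuse failover if any reachable node has a newer configuration.

    `votes` maps node name to a Vote, or to None for unreachable nodes.
    Returns the highest serial_no seen among reachable nodes.
    """
    reachable = [v for v in votes.values() if v is not None]
    # Assumption: if no node answers, fall back to the local serial_no.
    voted_serial_no = max((v.serial_no for v in reachable),
                          default=local_serial_no)
    if voted_serial_no > local_serial_no:
        raise FailoverDenied(
            "local config.data has serial_no %d but other nodes report %d; "
            "resync the configuration before failing over"
            % (local_serial_no, voted_serial_no))
    return voted_serial_no
```

In the scenario above, node2 missed the update made by gnt-network add, so
its serial_no would be lower than the one reported by node1 or node3 and
the failover would be refused instead of silently dropping test-net.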