Status: Accepted
Owner: ----
Labels: Type-Defect Priority-High Milestone-Release2.16

New issue 1189 by [email protected]: master-failover does not check serial_no of config.data
https://code.google.com/p/ganeti/issues/detail?id=1189

What steps will reproduce the problem?

3 nodes: node1 (master), node2 (mc), node3 (mc), where mc = master candidate

node2: service ganeti stop (to simulate a network failure)

node1: gnt-network add --network 10.0.0.0/24 test-net

NOTE: /var/log/ganeti/wconf-daemon.log shows:

2016-09-25 13:40:02,171654000000 EEST: ganeti-wconfd pid=26686/ThreadId 9 ERROR Can't distribute Ssconf: GenericError "At least one of the RPC calls failed"

node2: service ganeti start
node2: gnt-cluster master-failover

NOTE: test-net does not exist any more!

master-failover on stable-2.16
------------------------------

1. get node list from ssconf

2. ask each of these nodes which node it believes is the current master.

   bootstrap._GatherMasterVotes(["node1", "node2", "node3"])
   [('node1', 3)]

NOTE: If some of them do not respond, it continues with a warning:

ERROR:root:RPC error in master_node_name on node node1: Error 7: Failed to connect to 10.97.2.100 port 1811: Connection refused
WARNING:root:Error contacting node node2: Error 7: Failed to connect to 10.97.2.100 port 1811: Connection refused

NOTE: If the master candidate, based on its ssconf, thinks that voted_master != old_master,
      then it fails with the following message:

"I have a wrong configuration, I believe the master is old_master but the other nodes
  voted voted_master. Please resync the configuration of this node."


*** Nothing here checks the serial_no of config.data at all!!! ***

3. "Setting master to node2, old master: node1"

4. "Forcefully start WConfd so that we can access the configuration":
   ganeti-wconfd --force-node --no-voting --yes-do-it

5. Update config.data with new master (this generates ssconf files too)

6. If cfg.Update worked, the old master daemon can no longer
   write its own config file (we rely on locking in both
   backend.UploadFile() and ConfigWriter._Write()); hence the next
   step is to kill the old master

7. "Stopping the master daemon on node node1"
   RPC call_node_deactivate_master_ip()
   RPC node_stop_master()

NOTE: If any of the above fails, we just print a warning

8. Stop wconfd that was started forcefully

9. Check that the master IP is not ping-able

10. "Starting the master daemons on the new master"
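
The voting in step 2 can be sketched roughly as follows. This is only an illustration of the behaviour described above, not the code in bootstrap.py: tally_master_votes and check_votes are made-up names, and treating an unreachable node as a None answer is an assumption.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Votes = List[Tuple[Optional[str], int]]


def tally_master_votes(answers: Dict[str, Optional[str]]) -> Votes:
    """Count how many nodes named each master.

    `answers` maps a node name to the master it reported, or to None
    when the RPC to that node failed (assumed representation).
    """
    # most_common() sorts by vote count, highest first.
    return Counter(answers.values()).most_common()


def check_votes(old_master: str, votes: Votes) -> None:
    """Fail if the reachable nodes voted for someone other than old_master."""
    # Unreachable nodes (None) only produce a warning, so skip them here.
    named = [(master, count) for master, count in votes if master is not None]
    if not named:
        return  # assumption: with no answers at all there is nothing to compare
    voted_master = named[0][0]
    if voted_master != old_master:
        raise RuntimeError(
            "I have a wrong configuration, I believe the master is %s but the "
            "other nodes voted %s. Please resync the configuration of this node."
            % (old_master, voted_master)
        )


if __name__ == "__main__":
    votes = tally_master_votes(
        {"node1": "node1", "node2": "node1", "node3": "node1"}
    )
    print(votes)  # [('node1', 3)]
    check_votes("node1", votes)  # passes: the votes agree with the local view
```

Note that only master names are compared; a node whose config.data is stale but still names the right master passes this check.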


Possible fix
------------

When asking all nodes about the current master, also ask them
for the serial_no of their cluster configuration.

Deny a master-failover to a node holding an older version of the
cluster configuration (i.e. when voted_serial_no > old_serial_no).
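
A minimal sketch of such a check, under stated assumptions: each node is assumed to answer with a (master_name, serial_no) pair, old_serial_no is read as the serial_no of the config.data on the node running the failover (called local_serial_no below), and voted_serial_no is taken as the highest value any reachable node reports. The names check_serial_no and FailoverDenied are hypothetical; no such RPC or exception exists in Ganeti as far as this report shows.

```python
from typing import Dict, Optional, Tuple

# Hypothetical reply shape: (master_name, serial_no), or None if the RPC failed.
Reply = Optional[Tuple[str, int]]


class FailoverDenied(Exception):
    """Raised when the local config.data is older than a peer's."""


def check_serial_no(local_serial_no: int, replies: Dict[str, Reply]) -> None:
    """Refuse the failover if any reachable node has a newer configuration."""
    reported = [reply[1] for reply in replies.values() if reply is not None]
    # If nobody answered, fall back to the local value (assumption).
    voted_serial_no = max(reported, default=local_serial_no)
    if voted_serial_no > local_serial_no:
        raise FailoverDenied(
            "local config.data has serial_no %d but other nodes report %d; "
            "resync the configuration before failing over"
            % (local_serial_no, voted_serial_no)
        )


if __name__ == "__main__":
    # Made-up serial numbers: node2 missed one configuration update.
    replies = {"node1": ("node1", 11), "node2": ("node1", 10), "node3": ("node1", 11)}
    try:
        check_serial_no(10, replies)
    except FailoverDenied as err:
        print("failover denied:", err)
```

In the scenario from this report, node2 missed the gnt-network add while its daemons were stopped, so node1 would report a higher serial_no and the failover started on node2 would be refused instead of silently dropping test-net. Taking the maximum rather than a majority value is a design choice of this sketch.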
