[ https://issues.apache.org/jira/browse/ZOOKEEPER-107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011450#comment-13011450 ]
Vishal K commented on ZOOKEEPER-107:
------------------------------------

Hi Alex,

Thanks for the updated design. Some comments below. Some of these might be a repetition of what we discussed earlier; I am adding them here so that we have them on the jira.

{quote}
D. The purpose of connecting is only to begin state transfer, so only connecting to the leader matters. for members(M) to detect failure of new servers that are trying to connect. If the leader fails then the new leader can be responsible for establishing connections to members(M') as part of completing the reconfiguration (see first step in Section 4.2).
{quote}

What will M' do if leader(M) fails? We could run into a scenario where members(M') don't know about leader(M) and leader(M) may not know about M'. So it might be easier to ask M' to connect to M. Essentially, we need a way to figure out who the current leader is. We could potentially use the virtual IP based approach that I mentioned earlier: let the leader run a "cluster IP", and anyone (including ZK clients) that wants to get cluster/leader information can send a query to this IP.
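To illustrate the kind of leader query I have in mind: the "cluster IP" itself does not exist in ZooKeeper today, so this minimal sketch simply polls each known server with the existing 'stat' four-letter command and looks for "Mode: leader". The host names and ports are made up, and this is only an illustration, not a proposed implementation.

{code:java}
// Stand-in for the proposed "cluster IP" query: ask each known server for its
// mode using the existing 'stat' four-letter command. Hosts/ports are examples.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class FindLeader {
    public static void main(String[] args) {
        String[] servers = { "server1:2181", "server2:2181", "server3:2181" };
        for (String server : servers) {
            String[] hostPort = server.split(":");
            try (Socket s = new Socket(hostPort[0], Integer.parseInt(hostPort[1]));
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(s.getInputStream(), StandardCharsets.US_ASCII))) {
                // Send the four-letter command and scan the reply for the server's mode.
                s.getOutputStream().write("stat".getBytes(StandardCharsets.US_ASCII));
                s.getOutputStream().flush();
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.startsWith("Mode:") && line.contains("leader")) {
                        System.out.println("Current leader: " + server);
                        return;
                    }
                }
            } catch (IOException e) {
                // Server unreachable or not serving requests; try the next one.
            }
        }
        System.out.println("No leader found");
    }
}
{code}

With a cluster IP, the same query would go to a single well-known address instead of looping over all servers.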
{quote}
the old leader can appoint one of members(M') to be the new leader (the new leader should be chosen s.t. it is up-to-date with M). This way leader election in M' is only required if the designated leader fails.
{quote}

I am not in favor of this approach. May I suggest that, from an implementation perspective, we leave this for the future work/improvements section? It certainly is a good optimization, but I don't see it giving us significant benefits. IMHO, it seems a bit counterintuitive for node x to designate node y as a leader in a distributed protocol that is designed to elect leaders. I find it easier to think "elect a leader if you need one". I am also concerned that this is going to make the implementation prone to strange corner cases and more complex (to test as well).

{quote}
E. This is very related to ZOOKEEPER-22 (how does a client know whether its operation has been executed if the leader failed)?
{quote}

Agreed. However, the client library attempts to reconnect to the servers, and the application can verify whether the transaction is done the next time it reconnects. We may have to do something similar as well.

{quote}
reconfiguration attempt and completed it or no further leader will (if we fix ZOOKEEPER-335)
{quote}

Why should we not fix ZOOKEEPER-335? Won't the log divergence cause an unpredictable outcome of reconfiguration? Say the log of A has M' and the log of B has M''. Depending upon who wins the election, the configuration will be either M' or M''.

{quote}
I think a reconfig operation should be sent to the leader only after a majority of M' has connected to it.
{quote}

Sure, though (as you pointed out) this is not strictly necessary for the correctness of the protocol.

{quote}
Here are some issues that I see with your proposal:
1. Suppose that leader(M) fails while sending phase-3 messages and all phase-3 messages arrive to members(M) but none arrive to members(M'). Now the system is stuck: M' - (M n M') cannot serve client requests whereas members(M) are no longer in the system. This is why I think before garbage-collecting M we must make sure that M' is operational by itself.
{quote}

Could this be fixed by sending the message to M' first and then sending it to M after receiving acks from a majority of M'?

{quote}
2. In my proposal members(M') can accept operations as soon as they get the phase-2 message, which is the purpose of phase 2. I don't see why phase-2 is needed in your proposal. I suggest to stick with 3 phases s.t. the phase-2 message allows M' to be independent.
3. Because of the Global Primary Order property of ZAB we can't allow a commit message to arrive from leader(M) to members(M') after they are allowed to process messages independently in M'. Your solution handles this because members(M') close connections to leader(M) as soon as they're allowed to process messages independently, but then it's not clear how the outstanding ops get committed. In order to commit these operations nevertheless, I suggest that this would be the task of leader(M'). Here's what I suggest:
* After leader(M) sends the phase-1 message to members(M), it starts sending every received new op to both members(M) and members(M') (treating them as followers). Commit messages must not be sent to members(M') by leader(M); sending such messages to members(M) does not hurt but is not necessary. In any case, clients are not acked regarding these ops.
* As soon as enough acks are received for the phase-1 message from members(M), leader(M) sends a phase-2 message to members(M') and from this moment doesn't accept new ops (unless leader(M) is also in M' and therefore acts also as leader(M')).
* When members(M') receive the phase-2 message from leader(M), they send an ack both to leader(M) and to leader(M') - in practice each can send an ack to leader(M) first, then disconnect from it, and then connect to leader(M') and send an ack to it, but it doesn't matter for correctness. When leader(M') receives this ack from a quorum of members(M') (as well as the phase-2 message from leader(M)), it can commit all preceding ops in M'.
{quote}
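To make the message ordering in the quoted handoff easier to follow, here is a small self-contained sketch of the phase-1/phase-2 sequence for M = {A, B, C} and M' = {B, C, D}. None of this is real ZooKeeper code; the classes and message names are invented purely for illustration.

{code:java}
// Invented illustration of the quoted phase-1/phase-2 handoff; not ZooKeeper code.
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class ReconfigHandoffSketch {

    enum Msg { PHASE1, NEW_OP, PHASE2, ACK2, COMMIT }

    static class Server {
        final String name;
        final List<String> log = new ArrayList<>();
        Server(String name) { this.name = name; }
        void receive(Msg m, String detail) { log.add(m + "(" + detail + ")"); }
        public String toString() { return name + " " + log; }
    }

    public static void main(String[] args) {
        Server a = new Server("A"), b = new Server("B"), c = new Server("C"), d = new Server("D");
        List<Server> membersM  = List.of(a, b, c);   // old configuration M
        List<Server> membersM2 = List.of(b, c, d);   // new configuration M'
        Server leaderM  = a;                         // leader of M
        Server leaderM2 = b;                         // leader of M' (elected or designated)

        // Phase 1: leader(M) proposes the reconfig to members(M). From now on every
        // new op is forwarded to members of both M and M', but COMMIT is never sent
        // to members(M'), and clients are not acked for these ops yet.
        membersM.forEach(s -> s.receive(Msg.PHASE1, "reconfig M -> M'"));
        Set<Server> both = new LinkedHashSet<>(membersM);
        both.addAll(membersM2);
        both.forEach(s -> s.receive(Msg.NEW_OP, "op-42, uncommitted"));

        // Phase 2: once a quorum of members(M) has acked phase 1, leader(M) tells M'
        // it may become independent and stops accepting new ops itself.
        membersM2.forEach(s -> s.receive(Msg.PHASE2, "activate M'"));

        // members(M') ack phase 2 to both leader(M) and leader(M'); once leader(M')
        // has a quorum of these acks, it commits the pending ops within M'.
        membersM2.forEach(s -> {
            leaderM.receive(Msg.ACK2, "from " + s.name);
            leaderM2.receive(Msg.ACK2, "from " + s.name);
        });
        membersM2.forEach(s -> s.receive(Msg.COMMIT, "op-42, committed by leader(M')"));

        both.forEach(System.out::println);
    }
}
{code}

The point the sketch tries to capture is that the commit in M' comes only from leader(M'), never from leader(M), which is exactly the Global Primary Order concern above.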
There are subtle differences between the protocol that I suggested and your proposal. I am not fully convinced that the changes are necessary; I might be missing something here. I will get in touch with you offline for further clarification.

There are a few problems in the revised protocol posted on the wiki:
1. leader(M) does not wait for confirmation of ongoing transactions from a majority of M' (during phase 1). How do you guarantee that once M' starts leader election, all the transactions that are committed to M are also committed to M'? A majority of M' might be lagging behind, and one of them might end up becoming leader(M').
2. Why is Step 8 ("Stop processing new operations, return failure for any further ops received") necessary? The protocol that I suggested does not reject any transactions from the clients. In the most common case of reconfiguration, only a subset (very likely 1) of the peers would be added/removed, so the ZK client library can reconnect to one of the other ZK servers using the same code that it uses today. As a result, the application will not even notice that a reconfiguration has happened (other than potentially receiving notifications about the new configuration).

{quote}
H. I agree - it's an overkill. We can make sure that a quorum of M' is connected when the reconfiguration starts.
{quote}

What should we tell the administrator to do if a majority of M' fails during reconfiguration? During normal operations, if a majority of nodes fail, the admin has a choice to copy the DB from one of the live nodes to the rest of the nodes and get the cluster going immediately. There is a risk of losing some transactions, but there is also a chance that one of the nodes has a reasonably up-to-date copy of the data tree. However, during reconfiguration, if a majority of M' fails, the cluster is unrecoverable even if a majority of M is online.

Are we going to assume that the admin needs to take a backup before doing the reconfig?

Thanks.
-Vishal

> Allow dynamic changes to server cluster membership
> --------------------------------------------------
>
>                 Key: ZOOKEEPER-107
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-107
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: server
>            Reporter: Patrick Hunt
>            Assignee: Henry Robinson
>         Attachments: SimpleAddition.rtf
>
>
> Currently cluster membership is statically defined, adding/removing hosts to/from the server cluster dynamically needs to be supported.