[ https://issues.apache.org/jira/browse/ZOOKEEPER-107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13011513#comment-13011513 ]

Alexander Shraer commented on ZOOKEEPER-107:
--------------------------------------------

Hi Vishal,

Thanks for the comments.

> What will M' do if the leader(M) fails? 

It depends on the stage. Before phase 2, M' does nothing - M will elect a new
leader, and that leader will either complete the reconfiguration, if it finds
traces of it stored by phase 1 in M, or otherwise abandon it (in which case s
or the client will perhaps re-submit it). As part of the recovery protocol
(see the twiki), the new leader instructs members(M') to connect to it,
similarly to the way s originally instructed them to connect to the old
leader(M).
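
Roughly, the decision the new leader(M) makes could look like the sketch
below - ReconfigStore/ReconfigProposal and the method names are made up for
illustration, not taken from the patch:

// Hypothetical sketch: what a newly elected leader(M) could do on startup.
// ReconfigStore, ReconfigProposal etc. are made-up names, not ZooKeeper APIs.
final class RecoveryOnLeaderChange {

    interface ReconfigStore {
        // Returns the phase-1 record of an in-flight reconfiguration, if any
        // member of M logged one before the old leader died, else null.
        ReconfigProposal findPhase1Record();
    }

    record ReconfigProposal(java.util.List<String> newMembers) {}

    /** Decide whether to finish or abandon a reconfiguration found on recovery. */
    static void recover(ReconfigStore store) {
        ReconfigProposal pending = store.findPhase1Record();
        if (pending != null) {
            // Traces of phase 1 exist in M: the new leader(M) resumes the
            // protocol and tells members(M') to connect to it.
            System.out.println("completing reconfig to " + pending.newMembers());
        } else {
            // No phase-1 trace: abandon; s or the client may re-submit.
            System.out.println("no pending reconfig, abandoning");
        }
    }

    public static void main(String[] args) {
        recover(() -> null); // exercises the 'abandon' branch
    }
}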

If leader(M) fails after a phase-2 message has arrived at some process in M',
that process will forward the message to the rest of M'. If this process fails
before/during forwarding, M still has a quorum and we're in the same case as
described above. Otherwise, members(M') will get this message and run leader
election. According to what I suggested, leader election will not be run if
the appointed leader(M') is up. In any case, I think that after getting the
phase-2 (activation) message, a process in M' will not agree to connect to M;
instead it will try to convince the rest of M' to start running.
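
To make the forwarding behaviour concrete, here is a toy sketch of how a
member of M' might react to the activation message (Peer and the method names
are invented for illustration, not how the patch is structured):

import java.util.List;

// Toy sketch of a member of M' reacting to the phase-2 (activation) message.
// Peer and the method names are invented for illustration only.
final class NewMemberActivation {

    interface Peer { void send(String msg); }

    private final List<Peer> restOfNewConfig;
    private boolean activated = false;

    NewMemberActivation(List<Peer> restOfNewConfig) {
        this.restOfNewConfig = restOfNewConfig;
    }

    /** Called when the phase-2 activation message arrives from leader(M). */
    synchronized void onActivation(String activationMsg) {
        activated = true;
        // Forward the activation to the rest of M', in case leader(M) failed
        // before the message reached everyone.
        for (Peer p : restOfNewConfig) {
            p.send(activationMsg);
        }
        // From here on this process tries to get M' running (appointed leader
        // or leader election) rather than reconnecting to M.
    }

    /** A request from M to (re)connect is rejected once we are activated. */
    synchronized boolean acceptConnectionFromOldConfig() {
        return !activated;
    }

    public static void main(String[] args) {
        NewMemberActivation member = new NewMemberActivation(
                List.of(msg -> System.out.println("forwarding: " + msg)));
        member.onActivation("ACTIVATE M'");
        System.out.println("accept connection from M? "
                + member.acceptConnectionFromOldConfig()); // false
    }
}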

> appointing a leader

This seems like an important optimization that can easily be done: since
leader(M) has a quorum of responsive members from M', it can just pick one of
them and save a leader election. All this means is that before running leader
election in M', a process will attempt to connect to the appointed leader(M'),
and if that fails it will still run leader election. Admittedly, I don't yet
fully understand the code, so I might be wrong and this might in fact turn out
to be too complicated.
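
In code form the shortcut is roughly the following (connectTo and
runLeaderElection are placeholders, not real QuorumPeer methods):

import java.util.function.Predicate;
import java.util.function.Supplier;

// Rough sketch of the "appointed leader" shortcut for M'.
// connectTo/runLeaderElection are placeholders, not QuorumPeer methods.
final class AppointedLeaderShortcut {

    /** Try the leader appointed by leader(M) first; fall back to election. */
    static String chooseLeader(String appointedLeader,
                               Predicate<String> connectTo,
                               Supplier<String> runLeaderElection) {
        if (appointedLeader != null && connectTo.test(appointedLeader)) {
            // The appointed leader(M') is up: skip leader election entirely.
            return appointedLeader;
        }
        // Appointed leader is down or unreachable: run a normal election in M'.
        return runLeaderElection.get();
    }

    public static void main(String[] args) {
        // Example: the appointed leader is unreachable, so the election runs.
        String leader = chooseLeader("server3", addr -> false, () -> "server5");
        System.out.println("leader(M') = " + leader);
    }
}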

> clients 

We were discussing this point with Flavio a bit, and there are some initial
ideas. In any case we need some sort of DNS as a fallback for clients that
were disconnected during the reconfiguration - when they wake up there might
no longer be anyone from M alive.

> Why should we not fix ZOOKEEPER-335? 

We should, but I'm currently focusing on the main reconfiguration stuff.
If you can work on this bug, that would of course be great.

> Could this be fixed by sending the message to M' first and then sending
> to M after receiving ack from majority of M'

Yes, and this is exactly how the algorithm in the twiki is structured - phase
2 contacts M', and only after a quorum acks does (the optional) phase 3
garbage-collect M.
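
Schematically, the ordering looks like this (the quorum check and method
names are invented here, only to show that phase 3 runs strictly after a
quorum of M' has acked):

import java.util.List;

// Schematic ordering only: activate M' and wait for a quorum of acks
// before garbage-collecting M. Names are invented for illustration.
final class PhaseOrdering {

    interface NewMember { boolean activateAndAck(); }

    /** Returns true once a majority quorum of M' has acked the activation. */
    static boolean phase2(List<NewMember> newConfig) {
        int acks = 0;
        for (NewMember m : newConfig) {
            if (m.activateAndAck()) {
                acks++;
            }
        }
        return acks > newConfig.size() / 2;
    }

    static void phase3GarbageCollectOldConfig() {
        // Optional: only reached after phase 2 succeeded, so M can be retired
        // without risking loss of committed state.
        System.out.println("retiring old configuration M");
    }

    public static void main(String[] args) {
        List<NewMember> mPrime = List.of(() -> true, () -> true, () -> false);
        if (phase2(mPrime)) {
            phase3GarbageCollectOldConfig();
        }
    }
}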

> I will get in touch with you offline for further clarification.

Sure.
 
> 1. leader(M) does not wait for confirmation of ongoing transactions
> from a majority of M' (during phase 1). How do you guarantee that once M'
> starts leader election, all the transactions that are committed to M are also
> committed to M'? A majority of M' might be lagging behind and one of them might
> end up becoming the leader(M').

Because of FIFO, and since the members of M' are connected as followers from
the beginning of the reconfiguration, every process in M' that gets the
activation message from leader(M) has previously received all the transactions
from leader(M). The first thing that leader(M') does when a quorum of M'
connects to it is commit all these transactions in M'. So I think my proposal
is correct. Having said that, we might want to commit the transactions in M'
earlier if we want to transfer clients to M' gradually, as suggested by
Flavio.
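
A minimal sketch of that "commit everything first" step, assuming leader(M')
already holds the transactions it received as a follower of leader(M) (names
and the zxid values are invented):

import java.util.List;

// Minimal sketch: leader(M') commits every transaction it received as a
// follower of leader(M) before serving anything new. Invented names only.
final class CommitBeforeServing {

    record Txn(long zxid) {}

    /** Called once a quorum of M' has connected to the new leader(M'). */
    static void onQuorumConnected(List<Txn> receivedFromOldLeader) {
        // FIFO order from leader(M) means this list already contains every
        // transaction that preceded the activation message.
        for (Txn t : receivedFromOldLeader) {
            System.out.println("commit zxid 0x" + Long.toHexString(t.zxid()));
        }
        // Only now does leader(M') start accepting new operations in M'.
    }

    public static void main(String[] args) {
        onQuorumConnected(List.of(new Txn(0x100000001L), new Txn(0x100000002L)));
    }
}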


> 2. Why is Step 8. ("Stop processing new operations, return failure for
> any further ops received") necessary? 

In the general case we cannot process operations in M once M' has been
activated, as leader(M') may process operations in the new configuration M'
(otherwise we may get split-brain). Of course, if leader(M) is in M' there is
no need to stop processing operations.
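
The check itself is tiny (the flag and the membership test below are just
illustrative, not code from the patch):

import java.util.Set;

// Illustrative only: once M' is activated, the old leader keeps serving
// requests only if it is itself a member of the new configuration M'.
final class OldLeaderCutoff {

    static boolean mayProcessRequests(boolean newConfigActivated,
                                      String myId,
                                      Set<String> newConfig) {
        if (!newConfigActivated) {
            return true; // still operating purely in M
        }
        // After activation, continuing to serve in M while M' also serves
        // would be split-brain, unless this leader is part of M' too.
        return newConfig.contains(myId);
    }

    public static void main(String[] args) {
        Set<String> mPrime = Set.of("B", "C", "D");
        System.out.println(mayProcessRequests(true, "A", mPrime)); // false
        System.out.println(mayProcessRequests(true, "C", mPrime)); // true
    }
}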

> What should we tell the administrator to do if a majority of M' fail during
> reconfiguration? During normal operations, if a majority of nodes fail, then
> the admin has a choice to copy the DB from one of the live nodes to the rest
> of the nodes and get the cluster going immediately. There is a risk of losing
> some transactions, but there is also a chance that one of the nodes has a
> reasonably up-to-date copy of the data tree. However, during reconfiguration,
> if a majority of M' fail, the cluster is unrecoverable even if a majority of
> M are online. Are we going to assume that the admin needs to take a backup
> before doing the reconfig?

I didn't really understand why the cluster is unrecoverable even if a
majority of M are online. I think it depends on when the crash happens. If a
quorum of M' cannot connect before the reconfiguration begins, we can abort
the reconfiguration. If they fail during phase 1, we can continue in M and
shouldn't start M' until a quorum of M' is up. If they fail during phase 2,
leader(M) will not be able to complete phase 2, so M will not be
garbage-collected, although the system is stuck. If phase 2 completes, then
M' is active and a quorum of M' has all the state, and M can be
garbage-collected. Here I don't see the difference from a normal execution
without reconfigurations.
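
Summarising those cases (the phase names and outcome strings below are mine,
not from the twiki):

// Summary of the failure cases above, assuming a quorum of M' fails.
// Phase names and outcome strings are mine, not from the twiki.
final class NewQuorumFailureOutcome {

    enum Phase { BEFORE_RECONFIG, PHASE_1, PHASE_2, AFTER_PHASE_2 }

    static String outcome(Phase phase) {
        return switch (phase) {
            case BEFORE_RECONFIG -> "abort the reconfiguration";
            case PHASE_1 -> "continue in M; don't start M' until a quorum of M' is up";
            case PHASE_2 -> "phase 2 cannot complete; M is not garbage-collected, but the system is stuck";
            case AFTER_PHASE_2 -> "M' is active with all the state; M can be garbage-collected";
        };
    }

    public static void main(String[] args) {
        for (Phase p : Phase.values()) {
            System.out.println(p + ": " + outcome(p));
        }
    }
}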

Best Regards,
Alex


> Allow dynamic changes to server cluster membership
> --------------------------------------------------
>
>                 Key: ZOOKEEPER-107
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-107
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: server
>            Reporter: Patrick Hunt
>            Assignee: Henry Robinson
>         Attachments: SimpleAddition.rtf
>
>
> Currently cluster membership is statically defined, adding/removing hosts 
> to/from the server cluster dynamically needs to be supported.
