[
https://issues.apache.org/jira/browse/ZOOKEEPER-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13972376#comment-13972376
]
Alexander Shraer commented on ZOOKEEPER-1807:
---------------------------------------------
The failing test raises a couple of interesting issues...
Mainly I think there is a "race" between the completion of FLE, where we only
require a quorum of the old config, and the establishment of a new leader, where
we'd need both old and new quorums if we're recovering from a failed reconfig.
It looks like we should ensure that we have at least a quorum of the new config
before ending FLE and moving to the next stage, where we actually need this
quorum.
Here are two scenarios where this seems important.
Suppose we have A, B in old config and A, B, C, D, E in new one.
Suppose A, B rebooted during reconfig and will now have to recover (commit or
join the new config).
Case 1 (the failing test): C, D, E committed the reconfig. If A and B don't
establish a connection to C, D, E before completing FLE, they won't find out
that the new config was committed and will repeatedly try and fail to complete
the reconfig (they'll fail because they won't get a quorum of the new config).
It's sort of OK since C, D, E are up and running, and possibly C, D, E will
eventually contact A and B, but perhaps we should avoid this scenario anyway. By
ensuring that A and B talk with a quorum of the new config during FLE, we
guarantee that they switch to the new config and don't try to establish a leader
in the old one.
Case 2: C, D, E haven't committed the new config and are actually trying to
connect to A and B. If A and B complete FLE before hearing from C, D, E,
they may again end up giving up and returning to FLE because they have no
quorum of the new config.
So perhaps we should send the notifications to new config too and enforce
having a quorum of new config before FLE is complete...
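To make the proposed rule concrete, here is a rough sketch of the check, not the
actual ZooKeeper code. The class and method names are hypothetical, servers
A..E are mapped to ids 1..5 for illustration, and plain majority quorums are
assumed (a real QuorumVerifier may be hierarchical or weighted).

```java
import java.util.Set;

public class JointQuorumSketch {

    // Hypothetical helper: true if heardFrom contains a strict majority of config.
    static boolean hasMajority(Set<Long> config, Set<Long> heardFrom) {
        long count = config.stream().filter(heardFrom::contains).count();
        return count > config.size() / 2;
    }

    // Proposed rule: FLE may complete only once we have heard from a quorum
    // of the old config AND a quorum of the new config.
    static boolean canFinishElection(Set<Long> oldConfig,
                                     Set<Long> newConfig,
                                     Set<Long> heardFrom) {
        return hasMajority(oldConfig, heardFrom)
            && hasMajority(newConfig, heardFrom);
    }

    public static void main(String[] args) {
        Set<Long> oldConfig = Set.of(1L, 2L);                 // A, B
        Set<Long> newConfig = Set.of(1L, 2L, 3L, 4L, 5L);     // A, B, C, D, E

        // A and B have only heard from each other: a quorum of the old
        // config, but only 2 of 5 in the new config, so FLE should not end.
        System.out.println(
            canFinishElection(oldConfig, newConfig, Set.of(1L, 2L)));

        // After also hearing from C, there are 3 of 5 in the new config.
        System.out.println(
            canFinishElection(oldConfig, newConfig, Set.of(1L, 2L, 3L)));
    }
}
```

With a check like this, A and B in both cases above would stay in FLE until they
reach at least one of C, D, E, instead of finishing with only the old quorum and
failing later.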
> Observers spam each other creating connections to the election addr
> -------------------------------------------------------------------
>
> Key: ZOOKEEPER-1807
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1807
> Project: ZooKeeper
> Issue Type: Bug
> Reporter: Raul Gutierrez Segales
> Assignee: Alexander Shraer
> Priority: Blocker
> Fix For: 3.5.0
>
> Attachments: ZOOKEEPER-1807-alex.patch, ZOOKEEPER-1807-ver2.patch,
> ZOOKEEPER-1807-ver3.patch, ZOOKEEPER-1807-ver4.patch,
> ZOOKEEPER-1807-ver5.patch, ZOOKEEPER-1807.patch, notifications-loop.png
>
>
> Hey [~shralex],
> I noticed today that my Observers are spamming each other trying to open
> connections to the election port. I've got tons of these:
> {noformat}
> 2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a
> connection already for server 9
> 2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a
> connection already for server 10
> 2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a
> connection already for server 6
> 2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a
> connection already for server 12
> 2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a
> connection already for server 14
> {noformat}
> and so on and so on, ad nauseam.
> Now, looking around I found this inside FastLeaderElection.java from when you
> committed ZOOKEEPER-107:
> {noformat}
> private void sendNotifications() {
> - for (QuorumServer server : self.getVotingView().values()) {
> - long sid = server.id;
> -
> + for (long sid : self.getAllKnownServerIds()) {
> + QuorumVerifier qv = self.getQuorumVerifier();
> {noformat}
> Is that really desired? I suspect that is what's causing Observers to try to
> connect to each other (as opposed to just connecting to participants). I'll
> give it a try now and let you know. (Also, we use observer ids that are > 0,
> and I saw some parts of the code that might not deal with that assumption -
> so it could be that too..).