[
https://issues.apache.org/jira/browse/ZOOKEEPER-1807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062797#comment-14062797
]
Alexander Shraer commented on ZOOKEEPER-1807:
---------------------------------------------
[~fpj], I'm trying to test your theory that new servers will continue to ping
old ones until they connect. This scenario (which I described in my previous
message) comes up in testNextConfigAlreadyActive in ReconfigRecoveryTest, which
fails with the latest patch.
It seems that servers 2, 3 and 4 try to contact 0 and 1, but only once or twice,
and then stop trying. Do you know why this could be happening, or where the
retry logic is implemented? The log below is everything I get with respect to
connection attempts, even if I wait longer.
3 Opening channel to server 0
2 Opening channel to server 0
2 Cannot open channel to 0 at election address localhost/127.0.0.1:11223
3 Cannot open channel to 0 at election address localhost/127.0.0.1:11223
3 Opening channel to server 1
2 Opening channel to server 1
3 Cannot open channel to 1 at election address localhost/127.0.0.1:11226
3 Opening channel to server 2
2 Cannot open channel to 1 at election address localhost/127.0.0.1:11226
3 Connected to server 2
2 Opening channel to server 3
2 Connected to server 3
4 Opening channel to server 0
4 Cannot open channel to 0 at election address localhost/127.0.0.1:11223
4 Opening channel to server 1
4 Cannot open channel to 1 at election address localhost/127.0.0.1:11226
4 Opening channel to server 2
4 Connected to server 2
2 Opening channel to server 4
2 Connected to server 4
3 Opening channel to server 4
3 Connected to server 4
2 Opening channel to server 0
4 Opening channel to server 3
4 Connected to server 3
2 Cannot open channel to 0 at election address localhost/127.0.0.1:11223
2 Opening channel to server 1
2 Cannot open channel to 1 at election address localhost/127.0.0.1:11226
4 Opening channel to server 3
4 Connected to server 3
2 Opening channel to server 0
2 Cannot open channel to 0 at election address localhost/127.0.0.1:11223
2 Opening channel to server 1
2 Cannot open channel to 1 at election address localhost/127.0.0.1:11226
3 Opening channel to server 0
3 Cannot open channel to 0 at election address localhost/127.0.0.1:11223
3 Opening channel to server 1
3 Cannot open channel to 1 at election address localhost/127.0.0.1:11226
0 Opening channel to server 1
0 Cannot open channel to 1 at election address localhost/127.0.0.1:11226
0 Opening channel to server 1
0 Cannot open channel to 1 at election address localhost/127.0.0.1:11226
1 Opening channel to server 0
1 Connected to server 0
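To check how often each server actually retries, the "Opening channel" lines can be counted per (source, target) pair. The following is only a rough sketch: it assumes the abbreviated "<sid> Opening channel to server <sid>" format used in the log above, and the class name ChannelAttemptTally is made up for illustration (it is not part of ZooKeeper).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of ZooKeeper: counts connection attempts
// per "source->target" pair in a log abbreviated as in this comment.
public class ChannelAttemptTally {
    // Matches lines such as "3 Opening channel to server 0".
    private static final Pattern OPENING =
            Pattern.compile("^(\\d+) Opening channel to server (\\d+)$");

    /** Returns a sorted map from "source->target" to the number of attempts. */
    public static Map<String, Integer> tally(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            Matcher m = OPENING.matcher(line.trim());
            if (m.matches()) {
                counts.merge(m.group(1) + "->" + m.group(2), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // A few lines copied from the log above; feed the full log in practice.
        List<String> sample = Arrays.asList(
                "3 Opening channel to server 0",
                "2 Opening channel to server 0",
                "2 Cannot open channel to 0 at election address localhost/127.0.0.1:11223",
                "2 Opening channel to server 0");
        tally(sample).forEach((pair, n) -> System.out.println(pair + " attempts: " + n));
    }
}
```

Lines that do not match the pattern ("Cannot open channel", "Connected to server") are ignored, so the counts reflect attempts only, not outcomes.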
> Observers spam each other creating connections to the election addr
> -------------------------------------------------------------------
>
> Key: ZOOKEEPER-1807
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1807
> Project: ZooKeeper
> Issue Type: Bug
> Reporter: Raul Gutierrez Segales
> Assignee: Alexander Shraer
> Priority: Blocker
> Fix For: 3.5.0
>
> Attachments: ZOOKEEPER-1807-alex.patch, ZOOKEEPER-1807-ver2.patch,
> ZOOKEEPER-1807-ver3.patch, ZOOKEEPER-1807-ver4.patch,
> ZOOKEEPER-1807-ver5.patch, ZOOKEEPER-1807.patch, notifications-loop.png
>
>
> Hey [~shralex],
> I noticed today that my Observers are spamming each other trying to open
> connections to the election port. I've got tons of these:
> {noformat}
> 2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a connection already for server 9
> 2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a connection already for server 10
> 2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a connection already for server 6
> 2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a connection already for server 12
> 2013-11-01 22:19:45,819 - DEBUG [WorkerSender[myid=13]] - There is a connection already for server 14
> {noformat}
> and so on, ad nauseam.
> Now, looking around I found this inside FastLeaderElection.java from when you
> committed ZOOKEEPER-107:
> {noformat}
> private void sendNotifications() {
> - for (QuorumServer server : self.getVotingView().values()) {
> - long sid = server.id;
> -
> + for (long sid : self.getAllKnownServerIds()) {
> + QuorumVerifier qv = self.getQuorumVerifier();
> {noformat}
> Is that really desired? I suspect that is what's causing Observers to try to
> connect to each other (as opposed to just connecting to participants). I'll
> give it a try now and let you know. (Also, we use observer ids that are > 0,
> and I saw some parts of the code that might not deal with that assumption,
> so it could be that too.)
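The suspicion about the sendNotifications() change can be illustrated with a toy model. This is a sketch only: the Role enum, the class name and both helper methods are hypothetical and do not use ZooKeeper's real types, and the excerpt above is truncated, so any filtering done later in the real loop is not modeled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model, not ZooKeeper code: compares the set of notification targets
// when looping over voting members only versus over every known server id.
public class NotificationTargets {
    enum Role { PARTICIPANT, OBSERVER }

    /** Targets if the loop covers voting members only (the removed lines). */
    static List<Long> votingTargets(Map<Long, Role> view) {
        List<Long> targets = new ArrayList<>();
        for (Map.Entry<Long, Role> e : view.entrySet()) {
            if (e.getValue() == Role.PARTICIPANT) {
                targets.add(e.getKey());
            }
        }
        return targets;
    }

    /** Targets if the loop covers every known server id (the added lines). */
    static List<Long> allKnownTargets(Map<Long, Role> view) {
        return new ArrayList<>(view.keySet());
    }

    public static void main(String[] args) {
        // Made-up ensemble: two participants and two observers.
        Map<Long, Role> view = new LinkedHashMap<>();
        view.put(1L, Role.PARTICIPANT);
        view.put(2L, Role.PARTICIPANT);
        view.put(13L, Role.OBSERVER);
        view.put(14L, Role.OBSERVER);

        System.out.println("voting view only: " + votingTargets(view));
        System.out.println("all known ids:    " + allKnownTargets(view));
    }
}
```

In this model, an observer such as 13 would only address 1 and 2 under the old loop, but would also address 14 (and itself) under the new one, which is consistent with observers opening election connections to each other.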