[
https://issues.apache.org/jira/browse/ZOOKEEPER-2836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16133178#comment-16133178
]
ASF GitHub Bot commented on ZOOKEEPER-2836:
-------------------------------------------
Github user skamille commented on a diff in the pull request:
https://github.com/apache/zookeeper/pull/336#discussion_r133994057
--- Diff: src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java ---
@@ -638,13 +639,22 @@ public void run() {
LOG.info("My election bind port: " + addr.toString());
setName(addr.toString());
ss.bind(addr);
+ ss.setSoTimeout(10 * 1000); // Ten seconds
+ long acceptStartTime = System.currentTimeMillis();
while (!shutdown) {
- client = ss.accept();
- setSockOpts(client);
- LOG.info("Received connection request "
- + client.getRemoteSocketAddress());
- receiveConnection(client);
- numRetries = 0;
+ try {
+ client = ss.accept();
+ setSockOpts(client);
+ LOG.info("Received connection request "
+ + client.getRemoteSocketAddress());
+ receiveConnection(client);
+ numRetries = 0;
+ } catch (SocketTimeoutException e) {
+ LOG.warn("The socket is listening for the election accepted "
+ + "an unexpected timeout ["
+ + (System.currentTimeMillis() - acceptStartTime) + "]milliseconds"
+ + "after the call to accept(). is this an instance of bug ZOOKEEPER-2836?");
--- End diff --
I don't love this error message, and it doesn't make sense: we've set the
socket timeout above, so any timeout here comes from our own setting rather
than from that weird possible JVM error. So either we set the timeout and just
continue after it fires, or we leave the timeout at 0 and add a log statement
saying that accept() timed out unexpectedly but will be retried. The double fix
doesn't really make sense to me.
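For illustration, the first option (keep the timeout and simply continue) might
look roughly like the sketch below. This is not the actual QuorumCnxManager
code: `AcceptLoopSketch`, `acceptLoop`, the `shutdown` flag type, and the
`handler` callback (standing in for `receiveConnection`) are hypothetical names
chosen for the example.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

/** Hypothetical stand-in for the listener loop; not the real QuorumCnxManager. */
final class AcceptLoopSketch {

    /**
     * Accepts connections until {@code shutdown} is set.
     * Returns how many times accept() timed out, purely so callers can observe it.
     */
    static int acceptLoop(ServerSocket ss, int timeoutMs,
                          AtomicBoolean shutdown, Consumer<Socket> handler)
            throws IOException {
        // With a non-zero SO_TIMEOUT, accept() is expected to time out periodically.
        ss.setSoTimeout(timeoutMs);
        int timeouts = 0;
        while (!shutdown.get()) {
            try {
                Socket client = ss.accept();
                handler.accept(client); // the handler owns (and closes) the socket
            } catch (SocketTimeoutException e) {
                // Expected because we set the timeout ourselves: not an error,
                // not a retry-counted failure. Just go around the loop again.
                timeouts++;
            }
        }
        return timeouts;
    }
}
```

The point of the sketch is that a timeout we configured ourselves is handled
silently and never counts toward the retry limit, so the listener thread cannot
exit because of it. The alternative (timeout left at 0) would keep the same
catch block but log a warning there, since a timeout would then be unexpected.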
> QuorumCnxManager.Listener Thread Better handling of SocketTimeoutException
> --------------------------------------------------------------------------
>
> Key: ZOOKEEPER-2836
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2836
> Project: ZooKeeper
> Issue Type: Bug
> Components: leaderElection, quorum
> Affects Versions: 3.4.6
> Environment: Machine: Linux 3.2.0-4-amd64 #1 SMP Debian 3.2.78-1
> x86_64 GNU/Linux
> Java Version: jdk64/jdk1.8.0_40
> zookeeper version: 3.4.6.2.3.2.0-2950
> Reporter: Amarjeet Singh
> Priority: Critical
>
> The QuorumCnxManager Listener thread blocks on ServerSocket.accept(), but we are
> getting a SocketTimeoutException on our boxes after 49 days 17 hours. As per the
> current code, there are 3 retries, and after that it says "_As I'm leaving
> the listener thread, I won't be able to participate in leader election any
> longer: $<hostname>/$<ip>:3888__". Once server nodes reach this state and
> we restart or add a new node, it fails to join the cluster and logs 'WARN
> QuorumPeer<myid=1>/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@383 - Cannot open
> channel to 3 at election address $<hostname>/$<ip>:3888'.
> As there is no timeout specified for the ServerSocket, it should never
> time out, but there are already-discussed issues where people have seen
> this problem and added explicit checks for SocketTimeoutException, such as
> https://issues.apache.org/jira/browse/KARAF-3325.
> I think we need to handle SocketTimeoutException along similar lines in
> ZooKeeper as well.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)