[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-2836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16131777#comment-16131777
 ] 

ASF GitHub Bot commented on ZOOKEEPER-2836:
-------------------------------------------

Github user bitgaoshu commented on a diff in the pull request:

    https://github.com/apache/zookeeper/pull/336#discussion_r133885327
  
    --- Diff: src/java/main/org/apache/zookeeper/server/quorum/QuorumCnxManager.java ---
    @@ -647,11 +648,10 @@ public void run() {
                             numRetries = 0;
                         }
                     } catch (IOException e) {
    -                    if (shutdown) {
    -                        break;
    -                    }
                         LOG.error("Exception while listening", e);
    -                    numRetries++;
    +                    if (!(e instanceof SocketTimeoutException)) {
    --- End diff --
    
    - update
    
    - I checked the native method `java.net.PlainSocketImpl.socketAccept(Native Method)` in
      [openjdk](http://hg.openjdk.java.net/jdk8u/jdk8u/jdk/file/9d617cfd6717/src/solaris/native/java/net/PlainSocketImpl.c),
      **lines 709-721**, where a timeout of 0 is changed to -1, and a timeout of -1 is then
      interpreted as an infinite timeout. In some cases, [-1 was interpreted as a
      large positive integer](https://lwn.net/Articles/483078/), which would explain why this
      issue always happened after 49 days. This is only my wild conjecture.
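
As a quick arithmetic check of the conjecture above, the sketch below reinterprets -1 as an unsigned 32-bit value and reads it as a millisecond count. The class name `TimeoutWrapCheck` and the helper `describe` are made up for illustration. That the timeout is in milliseconds and gets treated as an unsigned 32-bit quantity is an assumption, not something confirmed by the JDK source.

```java
import java.time.Duration;

class TimeoutWrapCheck {
    // Formats a millisecond count as days, hours, minutes and whole seconds.
    static String describe(long millis) {
        Duration d = Duration.ofMillis(millis);
        long days = d.toDays();
        long hours = d.toHours() % 24;
        long minutes = d.toMinutes() % 60;
        long seconds = d.getSeconds() % 60;
        return days + "d " + hours + "h " + minutes + "m " + seconds + "s";
    }

    public static void main(String[] args) {
        // Assumption: -1 ends up being read as an unsigned 32-bit millisecond count.
        long asUnsigned = Integer.toUnsignedLong(-1); // 4294967295
        System.out.println(asUnsigned + " ms = " + describe(asUnsigned));
        // prints "4294967295 ms = 49d 17h 2m 47s"
    }
}
```

The result is consistent with the "49 days 17 hours" reported in the issue, but it does not by itself show where such a reinterpretation would happen.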


> QuorumCnxManager.Listener Thread Better handling of SocketTimeoutException
> --------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-2836
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-2836
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: leaderElection, quorum
>    Affects Versions: 3.4.6
>         Environment: Machine: Linux 3.2.0-4-amd64 #1 SMP Debian 3.2.78-1 
> x86_64 GNU/Linux
> Java Version: jdk64/jdk1.8.0_40
> zookeeper version:  3.4.6.2.3.2.0-2950 
>            Reporter: Amarjeet Singh
>            Priority: Critical
>
> The QuorumCnxManager Listener thread blocks on ServerSocket.accept(), but we are 
> getting a SocketTimeoutException on our boxes after 49 days 17 hours. In the 
> current code there are 3 retries, after which it logs "_As I'm leaving 
> the listener thread, I won't be able to participate in leader election any 
> longer: $<hostname>/$<ip>:3888__". Once server nodes reach this state and 
> we restart or add a new node, it fails to join the cluster and logs 'WARN  
> QuorumPeer<myid=1>/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@383 - Cannot open 
> channel to 3 at election address $<hostname>/$<ip>:3888'.
>         As there is no timeout specified for the ServerSocket, it should never 
> time out, but there are already-discussed issues where people have seen 
> this behavior and added explicit checks for SocketTimeoutException, such as 
> https://issues.apache.org/jira/browse/KARAF-3325. 
>         I think we need to handle SocketTimeoutException along similar lines in 
> ZooKeeper as well.
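
The handling suggested in the description, and hinted at by the `instanceof SocketTimeoutException` check in the diff, could look roughly like the sketch below. It is not the actual ZooKeeper patch. The `Acceptor` interface, the `run` method and its parameters are hypothetical names introduced so that the loop can be exercised without a real socket. In real code, `acceptor.accept()` would be `ServerSocket.accept()`.

```java
import java.io.IOException;
import java.net.Socket;
import java.net.SocketTimeoutException;

class TolerantAcceptLoop {
    // Hypothetical stand-in for ServerSocket.accept().
    interface Acceptor {
        Socket accept() throws IOException;
    }

    // Calls accept() until maxRetries non-timeout failures have accumulated
    // or maxIterations calls have been made; returns the failure count.
    static int run(Acceptor acceptor, int maxRetries, int maxIterations) {
        int numRetries = 0;
        for (int i = 0; i < maxIterations && numRetries < maxRetries; i++) {
            try {
                Socket client = acceptor.accept();
                if (client != null) {
                    client.close(); // a real listener would hand the socket off instead
                }
                numRetries = 0; // a successful accept resets the counter
            } catch (SocketTimeoutException e) {
                // Unexpected timeout: keep listening and do not count it as a failure.
            } catch (IOException e) {
                numRetries++; // any other I/O failure counts toward the retry limit
            }
        }
        return numRetries;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulates two timeouts followed by persistent I/O failures.
        Acceptor flaky = () -> {
            if (calls[0]++ < 2) {
                throw new SocketTimeoutException("simulated timeout");
            }
            throw new IOException("simulated failure");
        };
        System.out.println("counted failures: " + run(flaky, 3, 10));
    }
}
```

With this shape, a timeout alone never makes the listener give up. Only other `IOException`s count toward the retry limit.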



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
