[jira] [Commented] (ZOOKEEPER-1669) Operations to server will be timed-out while thousands of sessions expired same time

Hadoop QA (JIRA) Thu, 03 Aug 2017 06:46:18 -0700

    [ 
https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16112715#comment-16112715
 ]


Hadoop QA commented on ZOOKEEPER-1669:
--------------------------------------

+1 overall.  GitHub Pull Request  Build
      

    +1 @author.  The patch does not contain any @author tags.

    +0 tests included.  The patch appears to be a documentation patch that 
doesn't require tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs (version 3.0.1) 
warnings.

    +1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

    +1 core tests.  The patch passed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/920//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/920//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/920//console

This message is automatically generated.

> Operations to server will be timed-out while thousands of sessions expired 
> same time
> ------------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1669
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1669
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: server
>    Affects Versions: 3.3.5
>            Reporter: tokoot
>            Assignee: Cheney Sun
>              Labels: performance
>
> If there are thousands of clients and most of them disconnect from the 
> server at the same time (clients restarted, or servers partitioned from 
> clients), the server is kept busy closing those "connections" and becomes 
> unavailable. The problem is in the following code:
>   private void closeSessionWithoutWakeup(long sessionId) {
>       HashSet<NIOServerCnxn> cnxns;
>       synchronized (this.cnxns) {
>           // other threads will block here
>           cnxns = (HashSet<NIOServerCnxn>) this.cnxns.clone();
>       }
>       ...
>   }
> A real-world example that demonstrates this problem (kudos to [~sun.cheney]):
> {noformat}
> The issue arises when tens of thousands of clients try to reconnect to the 
> ZooKeeper service. 
> We came across it while maintaining our HBase cluster, which used a 
> 5-server ZooKeeper cluster. 
> The HBase cluster was composed of a very large number of regionservers (on 
> the order of thousands) 
> and was accessed by tens of thousands of clients doing massive reads/writes. 
> Because the r/w throughput was very high, the ZooKeeper zxid increased 
> quickly as well. 
> Roughly every two or three weeks, ZooKeeper would run a leader re-election 
> triggered by the zxid rollover. 
> The leader re-election causes the clients (HBase regionservers and HBase 
> clients) to disconnect from 
> and reconnect to the ZooKeeper servers at the same time and try to renew 
> their sessions.
> In the current implementation of session renewal, NIOServerCnxnFactory first 
> clones all the connections 
> to avoid race conditions between threads, then iterates over the cloned 
> connection set one by one to 
> find the session to renew. This is very time-consuming. In our case 
> (described above), 
> it meant many region servers could not renew their sessions before the 
> session timeout, 
> and eventually the HBase cluster lost these region servers, which affected 
> HBase stability.
> The change refactors the close-session logic and introduces a 
> ConcurrentHashMap 
> that stores the mapping from session id to connection. Because it is a 
> thread-safe data structure, 
> it eliminates the need to clone the connection set first.
> {noformat}
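
For illustration only, the idea in the description above (look up the connection by session id in a ConcurrentHashMap instead of cloning and scanning the whole connection set) might be sketched as follows. This is not the actual patch: the class SessionIndex, the stand-in type Cnxn, and the methods register, closeSession and size are hypothetical names chosen for this sketch, and the real NIOServerCnxnFactory code may differ.

```java
import java.util.concurrent.ConcurrentHashMap;

/**
 * Hypothetical sketch, not the ZooKeeper patch: an index from session id to
 * connection, so closing a session needs no lock on, or clone of, the full
 * connection set.
 */
class SessionIndex {
    /** Hypothetical stand-in for NIOServerCnxn. */
    static final class Cnxn {
        private final long sessionId;
        private volatile boolean closed;

        Cnxn(long sessionId) { this.sessionId = sessionId; }
        long getSessionId() { return sessionId; }
        boolean isClosed() { return closed; }
        void close() { closed = true; }
    }

    // Thread-safe map: lookups and removals do not block other sessions.
    private final ConcurrentHashMap<Long, Cnxn> bySession = new ConcurrentHashMap<>();

    /** Records the connection under its session id. */
    void register(Cnxn cnxn) {
        bySession.put(cnxn.getSessionId(), cnxn);
    }

    /** Closes the connection for the session; returns false if none is registered. */
    boolean closeSession(long sessionId) {
        // remove() is atomic, so at most one caller gets the connection to close.
        Cnxn cnxn = bySession.remove(sessionId);
        if (cnxn == null) {
            return false;
        }
        cnxn.close();
        return true;
    }

    int size() {
        return bySession.size();
    }
}
```

With a structure like this, closing one session is a single map removal rather than a copy of every connection under a shared lock, which is the cost the description identifies.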



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
