[ https://issues.apache.org/jira/browse/ZOOKEEPER-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16112715#comment-16112715 ]
Hadoop QA commented on ZOOKEEPER-1669:
--------------------------------------
+1 overall. GitHub Pull Request Build
+1 @author. The patch does not contain any @author tags.
+0 tests included. The patch appears to be a documentation patch that
doesn't require tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac
compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs (version 3.0.1)
warnings.
+1 release audit. The applied patch does not increase the total number of
release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results:
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/920//testReport/
Findbugs warnings:
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/920//artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output:
https://builds.apache.org/job/PreCommit-ZOOKEEPER-github-pr-build/920//console
This message is automatically generated.
> Operations to server will be timed-out while thousands of sessions expired
> same time
> ------------------------------------------------------------------------------------
>
> Key: ZOOKEEPER-1669
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1669
> Project: ZooKeeper
> Issue Type: Improvement
> Components: server
> Affects Versions: 3.3.5
> Reporter: tokoot
> Assignee: Cheney Sun
> Labels: performance
>
> If there are thousands of clients and most of them disconnect from the server
> at the same time (clients restarted, or servers partitioned from clients), the
> server becomes busy closing those "connections" and becomes unavailable. The
> problem is in the following code:
> private void closeSessionWithoutWakeup(long sessionId) {
>     HashSet<NIOServerCnxn> cnxns;
>     synchronized (this.cnxns) {
>         // other threads will block here while the set is cloned
>         cnxns = (HashSet<NIOServerCnxn>)this.cnxns.clone();
>     }
>     ...
> }
> A real-world example that demonstrates this problem (kudos to [~sun.cheney]):
> {noformat}
> The issue arises when tens of thousands of clients try to reconnect to the
> ZooKeeper service.
> We came across the issue while maintaining our HBase cluster, which used a
> 5-server ZooKeeper cluster.
> The HBase cluster was composed of a very large number of regionservers (on
> the order of thousands), and tens of thousands of clients connected to it to
> do massive reads/writes.
> Because the r/w throughput is very high, the ZooKeeper zxid increased quickly
> as well.
> Every two or three weeks, ZooKeeper would hold a leader re-election triggered
> by the zxid rollover.
> The leader re-election causes the clients (HBase regionservers and HBase
> clients) to disconnect from and reconnect to the ZooKeeper servers at the
> same time, and to try to renew their sessions.
> In the current implementation of session renewal, NIOServerCnxnFactory first
> clones all the connections in order to avoid race conditions between threads,
> and then iterates over the cloned connection set one by one to find the
> related session to renew. This is very time-consuming. In our case (described
> above), it meant that many region servers could not renew their sessions
> before the session timeout, and eventually the HBase cluster lost these
> region servers, which affected HBase stability.
> The change refactors the close-session logic and introduces a
> ConcurrentHashMap to store the mapping from session id to connection. It is a
> thread-safe data structure and eliminates the need to clone the connection
> set first.
> {noformat}
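
For readers following the description above, here is a minimal sketch of the idea it describes: index connections by session id in a ConcurrentHashMap so that closing a session is a single lookup rather than a clone-and-scan of the whole connection set. This is not the actual patch. The class SessionIndexSketch, the Cnxn stand-in for NIOServerCnxn, and the method names addSession and closeSession are all hypothetical and chosen only for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative only: hypothetical names, not the ZooKeeper implementation.
class SessionIndexSketch {

    /** Hypothetical stand-in for NIOServerCnxn. */
    static final class Cnxn {
        private final long sessionId;
        private volatile boolean closed;

        Cnxn(long sessionId) { this.sessionId = sessionId; }

        long getSessionId() { return sessionId; }
        void close() { closed = true; }
        boolean isClosed() { return closed; }
    }

    // Session id -> connection. ConcurrentHashMap is thread-safe, so no
    // external lock and no defensive clone of the connection set is needed.
    private final ConcurrentMap<Long, Cnxn> sessionMap = new ConcurrentHashMap<>();

    /** Register a connection once its session id is known. */
    void addSession(long sessionId, Cnxn cnxn) {
        sessionMap.put(sessionId, cnxn);
    }

    /**
     * Close the connection for one session with a single map operation.
     * remove() is atomic, so at most one caller gets the connection back
     * even if several threads try to close the same session.
     */
    boolean closeSession(long sessionId) {
        Cnxn cnxn = sessionMap.remove(sessionId);
        if (cnxn == null) {
            return false; // unknown session, or already closed
        }
        cnxn.close();
        return true;
    }

    int size() {
        return sessionMap.size();
    }

    public static void main(String[] args) {
        SessionIndexSketch index = new SessionIndexSketch();
        Cnxn cnxn = new Cnxn(42L);
        index.addSession(cnxn.getSessionId(), cnxn);
        System.out.println(index.closeSession(42L)); // first close succeeds
        System.out.println(index.closeSession(42L)); // second close finds nothing
    }
}
```

Compared with the synchronized clone in closeSessionWithoutWakeup, the cost of closing one session in this sketch does not grow with the number of open connections, which is the property that matters when thousands of sessions expire together.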
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)