[ https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438622#comment-13438622 ]
Hudson commented on HBASE-6364:
-------------------------------

Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #140 (See [https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/140/])
HBASE-6364 Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table (Revision 1375473)

Result = FAILURE
nkeywal :
Files :
* /hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/HConstants.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/ClientCache.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/HBaseClient.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/util/EnvironmentEdgeManager.java
* /hbase/trunk/hbase-server/src/main/java/org/apache/hadoop/hbase/util/ManualEnvironmentEdge.java
* /hbase/trunk/hbase-server/src/test/java/org/apache/hadoop/hbase/ipc/TestHBaseClient.java

> Powering down the server host holding the .META. table causes HBase Client to take excessively long to recover and connect to reassigned .META. table
> -----------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-6364
>                 URL: https://issues.apache.org/jira/browse/HBASE-6364
>             Project: HBase
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 0.90.6, 0.92.1, 0.94.0
>            Reporter: Suraj Varma
>            Assignee: nkeywal
>              Labels: client
>             Fix For: 0.96.0
>
>         Attachments: 6364-host-serving-META.v1.patch, 6364.v11.nolargetest.patch, 6364.v1.patch, 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 6364.v7.withtests.patch, 6364.v8.withtests.patch, 6364.v9.patch, stacktrace.txt
>
>
> When a server host with a Region Server holding the .META. table is powered down on a live cluster, the HBase cluster itself detects and reassigns the .META. table, but connected HBase clients take an excessively long time to detect this and re-discover the reassigned .META.
> Workaround: Decrease ipc.socket.timeout on the HBase client side to a low value (the default of 20s led to a 35-minute recovery time; we got acceptable results with 100 ms, giving a 3-minute recovery).
> This was found during hardware failure testing scenarios.
> Test Case:
> 1) Apply load via a client app on the HBase cluster for several minutes
> 2) Power down the region server holding the .META. table (i.e. power off ... and keep it off)
> 3) Measure how long it takes for the cluster to reassign the META table and for client threads to re-lookup and re-orient to the lesser cluster (minus the RS and DN on that host).
> Observation:
> 1) Client threads spike up to maxThreads size ... and take over 35 mins to recover (i.e. for the thread count to go back to normal) - no client calls are serviced - they just back up on a synchronized method (see #2 below)
> 2) All the client app threads queue up behind the oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
> After taking several thread dumps we found that the thread within this synchronized method was blocked on NetUtils.connect(this.socket, remoteId.getAddress(), getSocketTimeout(conf));
> The client thread that holds the synchronized lock tries to connect to the dead RS (until the socket times out after 20s), retries, and then the next thread gets in, and so forth in a serial manner.
> Workaround:
> -------------------
> The default ipc.socket.timeout is 20s. We dropped this to a low number (1000 ms, 100 ms, etc.) in the client-side hbase-site.xml. With this setting, the client threads recovered in a couple of minutes by failing fast and re-discovering the .META. table on a reassigned RS.
> Assumption: ipc.socket.timeout is only ever used during the initial "HConnection" setup via NetUtils.connect, and so should only come into play when connectivity to a region server is lost and needs to be re-established; i.e. it does not affect normal "RPC" activity, as it is just the connect timeout.
> During RS GC periods, any _new_ clients trying to connect will fail and will require .META. table re-lookups.
> The above timeout workaround applies only to the HBase client side.
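The serial one-timeout-per-thread behavior described in the report can be reproduced outside HBase. Below is a minimal, self-contained Java sketch, not HBase code: the class name, the TEST-NET address, and the timeout constant are illustrative stand-ins for oahh.ipc.HBaseClient#setupIOStreams, a powered-down region server, and ipc.socket.timeout. Because the setup method is synchronized, each queued thread pays the full connect timeout in turn rather than failing concurrently.

{code:java}
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class SerializedConnectSketch {

    // Stand-in for ipc.socket.timeout; the HBase default at the time was 20s.
    private static final int CONNECT_TIMEOUT_MS = 20_000;

    // Analogous to the synchronized setupIOStreams method: concurrent
    // callers serialize on this monitor while one of them blocks in connect().
    private static synchronized void setupIOStreams(InetSocketAddress addr)
            throws IOException {
        try (Socket socket = new Socket()) {
            // Blocks for up to CONNECT_TIMEOUT_MS when the host is powered
            // off: a dead machine sends back nothing, not even a RST.
            socket.connect(addr, CONNECT_TIMEOUT_MS);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // 192.0.2.1 (TEST-NET-1, blackholed) stands in for the dead RS host.
        InetSocketAddress dead = new InetSocketAddress("192.0.2.1", 60020);
        for (int i = 0; i < 5; i++) {
            final int id = i;
            new Thread(() -> {
                long start = System.nanoTime();
                try {
                    setupIOStreams(dead);
                } catch (IOException expected) {
                    // Thread k fails only after ~k * CONNECT_TIMEOUT_MS,
                    // mirroring the serial recovery seen in the thread dumps.
                    System.out.printf("thread %d failed after %d ms%n",
                            id, (System.nanoTime() - start) / 1_000_000);
                }
            }).start();
        }
        Thread.sleep(120_000); // let all queued threads time out in turn
    }
}
{code}

With the 20s default, the fifth thread in this sketch is stuck for roughly 100s; scaled up to a client pool at maxThreads with retries, that is the 35-minute recovery described above.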
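For reference, a sketch of applying the workaround programmatically instead of via hbase-site.xml. It assumes only the standard client-side Configuration API; the property name ipc.socket.timeout comes from the description above, and the 100 ms value is the one the reporter found acceptable, not a general recommendation.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FastFailClientConfig {
    public static void main(String[] args) {
        // Loads hbase-default.xml and any hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();

        // Workaround from this issue: lower the connect timeout from the
        // 20s default so a thread blocked on a dead host fails fast. The
        // same effect comes from a <property> entry for ipc.socket.timeout
        // in the client-side hbase-site.xml.
        conf.setInt("ipc.socket.timeout", 100); // milliseconds

        System.out.println("ipc.socket.timeout = "
                + conf.getInt("ipc.socket.timeout", 20_000) + " ms");
        // Pass this conf when creating the client connection / HTable so the
        // shortened timeout governs NetUtils.connect during HConnection setup.
    }
}
{code}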