[ https://issues.apache.org/jira/browse/HBASE-6364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13438612#comment-13438612 ]
nkeywal commented on HBASE-6364:
--------------------------------

Committed revision 1375473. After another fishy and not reproduced error on org.apache.hadoop.hbase.TestMultiVersions.

> Powering down the server host holding the .META. table causes HBase Client to
> take excessively long to recover and connect to reassigned .META. table
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-6364
>                 URL: https://issues.apache.org/jira/browse/HBASE-6364
>             Project: HBase
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 0.90.6, 0.92.1, 0.94.0
>            Reporter: Suraj Varma
>            Assignee: nkeywal
>              Labels: client
>             Fix For: 0.96.0
>
>         Attachments: 6364-host-serving-META.v1.patch, 6364.v11.nolargetest.patch, 6364.v1.patch, 6364.v1.patch, 6364.v2.patch, 6364.v3.patch, 6364.v3.patch, 6364.v5.patch, 6364.v5.withtests.patch, 6364.v6.patch, 6364.v6.withtests.patch, 6364.v7.withtests.patch, 6364.v8.withtests.patch, 6364.v9.patch, stacktrace.txt
>
> When a server host with a Region Server holding the .META. table is powered down on a live cluster, the HBase cluster itself detects the failure and reassigns the .META. table, but connected HBase clients take an excessively long time to detect this and re-discover the reassigned .META.
>
> Workaround: Decrease ipc.socket.timeout on the HBase client side to a low value (the default of 20s leads to a 35-minute recovery time; we were able to get acceptable results with 100ms, giving a 3-minute recovery).
>
> This was found during some hardware failure testing scenarios.
>
> Test Case:
> 1) Apply load via a client app on the HBase cluster for several minutes.
> 2) Power down the region server holding the .META. table (i.e. power off ... and keep it off).
> 3) Measure how long it takes for the cluster to reassign the .META. table and for client threads to re-lookup and re-orient to the lesser cluster (minus the RS and DN on that host).
>
> Observation:
> 1) Client threads spike up to maxThreads size ... and take over 35 minutes to recover (i.e. for the thread count to go back to normal) - no client calls are serviced - they just back up on a synchronized method (see #2 below).
> 2) All the client app threads queue up behind the oahh.ipc.HBaseClient#setupIOStreams method http://tinyurl.com/7js53dj
>
> After taking several thread dumps we found that the thread inside this synchronized method was blocked on NetUtils.connect(this.socket, remoteId.getAddress(), getSocketTimeout(conf)); The client thread that holds the synchronized lock tries to connect to the dead RS (until the socket times out after 20s), retries, and only then does the next thread get in, and so forth in a serial manner.
>
> Workaround:
> -------------------
> The default ipc.socket.timeout is 20s. We dropped it to a low number (1000 ms, 100 ms, etc.) in the client-side hbase-site.xml. With this setting, the client threads recovered in a couple of minutes by failing fast and re-discovering the .META. table on a reassigned RS.
>
> Assumption: This ipc.socket.timeout is only ever used during the initial "HConnection" setup via NetUtils.connect and should only come into play when connectivity to a region server is lost and needs to be re-established, i.e. it does not affect normal "RPC" activity, as it is just the connect timeout.
>
> During RS GC periods, any _new_ clients trying to connect will fail and will require .META. table re-lookups.
> This above timeout workaround is only for the HBase client side.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
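For reference, a minimal sketch of the client-side workaround described in the issue above, assuming the 0.9x-era HBase client API (HBaseConfiguration / HTable). The table name, row key and the exact timeout value are illustrative only; the same effect can be achieved by setting ipc.socket.timeout in the client's hbase-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class LowConnectTimeoutClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Default is 20000 ms; the report used values such as 1000 ms or 100 ms
    // so that a connect attempt to a dead region server fails fast.
    conf.setInt("ipc.socket.timeout", 1000);

    // "usertable" and "row1" are hypothetical names for illustration.
    HTable table = new HTable(conf, "usertable");
    try {
      Result r = table.get(new Get(Bytes.toBytes("row1")));
      System.out.println("Got row: " + r);
    } finally {
      table.close();
    }
  }
}

Note this only shortens the TCP connect timeout; it does not change the normal RPC timeout, which matches the assumption stated in the issue description.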