[ https://issues.apache.org/jira/browse/HBASE-14498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14936474#comment-14936474 ]
Bhupendra Kumar Jain commented on HBASE-14498: ---------------------------------------------- To fix this one of the approach can be as below: Start a timer thread on receiving the Disconnected event. The timer thread will check after every 100 ms if any SyncConnected event is received. if SyncConnected event is received, then it means, the ZK connection is established again. So this thread can stop and no need of any further action. if SyncConnected is not received in specified time interval (t time) , then this thread can initiate master abort. The time interval (t) should be less than the ZK Session time out. (May be 2/3rd of session time out value ) , This is to make sure that standby HM will not become active within this time period. > Master stuck in infinite loop when all Zookeeper servers are unreachable. > ------------------------------------------------------------------------- > > Key: HBASE-14498 > URL: https://issues.apache.org/jira/browse/HBASE-14498 > Project: HBase > Issue Type: Bug > Components: master > Affects Versions: 1.0.1 > Reporter: Y. SREENIVASULU REDDY > Assignee: Pankaj Kumar > Priority: Blocker > > We met a weird scenario in our production environment. > In a HA cluster, > > Active Master (HM1) is not able to connect to any Zookeeper server (due to > > N/w breakdown on master machine network with Zookeeper servers). > {code} > 2015-09-26 15:24:47,508 INFO > [HM1-Host:16000.activeMasterManager-SendThread(ZK-Host:2181)] > zookeeper.ClientCnxn: Client session timed out, have not heard from server in > 33463ms for sessionid 0x104576b8dda0002, closing socket connection and > attempting reconnect > 2015-09-26 15:24:47,877 INFO > [HM1-Host:16000.activeMasterManager-SendThread(ZK-Host1:2181)] > client.FourLetterWordMain: connecting to ZK-Host1 2181 > 2015-09-26 15:24:48,236 INFO [main-SendThread(ZK-Host1:2181)] > client.FourLetterWordMain: connecting to ZK-Host1 2181 > 2015-09-26 15:24:49,879 WARN > [HM1-Host:16000.activeMasterManager-SendThread(ZK-Host1:2181)] > zookeeper.ClientCnxn: Can not get the principle name from server ZK-Host1 > 2015-09-26 15:24:49,879 INFO > [HM1-Host:16000.activeMasterManager-SendThread(ZK-Host1:2181)] > zookeeper.ClientCnxn: Opening socket connection to server > ZK-Host1/ZK-IP1:2181. Will not attempt to authenticate using SASL (unknown > error) > 2015-09-26 15:24:50,238 WARN [main-SendThread(ZK-Host1:2181)] > zookeeper.ClientCnxn: Can not get the principle name from server ZK-Host1 > 2015-09-26 15:24:50,238 INFO [main-SendThread(ZK-Host1:2181)] > zookeeper.ClientCnxn: Opening socket connection to server > ZK-Host1/ZK-Host1:2181. Will not attempt to authenticate using SASL (unknown > error) > 2015-09-26 15:25:17,470 INFO [main-SendThread(ZK-Host1:2181)] > zookeeper.ClientCnxn: Client session timed out, have not heard from server in > 30023ms for sessionid 0x2045762cc710006, closing socket connection and > attempting reconnect > 2015-09-26 15:25:17,571 WARN [master/HM1-Host/HM1-IP:16000] > zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, > quorum=ZK-Host:2181,ZK-Host1:2181,ZK-Host2:2181, > exception=org.apache.zookeeper.KeeperException$ConnectionLossException: > KeeperErrorCode = ConnectionLoss for /hbase/master > 2015-09-26 15:25:17,872 INFO [main-SendThread(ZK-Host:2181)] > client.FourLetterWordMain: connecting to ZK-Host 2181 > 2015-09-26 15:25:19,874 WARN [main-SendThread(ZK-Host:2181)] > zookeeper.ClientCnxn: Can not get the principle name from server ZK-Host > 2015-09-26 15:25:19,874 INFO [main-SendThread(ZK-Host:2181)] > zookeeper.ClientCnxn: Opening socket connection to server ZK-Host/ZK-IP:2181. > Will not attempt to authenticate using SASL (unknown error) > {code} > > Since HM1 was not able to connect to any ZK, so session timeout didnt > > happen at Zookeeper server side and HM1 didnt abort. > > On Zookeeper session timeout standby master (HM2) registered himself as an > > active master. > > HM2 is keep on waiting for region server to report him as part of active > > master intialization. > {noformat} > 2015-09-26 15:24:44,928 | INFO | HM2-Host:21300.activeMasterManager | Waiting > for region servers count to settle; currently checked in 0, slept for 0 ms, > expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval > of 1500 ms. | > org.apache.hadoop.hbase.master.ServerManager.waitForRegionServers(ServerManager.java:1011) > --- > --- > 2015-09-26 15:32:50,841 | INFO | HM2-Host:21300.activeMasterManager | Waiting > for region servers count to settle; currently checked in 0, slept for 483913 > ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, > interval of 1500 ms. | > org.apache.hadoop.hbase.master.ServerManager.waitForRegionServers(ServerManager.java:1011) > {noformat} > > At other end, region servers are reporting to HM1 on 3 sec interval. Here > > region server retrieve master location from zookeeper only when they > > couldn't connect to Master (ServiceException). > Region Server will not report HM2 as per current design until unless HM1 > abort,so HM2 will exit(InitializationMonitor) and again wait for region > servers in loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332)