[ https://issues.apache.org/jira/browse/HBASE-24564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bharath Vissapragada updated HBASE-24564:
-----------------------------------------
    Description:
We noticed this in our deployment based on branch-1, but it affects other branches too.

1. abort() is not idempotent. There can be multiple aborts that can unnecessarily complicate the state machine. Following is the timeline of actions.

- HMaster detected that the RS lost its ZK session and started the SCP. This was caused by ZK flakiness.

{noformat}
2020-06-11 01:08:39,110 DEBUG [ProcedureExecutor-34] master.DeadServer - Started processing foo,60020,1591683150711; numProcessing=2
2020-06-11 01:08:39,110 INFO [ProcedureExecutor-34] procedure.ServerCrashProcedure - Start processing crashed foo,60020,1591683150711
{noformat}

- RS wakes up, attempts to report to the master, and receives a YouAreDead... This triggers an abort.

{noformat}
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException): org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing foo,60020,1591683150711 as dead server
	at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:438)
	at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:343)
	at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:359)
	at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:8617)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2421)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:311)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:291)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:390)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:94)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:413)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:409)
	at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
	at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
	at org.apache.hadoop.hbase.ipc.BlockingRpcConnection.readResponse(BlockingRpcConnection.java:600)
	at org.apache.hadoop.hbase.ipc.BlockingRpcConnection.run(BlockingRpcConnection.java:334)
	... 1 more
{noformat}

- After a few seconds, the RS also realizes that it lost the ZK session and initiates a second abort.

{noformat}
2020-06-11 01:08:50,321 FATAL [main-EventThread] regionserver.HRegionServer - ABORTING region server foo,60020,1591683150711: regionserver:60020-0x1725cd18ff3c55f, quorum=foo:2181,bar:2181,baz:2181 baseZNode=/hbase regionserver:60020-0x1725cd18ff3c55f received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
	at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:697)
	at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:629)
	at org.apache.hadoop.hbase.zookeeper.PendingWatcher.process(PendingWatcher.java:40)
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:544)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:519)
{noformat}

Overall, there were two sequences of aborts running at the same time. This can be avoided by making abort idempotent.

-2. Abort timeout task doesn't init as expected.- (edited, see comments)
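As a quick illustration of item 1, here is a minimal, self-contained sketch of one way to make abort() idempotent with a compare-and-set guard. The class, field, and message names are hypothetical (this is not the actual patch); the point is that two concurrent triggers, such as the YouAreDeadException response and the ZK session expiry above, run the shutdown sequence exactly once.

{noformat}
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch, not the HRegionServer code: only the first caller that
// flips the flag performs the abort work; later callers become no-ops.
public class IdempotentAbortSketch {

  private final AtomicBoolean abortRequested = new AtomicBoolean(false);

  public void abort(String reason, Throwable cause) {
    // compareAndSet(false, true) succeeds for exactly one caller.
    if (!abortRequested.compareAndSet(false, true)) {
      System.out.println("Abort already in progress, ignoring: " + reason);
      return;
    }
    System.out.println("Aborting: " + reason);
    // ... single-shot shutdown logic would run here exactly once ...
  }

  public static void main(String[] args) {
    IdempotentAbortSketch rs = new IdempotentAbortSketch();
    // Two abort triggers, as in the timeline above, result in one shutdown.
    rs.abort("Server REPORT rejected; processing as dead server", null);
    rs.abort("ZooKeeper session expired", null);
  }
}
{noformat}

The same flag also gives other threads a cheap way to check whether an abort is already in flight instead of starting a second abort sequence.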

was:
We noticed this in our deployment based on branch-1, but it affects other branches too.

1. abort() is not idempotent. There can be multiple aborts that can unnecessarily complicate the state machine. Following is the timeline of actions.

- HMaster detected that the RS lost its ZK session and started the SCP. This was caused by ZK flakiness.

{noformat}
2020-06-11 01:08:39,110 DEBUG [ProcedureExecutor-34] master.DeadServer - Started processing foo,60020,1591683150711; numProcessing=2
2020-06-11 01:08:39,110 INFO [ProcedureExecutor-34] procedure.ServerCrashProcedure - Start processing crashed foo,60020,1591683150711
{noformat}

- RS wakes up, attempts to report to the master, and receives a YouAreDead... This triggers an abort.

{noformat}
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException): org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing foo,60020,1591683150711 as dead server
	at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:438)
	at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:343)
	at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:359)
	at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:8617)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2421)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:311)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:291)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:390)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:94)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:413)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:409)
	at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
	at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
	at org.apache.hadoop.hbase.ipc.BlockingRpcConnection.readResponse(BlockingRpcConnection.java:600)
	at org.apache.hadoop.hbase.ipc.BlockingRpcConnection.run(BlockingRpcConnection.java:334)
	... 1 more
{noformat}

- After a few seconds, the RS also realizes that it lost the ZK session and initiates a second abort.

{noformat}
2020-06-11 01:08:50,321 FATAL [main-EventThread] regionserver.HRegionServer - ABORTING region server foo,60020,1591683150711: regionserver:60020-0x1725cd18ff3c55f, quorum=foo:2181,bar:2181,baz:2181 baseZNode=/hbase regionserver:60020-0x1725cd18ff3c55f received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
	at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:697)
	at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:629)
	at org.apache.hadoop.hbase.zookeeper.PendingWatcher.process(PendingWatcher.java:40)
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:544)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:519)
{noformat}

Overall, there were two sequences of aborts running at the same time. This can be avoided by making abort idempotent.

2. Abort timeout task doesn't init as expected.
{noformat}
2020-06-11 01:08:49,960 WARN [/10.231.91.171:60020] regionserver.HRegionServer - Initialize abort timeout task failed
java.lang.IllegalAccessException: Class org.apache.hadoop.hbase.regionserver.HRegionServer can not access a member of class org.apache.hadoop.hbase.regionserver.HRegionServer$SystemExitWhenAbortTimeout with modifiers "private"
	at sun.reflect.Reflection.ensureMemberAccess(Reflection.java:102)
	at java.lang.reflect.AccessibleObject.slowCheckMemberAccess(AccessibleObject.java:296)
	at java.lang.reflect.AccessibleObject.checkAccess(AccessibleObject.java:288)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:413)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1078)
{noformat}

Fix the visibility?
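For context on the IllegalAccessException above, here is a small, self-contained sketch of the failure mode and the two usual fixes: calling setAccessible(true) before newInstance, or widening the visibility of the nested class. The class names are hypothetical stand-ins, not the actual HBase classes.

{noformat}
import java.lang.reflect.Constructor;

// Stand-in for a server class with a private nested timeout task
// (hypothetical names, not the actual HBase classes).
class ServerWithPrivateTask {
  private static class AbortTimeoutTask implements Runnable {
    @Override
    public void run() {
      System.out.println("abort timed out; a real task would force a JVM exit here");
    }
  }

  static Class<?> taskClass() {
    return AbortTimeoutTask.class;
  }
}

public class ReflectionVisibilitySketch {
  public static void main(String[] args) throws Exception {
    Constructor<?> ctor = ServerWithPrivateTask.taskClass().getDeclaredConstructor();
    try {
      // The implicit constructor of a private nested class is rejected by the
      // default reflective access check, as in the stack trace above.
      ctor.newInstance();
    } catch (IllegalAccessException e) {
      System.out.println("fails like the log above: " + e);
    }
    // Fix 1: suppress the access check before instantiating.
    ctor.setAccessible(true);
    Runnable task = (Runnable) ctor.newInstance();
    task.run();
    // Fix 2 (not shown): declare the nested task class with wider visibility so
    // no setAccessible call is needed.
  }
}
{noformat}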
> Make RS abort call idempotent
> -----------------------------
>
>                 Key: HBASE-24564
>                 URL: https://issues.apache.org/jira/browse/HBASE-24564
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 3.0.0-alpha-1, 2.3.0, 1.7.0
>            Reporter: Bharath Vissapragada
>            Assignee: Bharath Vissapragada
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)