[ https://issues.apache.org/jira/browse/HBASE-24564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bharath Vissapragada updated HBASE-24564:
-----------------------------------------
    Description:
We noticed this in our deployment based on branch-1, but it affects other branches too.

1. abort() is not idempotent. There can be multiple aborts that can unnecessarily complicate the state machine. Following is the timeline of actions.

- HMaster detected that the RS lost its ZK session and started the SCP. This was caused by ZK flakiness.

{noformat}
2020-06-11 01:08:39,110 DEBUG [ProcedureExecutor-34] master.DeadServer - Started processing foo,60020,1591683150711; numProcessing=2
2020-06-11 01:08:39,110 INFO [ProcedureExecutor-34] procedure.ServerCrashProcedure - Start processing crashed foo,60020,1591683150711
{noformat}

- RS wakes up, attempts to report to the master, and receives a YouAreDead... This triggers an abort.

{noformat}
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException): org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing foo,60020,1591683150711 as dead server
	at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:438)
	at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:343)
	at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:359)
	at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:8617)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2421)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:311)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:291)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:390)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:94)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:413)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:409)
	at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
	at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
	at org.apache.hadoop.hbase.ipc.BlockingRpcConnection.readResponse(BlockingRpcConnection.java:600)
	at org.apache.hadoop.hbase.ipc.BlockingRpcConnection.run(BlockingRpcConnection.java:334)
	... 1 more
{noformat}

- After a few seconds, the RS also realizes that it lost the ZK session and initiates a second abort.

{noformat}
2020-06-11 01:08:50,321 FATAL [main-EventThread] regionserver.HRegionServer - ABORTING region server foo,60020,1591683150711: regionserver:60020-0x1725cd18ff3c55f, quorum=foo:2181,bar:2181,baz:2181 baseZNode=/hbase regionserver:60020-0x1725cd18ff3c55f received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
	at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:697)
	at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:629)
	at org.apache.hadoop.hbase.zookeeper.PendingWatcher.process(PendingWatcher.java:40)
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:544)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:519)
{noformat}

Overall, there were two sequences of aborts running at the same time. This can be avoided by making abort idempotent.

-2. Abort timeout task doesn't init as expected.- (edited, see comments)
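As a quick illustration of item 1, here is a minimal, self-contained sketch of one way to make abort() idempotent with a compare-and-set guard. The class, field, and message names are hypothetical (this is not the actual patch); the point is that two concurrent triggers, such as the YouAreDeadException response and the ZK session expiry above, run the shutdown sequence exactly once.

{noformat}
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch, not the HRegionServer code: only the first caller that
// flips the flag performs the abort work; later callers become no-ops.
public class IdempotentAbortSketch {

  private final AtomicBoolean abortRequested = new AtomicBoolean(false);

  public void abort(String reason, Throwable cause) {
    // compareAndSet(false, true) succeeds for exactly one caller.
    if (!abortRequested.compareAndSet(false, true)) {
      System.out.println("Abort already in progress, ignoring: " + reason);
      return;
    }
    System.out.println("Aborting: " + reason);
    // ... single-shot shutdown logic would run here exactly once ...
  }

  public static void main(String[] args) {
    IdempotentAbortSketch rs = new IdempotentAbortSketch();
    // Two abort triggers, as in the timeline above, result in one shutdown.
    rs.abort("Server REPORT rejected; processing as dead server", null);
    rs.abort("ZooKeeper session expired", null);
  }
}
{noformat}

The same flag also gives other threads a cheap way to check whether an abort is already in flight instead of starting a second abort sequence.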

was:
We noticed this in our deployment based on branch-1, but it affects other branches too.

1. abort() is not idempotent. There can be multiple aborts that can unnecessarily complicate the state machine. Following is the timeline of actions.

- HMaster detected that the RS lost its ZK session and started the SCP. This was caused by ZK flakiness.

{noformat}
2020-06-11 01:08:39,110 DEBUG [ProcedureExecutor-34] master.DeadServer - Started processing foo,60020,1591683150711; numProcessing=2
2020-06-11 01:08:39,110 INFO [ProcedureExecutor-34] procedure.ServerCrashProcedure - Start processing crashed foo,60020,1591683150711
{noformat}

- RS wakes up, attempts to report to the master, and receives a YouAreDead... This triggers an abort.

{noformat}
Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.YouAreDeadException): org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing foo,60020,1591683150711 as dead server
	at org.apache.hadoop.hbase.master.ServerManager.checkIsDead(ServerManager.java:438)
	at org.apache.hadoop.hbase.master.ServerManager.regionServerReport(ServerManager.java:343)
	at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:359)
	at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:8617)
	at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2421)
	at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:311)
	at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:291)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:390)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:94)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:413)
	at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:409)
	at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:103)
	at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:118)
	at org.apache.hadoop.hbase.ipc.BlockingRpcConnection.readResponse(BlockingRpcConnection.java:600)
	at org.apache.hadoop.hbase.ipc.BlockingRpcConnection.run(BlockingRpcConnection.java:334)
	... 1 more
{noformat}

- After a few seconds, the RS also realizes that it lost the ZK session and initiates a second abort.

{noformat}
2020-06-11 01:08:50,321 FATAL [main-EventThread] regionserver.HRegionServer - ABORTING region server foo,60020,1591683150711: regionserver:60020-0x1725cd18ff3c55f, quorum=foo:2181,bar:2181,baz:2181 baseZNode=/hbase regionserver:60020-0x1725cd18ff3c55f received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
	at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:697)
	at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:629)
	at org.apache.hadoop.hbase.zookeeper.PendingWatcher.process(PendingWatcher.java:40)
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:544)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:519)
{noformat}

Overall, there were two sequences of aborts running at the same time. This can be avoided by making abort idempotent.

2. Abort timeout task doesn't init as expected.
{noformat}
2020-06-11 01:08:49,960 WARN [/10.231.91.171:60020] regionserver.HRegionServer - Initialize abort timeout task failed
java.lang.IllegalAccessException: Class org.apache.hadoop.hbase.regionserver.HRegionServer can not access a member of class org.apache.hadoop.hbase.regionserver.HRegionServer$SystemExitWhenAbortTimeout with modifiers "private"
	at sun.reflect.Reflection.ensureMemberAccess(Reflection.java:102)
	at java.lang.reflect.AccessibleObject.slowCheckMemberAccess(AccessibleObject.java:296)
	at java.lang.reflect.AccessibleObject.checkAccess(AccessibleObject.java:288)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:413)
	at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1078)
{noformat}

Fix the visibility?
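For context on the IllegalAccessException above, here is a small, self-contained sketch of the failure mode and the two usual fixes: calling setAccessible(true) before newInstance, or widening the visibility of the nested class. The class names are hypothetical stand-ins, not the actual HBase classes.

{noformat}
import java.lang.reflect.Constructor;

// Stand-in for a server class with a private nested timeout task
// (hypothetical names, not the actual HBase classes).
class ServerWithPrivateTask {
  private static class AbortTimeoutTask implements Runnable {
    @Override
    public void run() {
      System.out.println("abort timed out; a real task would force a JVM exit here");
    }
  }

  static Class<?> taskClass() {
    return AbortTimeoutTask.class;
  }
}

public class ReflectionVisibilitySketch {
  public static void main(String[] args) throws Exception {
    Constructor<?> ctor = ServerWithPrivateTask.taskClass().getDeclaredConstructor();
    try {
      // The implicit constructor of a private nested class is rejected by the
      // default reflective access check, as in the stack trace above.
      ctor.newInstance();
    } catch (IllegalAccessException e) {
      System.out.println("fails like the log above: " + e);
    }
    // Fix 1: suppress the access check before instantiating.
    ctor.setAccessible(true);
    Runnable task = (Runnable) ctor.newInstance();
    task.run();
    // Fix 2 (not shown): declare the nested task class with wider visibility so
    // no setAccessible call is needed.
  }
}
{noformat}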
> Make RS abort call idempotent
> -----------------------------
>
>                 Key: HBASE-24564
>                 URL: https://issues.apache.org/jira/browse/HBASE-24564
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 3.0.0-alpha-1, 2.3.0, 1.7.0
>            Reporter: Bharath Vissapragada
>            Assignee: Bharath Vissapragada
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)