[ https://issues.apache.org/jira/browse/HBASE-28366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834487#comment-17834487 ]
Hudson commented on HBASE-28366: -------------------------------- Results for branch branch-2.4 [build #715 on builds.a.o|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/715/]: (/) *{color:green}+1 overall{color}* ---- details (if available): (/) {color:green}+1 general checks{color} -- For more information [see general report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/715/General_20Nightly_20Build_20Report/] (/) {color:green}+1 jdk8 hadoop2 checks{color} -- For more information [see jdk8 (hadoop2) report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/715/JDK8_20Nightly_20Build_20Report_20_28Hadoop2_29/] (/) {color:green}+1 jdk8 hadoop3 checks{color} -- For more information [see jdk8 (hadoop3) report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/715/JDK8_20Nightly_20Build_20Report_20_28Hadoop3_29/] (/) {color:green}+1 jdk11 hadoop3 checks{color} -- For more information [see jdk11 report|https://ci-hbase.apache.org/job/HBase%20Nightly/job/branch-2.4/715/JDK11_20Nightly_20Build_20Report_20_28Hadoop3_29/] (/) {color:green}+1 source release artifact{color} -- See build output for details. (/) {color:green}+1 client integration test{color} > Mis-order of SCP and regionServerReport results into region inconsistencies > --------------------------------------------------------------------------- > > Key: HBASE-28366 > URL: https://issues.apache.org/jira/browse/HBASE-28366 > Project: HBase > Issue Type: Bug > Affects Versions: 2.4.17, 3.0.0-beta-1, 2.5.7 > Reporter: Viraj Jasani > Assignee: Viraj Jasani > Priority: Major > Labels: pull-request-available > Fix For: 2.6.0, 2.4.18, 3.0.0-beta-2, 2.5.9 > > > If the regionserver is online but due to network issue, if it's rs ephemeral > node gets deleted in zookeeper, active master schedules the SCP. However, if > the regionserver is alive, it can still send regionServerReport to active > master. In the case where SCP assigns regions to other regionserver that were > previously hosted on the old regionserver (which is still alive), the old rs > can continue to sent regionServerReport to active master. > Eventually this results into region inconsistencies because region is alive > on two regionservers at the same time (though it's temporary state because > the rs will be aborted soon). While old regionserver can have zookeeper > connectivity issues, it can still make rpc calls to active master. > Logs: > SCP: > {code:java} > 2024-01-29 16:50:33,956 INFO [RegionServerTracker-0] > assignment.AssignmentManager - Scheduled ServerCrashProcedure pid=9812440 for > server1-114.xyz,61020,1706541866103 (carryingMeta=false) > server1-114.xyz,61020,1706541866103/CRASHED/regionCount=364/lock=java.util.concurrent.locks.ReentrantReadWriteLock@5d5fc31[Write > locks = 1, Read locks = 0], oldState=ONLINE. > 2024-01-29 16:50:33,956 DEBUG [RegionServerTracker-0] > procedure2.ProcedureExecutor - Stored pid=9812440, > state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure > server1-114.xyz,61020,1706541866103, splitWal=true, meta=false > 2024-01-29 16:50:33,973 INFO [PEWorker-36] procedure.ServerCrashProcedure - > Splitting WALs pid=9812440, state=RUNNABLE:SERVER_CRASH_SPLIT_LOGS, > locked=true; ServerCrashProcedure server1-114.xyz,61020,1706541866103, > splitWal=true, meta=false, isMeta: false > {code} > As part of SCP, d743ace9f70d55f55ba1ecc6dc49a5cb was assigned to another > server: > > {code:java} > 2024-01-29 16:50:42,656 INFO [PEWorker-24] procedure.MasterProcedureScheduler > - Took xlock for pid=9818494, ppid=9812440, > state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; > TransitRegionStateProcedure > table=PLATFORM_ENTITY.PLATFORM_IMMUTABLE_ENTITY_DATA, > region=d743ace9f70d55f55ba1ecc6dc49a5cb, ASSIGN > 2024-01-29 16:50:43,106 INFO [PEWorker-23] assignment.RegionStateStore - > pid=9818494 updating hbase:meta row=d743ace9f70d55f55ba1ecc6dc49a5cb, > regionState=OPEN, repBarrier=12867482, openSeqNum=12867482, > regionLocation=server1-65.xyz,61020,1706165574050 > {code} > > rs abort, after ~5 min: > {code:java} > 2024-01-29 16:54:27,235 ERROR [regionserver/server1-114:61020] > regionserver.HRegionServer - ***** ABORTING region server > server1-114.xyz,61020,1706541866103: Unexpected exception handling getData > ***** > org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode > = ConnectionLoss for /hbase/master > at org.apache.zookeeper.KeeperException.create(KeeperException.java:102) > at org.apache.zookeeper.KeeperException.create(KeeperException.java:54) > at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1229) > at > org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:414) > at > org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataInternal(ZKUtil.java:403) > at > org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:367) > at > org.apache.hadoop.hbase.zookeeper.ZKNodeTracker.getData(ZKNodeTracker.java:180) > at > org.apache.hadoop.hbase.zookeeper.MasterAddressTracker.getMasterAddress(MasterAddressTracker.java:152) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.createRegionServerStatusStub(HRegionServer.java:2892) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.tryRegionServerReport(HRegionServer.java:1352) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1142) > {code} > > Several region transition failure report logs: > {code:java} > 2024-01-29 16:55:13,029 INFO [_REGION-regionserver/server1-114:61020-0] > regionserver.HRegionServer - Failed report transition server { host_name: > "server1-114.xyz" port: 61020 start_code: 1706541866103 } transition { > transition_code: CLOSED region_info { region_id: 1671555604277 table_name { > namespace: "default" qualifier: "TABLE1" } start_key: "abc" end_key: "xyz" > offline: false split: false replica_id: 0 } proc_id: -1 }; retry (#0) > immediately. > java.net.UnknownHostException: Call to address=master-server1.xyz:61000 > failed on local exception: java.net.UnknownHostException: > master-server1.xyz:61000 could not be resolved > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:239) > at > org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391) > at > org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:92) > at > org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:425) > at > org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:420) > at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:114) > at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:129) > at > org.apache.hadoop.hbase.ipc.NettyRpcConnection.lambda$sendRequest$4(NettyRpcConnection.java:365) > at > org.apache.hbase.thirdparty.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174) > at > org.apache.hbase.thirdparty.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167) > at > org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470) > at > org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:403) > at > org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) > at > org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) > at > org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) > at java.lang.Thread.run(Thread.java:750) > Caused by: java.net.UnknownHostException: master-server1.xyz:61000 could not > be resolved > at > org.apache.hadoop.hbase.ipc.RpcConnection.getRemoteInetAddress(RpcConnection.java:192) > at > org.apache.hadoop.hbase.ipc.NettyRpcConnection.connect(NettyRpcConnection.java:280) > at > org.apache.hadoop.hbase.ipc.NettyRpcConnection.access$900(NettyRpcConnection.java:79) > at > org.apache.hadoop.hbase.ipc.NettyRpcConnection$5.run(NettyRpcConnection.java:340) > at > org.apache.hadoop.hbase.ipc.HBaseRpcControllerImpl.notifyOnCancel(HBaseRpcControllerImpl.java:259) > at > org.apache.hadoop.hbase.ipc.NettyRpcConnection.sendRequest0(NettyRpcConnection.java:323) > at > org.apache.hadoop.hbase.ipc.NettyRpcConnection.lambda$sendRequest$4(NettyRpcConnection.java:363) > ... 8 more > {code} > > The server is later started after sometime: > {code:java} > Mon Jan 29 17:01:35 UTC 2024 Starting regionserver on server1-114.xyz > core file size (blocks, -c) unlimited > ... > ... {code} > > In order to avoid region inconsistencies, we should clear regionserver -> > region reported map from active master. We can do it in two ways: > # Active: master should throw error while serving regionServerReport call if > the server is already considered dead due to deletion of ephemeral znode. > # Passive: run a chore that can compare the regionserver -> region map > generated from the regionserver reports with the actual live servers based on > the zk tracking, and remove any rs report that is no longer valid. -- This message was sent by Atlassian Jira (v8.20.10#820010)