[
https://issues.apache.org/jira/browse/HBASE-28951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Viraj Jasani reopened HBASE-28951:
----------------------------------
Re-opening for backports
> Handle simultaneous WAL splitting to recovered edits by multiple worker
> -----------------------------------------------------------------------
>
> Key: HBASE-28951
> URL: https://issues.apache.org/jira/browse/HBASE-28951
> Project: HBase
> Issue Type: Bug
> Affects Versions: 2.5.8
> Reporter: Umesh Kumar Kumawat
> Assignee: Umesh Kumar Kumawat
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.0.0-beta-2
>
>
> When a worker RS is aborted after the SplitWALRemoteProcedure has been
> dispatched, RegionServerTracker takes care of it and [aborts the pending
> operation|https://github.com/apache/hbase/blob/rel/2.5.8/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/RSProcedureDispatcher.java#L160]
> on the aborting region server as part of
> [expireServer|https://github.com/apache/hbase/blob/rel/2.5.8/hbase-server/src/main/java/org/apache/hadoop/hbase/master/RegionServerTracker.java#L172].
>
> This helps the parent procedure, SplitWALProcedure, choose another worker
> RS, but the aborting RS is still splitting the WAL. While creating the
> recovered edits, both workers try to write the same file: the RS that starts
> writing the file later deletes the existing one, which causes failures on the
> other.
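> The race above can be illustrated with a toy sketch. This is not HDFS or
> HBase code; {{ToyNameNode}} and its methods are hypothetical stand-ins that
> only model the inode check a NameNode performs on each block allocation,
> which is what makes the first writer's stream fail once the second writer
> deletes and re-creates the .temp file.

```java
import java.io.FileNotFoundException;
import java.util.HashMap;
import java.util.Map;

public class RecoveredEditsRace {
    // Hypothetical stand-in for the NameNode's view of a file: each create()
    // assigns a fresh inode id, and write() fails if the caller's inode no
    // longer matches the current one (roughly how a stale lease surfaces).
    static class ToyNameNode {
        private final Map<String, Integer> inodes = new HashMap<>();
        private int nextInode = 1;

        synchronized int create(String path) {
            int id = nextInode++;
            inodes.put(path, id);
            return id;
        }

        synchronized void delete(String path) {
            inodes.remove(path);
        }

        synchronized void write(String path, int inode) throws FileNotFoundException {
            Integer current = inodes.get(path);
            if (current == null || current != inode) {
                throw new FileNotFoundException(
                    "File does not exist: " + path + " (inode " + inode + ")");
            }
        }
    }

    public static void main(String[] args) throws Exception {
        ToyNameNode nn = new ToyNameNode();
        String path = "recovered.edits/0000000000007468971.temp";

        int firstWorker = nn.create(path);   // rs-323 creates the .temp file
        nn.write(path, firstWorker);         // its early writes succeed

        nn.delete(path);                     // rs-283 finds an "old edits file",
        int secondWorker = nn.create(path);  // deletes it, and creates its own

        nn.write(path, secondWorker);        // rs-283's writes succeed
        try {
            nn.write(path, firstWorker);     // rs-323's next write: inode is gone
        } catch (FileNotFoundException e) {
            System.out.println("first worker failed: " + e.getMessage());
        }
    }
}
```

> Under this model, whichever worker writes last wins and the earlier writer's
> stream dies with FileNotFoundException, matching the DataStreamer errors in
> the logs below.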
> h4. Logs -
> RegionServerTracker marks the remote procedure as failed:
> {code:java}
> 2024-10-01 23:02:32,274 WARN [RegionServerTracker-0]
> procedure.SplitWALRemoteProcedure - Sent
> hdfs://hbase1a/hbase/WALs/regionserver-33X<cluster>,XXXXX,1727362162836-splitting/regionserver-33X<cluster>%2CXXXXX%2C1727362162836.1727822221172
> to wrong server regionserver-283X<cluster>,XXXXX,1727420096936, try another
> org.apache.hadoop.hbase.DoNotRetryIOException: server not online
> regionserver-283X<cluster>,XXXXX,1727420096936
> at
> org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.abortPendingOperations(RSProcedureDispatcher.java:163)
> at
> org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.abortPendingOperations(RSProcedureDispatcher.java:61)
> at
> org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher$BufferNode.abortOperationsInQueue(RemoteProcedureDispatcher.java:417)
> at
> org.apache.hadoop.hbase.procedure2.RemoteProcedureDispatcher.removeNode(RemoteProcedureDispatcher.java:201)
> at
> org.apache.hadoop.hbase.master.procedure.RSProcedureDispatcher.serverRemoved(RSProcedureDispatcher.java:176)
> at
> org.apache.hadoop.hbase.master.ServerManager.lambda$expireServer$2(ServerManager.java:576)
> at
> java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
> at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:647)
> at
> org.apache.hadoop.hbase.master.ServerManager.expireServer(ServerManager.java:576)
> at
> org.apache.hadoop.hbase.master.ServerManager.expireServer(ServerManager.java:530)
> at
> org.apache.hadoop.hbase.master.RegionServerTracker.processAsActiveMaster(RegionServerTracker.java:172)
> at
> org.apache.hadoop.hbase.master.RegionServerTracker.refresh(RegionServerTracker.java:206)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:750){code}
> {code:java}
> 2024-10-01 23:02:32,340 INFO [PEWorker-21] procedure2.ProcedureExecutor -
> Finished pid=122448609, ppid=122448595, state=SUCCESS;
> SplitWALRemoteProcedure
> regionserver-33X<cluster>,XXXXX%2C1727362162836.1727822221172,
> worker=regionserver-283X<cluster>,XXXXX,1727420096936 in 54.0500 sec{code}
> The parent SplitWALProcedure then creates another SplitWALRemoteProcedure for this WAL:
> {code:java}
> 2024-10-01 23:02:32,726 WARN [PEWorker-17] procedure.SplitWALProcedure -
> Failed to split wal
> hdfs://hbase1a/hbase/WALs/regionserver-33.regionserver.hbase.<cluster>,XXXXX,1727362162836-splitting/regionserver-33X<cluster>,XXXXX%2C1727362162836.1727822221172
> by server regionserver-283.regionserver.hbase.<cluster>,XXXXX,1727420096936,
> retry...{code}
> {code:java}
> 2024-10-01 23:02:39,414 INFO [PEWorker-28] procedure2.ProcedureExecutor -
> Initialized subprocedures=[{pid=122452821, ppid=122448595, state=RUNNABLE;
> SplitWALRemoteProcedure
> regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172,
>
> worker=regionserver-323.regionserver.hbase.<cluster>,XXXXX,1727308912906}]{code}
> Splitting is still in progress on the dying RS:
> {code:java}
> 2024-10-01 23:02:45,652 INFO
> [G_REPLAY_OPS-regionserver/regionserver-283:XXXXX-0] wal.WALSplitter -
> Splitting
> hdfs://hbase1a/hbase/WALs/regionserver-33.regionserver.hbase.<cluster>,XXXXX,1727362162836-splitting/regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172,
> size=128.1 M (134313407bytes){code}
> rs-323 creates the recovered edits file:
> {code:java}
> 2024-10-01 23:02:42,876 INFO
> [OPS-regionserver/regionserver-323:XXXXX-5-Writer-2]
> monitor.StreamSlowMonitor - New stream slow monitor
> 0000000000007468971-regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172.temp{code}
> {code:java}
> 2024-10-01 23:02:43,171 INFO
> [OPS-regionserver/regionserver-323:XXXXX-5-Writer-2]
> wal.RecoveredEditsOutputSink - Creating recovered edits writer
> path=hdfs://hbase1a/hbase/data/default/SEARCH.REPLAY_ID_BATCH_INDEX_START_INDEX/d3be13a8187ff35746fff1def4f4dba4/recovered.edits/0000000000007468971-regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172.temp{code}
> rs-283 deletes the above file and creates it again:
> {code:java}
> 2024-10-01 23:02:50,520 WARN
> [OPS-regionserver/regionserver-283:XXXXX-0-Writer-2]
> wal.RecoveredEditsOutputSink - Found old edits file. It could be the result
> of a previous failed split attempt. Deleting
> hdfs://hbase1a/hbase/data/default/SEARCH.REPLAY_ID_BATCH_INDEX_START_INDEX/d3be13a8187ff35746fff1def4f4dba4/recovered.edits/0000000000007468971-regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172.temp,
> length=0{code}
> {code:java}
> 2024-10-01 23:02:50,794 INFO
> [OPS-regionserver/regionserver-283:XXXXX-0-Writer-2]
> monitor.StreamSlowMonitor - New stream slow monitor
> 0000000000007468971-regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172.temp{code}
> {code:java}
> 2024-10-01 23:02:51,135 INFO
> [OPS-regionserver/regionserver-283:XXXXX-0-Writer-2]
> wal.RecoveredEditsOutputSink - Creating recovered edits writer
> path=hdfs://hbase1a/hbase/data/default/SEARCH.REPLAY_ID_BATCH_INDEX_START_INDEX/d3be13a8187ff35746fff1def4f4dba4/recovered.edits/0000000000007468971-regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172.temp{code}
> Now rs-323 starts failing:
> {code:java}
> 2024-10-01 23:03:02,137 WARN [Thread-1081409] hdfs.DataStreamer -
> DataStreamer Exception
> java.io.FileNotFoundException: File does not exist:
> /hbase/data/default/SEARCH.REPLAY_ID_BATCH_INDEX_START_INDEX/d3be13a8187ff35746fff1def4f4dba4/recovered.edits/0000000000007468971-regionserver-33<cluster>%2C60020%2C1727362162836.1727822221172.temp
> (inode 1440741238) [Lease. Holder: DFSClient_NONMAPREDUCE_-2039838105_1,
> pending creates: 21]
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3103)
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.analyzeFileState(FSDirWriteFileOp.java:610)
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2977)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:912)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:595)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:618)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1105)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1028)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3060)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
> at
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
> at
> org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1091)
> at
> org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1939)
> at
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForCreate(DataStreamer.java:1734)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:717)
> Caused by:
> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File
> does not exist:
> /hbase/data/default/SEARCH.REPLAY_ID_BATCH_INDEX_START_INDEX/d3be13a8187ff35746fff1def4f4dba4/recovered.edits/0000000000007468971-regionserver-33<cluster>%2C60020%2C1727362162836.1727822221172.temp
> (inode 1440741238) [Lease. Holder: DFSClient_NONMAPREDUCE_-2039838105_1,
> pending creates: 21]
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3103)
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.analyzeFileState(FSDirWriteFileOp.java:610)
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2977)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:912)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:595)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:618)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1105)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1028)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3060)
> at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1567)
> at org.apache.hadoop.ipc.Client.call(Client.java:1513)
> at org.apache.hadoop.ipc.Client.call(Client.java:1410)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:258)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:139)
> at com.sun.proxy.$Proxy18.addBlock(Unknown Source)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.lambda$addBlock$11(ClientNamenodeProtocolTranslatorPB.java:495)
> at
> org.apache.hadoop.ipc.internal.ShadedProtobufHelper.ipc(ShadedProtobufHelper.java:160)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:495)
> at sun.reflect.GeneratedMethodAccessor247.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
> at com.sun.proxy.$Proxy19.addBlock(Unknown Source)
> at sun.reflect.GeneratedMethodAccessor247.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:361)
> at com.sun.proxy.$Proxy20.addBlock(Unknown Source)
> at sun.reflect.GeneratedMethodAccessor247.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:361)
> at com.sun.proxy.$Proxy20.addBlock(Unknown Source)
> at
> org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1088)
> ... 3 more
> {code}
> {code:java}
> 2024-10-01 23:03:02,143 ERROR [split-log-closeStream-pool-1]
> wal.RecoveredEditsOutputSink - Could not close recovered edits at
> hdfs://hbase1a/hbase/data/default/SEARCH.REPLAY_ID_BATCH_INDEX_START_INDEX/d3be13a8187ff35746fff1def4f4dba4/recovered.edits/0000000000007468971-regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172.temp
> java.io.FileNotFoundException: File does not exist:
> /hbase/data/default/SEARCH.REPLAY_ID_BATCH_INDEX_START_INDEX/d3be13a8187ff35746fff1def4f4dba4/recovered.edits/0000000000007468971-regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172.temp
> (inode 1440741238) [Lease. Holder: DFSClient_NONMAPREDUCE_-2039838105_1,
> pending creates: 21]
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3103)
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.analyzeFileState(FSDirWriteFileOp.java:610)
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2977)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:912)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:595)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:618)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1105)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1028)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3060) at
> sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
> at
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
> at
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
> at
> org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1091)
> at
> org.apache.hadoop.hdfs.DataStreamer.locateFollowingBlock(DataStreamer.java:1939)
> at
> org.apache.hadoop.hdfs.DataStreamer.setupPipelineForCreate(DataStreamer.java:1734)
> at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:717)
> Caused by:
> org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File
> does not exist:
> /hbase/data/default/SEARCH.REPLAY_ID_BATCH_INDEX_START_INDEX/d3be13a8187ff35746fff1def4f4dba4/recovered.edits/0000000000007468971-regionserver-33.regionserver.hbase.<cluster>%2CXXXXX%2C1727362162836.1727822221172.temp
> (inode 1440741238) [Lease. Holder: DFSClient_NONMAPREDUCE_-2039838105_1,
> pending creates: 21]
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3103)
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.analyzeFileState(FSDirWriteFileOp.java:610)
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.validateAddBlock(FSDirWriteFileOp.java:171)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2977)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:912)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:595)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:618)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1105)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1028)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3060) at
> org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1567)
> at org.apache.hadoop.ipc.Client.call(Client.java:1513)
> at org.apache.hadoop.ipc.Client.call(Client.java:1410)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:258)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:139)
> at com.sun.proxy.$Proxy18.addBlock(Unknown Source)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.lambda$addBlock$11(ClientNamenodeProtocolTranslatorPB.java:495)
> at
> org.apache.hadoop.ipc.internal.ShadedProtobufHelper.ipc(ShadedProtobufHelper.java:160)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:495)
> at sun.reflect.GeneratedMethodAccessor247.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
> at com.sun.proxy.$Proxy19.addBlock(Unknown Source)
> at sun.reflect.GeneratedMethodAccessor247.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:361)
> at com.sun.proxy.$Proxy20.addBlock(Unknown Source)
> at sun.reflect.GeneratedMethodAccessor247.invoke(Unknown Source)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:361)
> at com.sun.proxy.$Proxy20.addBlock(Unknown Source)
> at
> org.apache.hadoop.hdfs.DFSOutputStream.addBlock(DFSOutputStream.java:1088)
> {code}
>
> One more thing worth noting: the aborting RS starts splitting a little late.
> One such case is below. rs-219 is the aborting RS and rs-216 is the second
> worker. The aborting RS started splitting about 50 seconds after the second
> worker, even though the aborting RS was the one that received the RPC request
> first.
> {code:java}
> 2024-10-01 23:02:15,499 INFO
> [G_REPLAY_OPS-regionserver/regionserver-216:XXXXX-3] wal.WALSplitter -
> Splitting
> hdfs://hbase1a/hbase/WALs/regionserver-150.regionserver.hbase.<<cluster>>,XXXXX,1727347097348-splitting/regionserver-150.regionserver.hbase.<<cluster>>%2CXXXXX%2C1727347097348.1727823118024,
> size=92 (92bytes){code}
> {code:java}
> 2024-10-01 23:03:05,793 INFO
> [G_REPLAY_OPS-regionserver/regionserver-219:XXXXX-1] wal.WALSplitter -
> Splitting
> hdfs://hbase1a/hbase/WALs/regionserver-150.regionserver.hbase.<<cluster>>,XXXXX,1727347097348-splitting/regionserver-150.regionserver.hbase.<<cluster>>%2CXXXXX%2C1727347097348.1727823118024,
> size=93.4 M (97950842bytes){code}
> {code:java}
> 2024-10-01 23:03:15,405 INFO
> [G_REPLAY_OPS-regionserver/regionserver-216:XXXXX-3] wal.WALSplitter -
> Splitting
> hdfs://hbase1a/hbase/WALs/regionserver-150.regionserver.hbase.<<cluster>>,XXXXX,1727347097348-splitting/regionserver-150.regionserver.hbase.<<cluster>>%2CXXXXX%2C1727347097348.1727823118024,
> size=93.4 M (97950842bytes){code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)