David Manning created HBASE-28422:
-------------------------------------
Summary: SplitWalProcedure will attempt SplitWalRemoteProcedure on
the same target RegionServer indefinitely
Key: HBASE-28422
URL: https://issues.apache.org/jira/browse/HBASE-28422
Project: HBase
Issue Type: Bug
Components: master, proc-v2, wal
Affects Versions: 2.5.5
Reporter: David Manning
Similar to HBASE-28050. Once HMaster selects a RegionServer for a
SplitWALRemoteProcedure, it keeps retrying that same server for as long as the
server is alive. I believe this is because, even though
{{RSProcedureDispatcher.ExecuteProceduresRemoteCall.run}} calls
{{remoteCallFailed}}, there is no logic after that call to select a new target
server.

For {{TransitRegionStateProcedure}} there is logic to select a new server for
opening a region, using {{forceNewPlan}}. But SplitWALRemoteProcedure only
tries another server if it receives a {{DoNotRetryIOException}} in
SplitWALRemoteProcedure#complete:
[https://github.com/apache/hbase/blob/780ff56b3f23e7041ef1b705b7d3d0a53fdd05ae/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/SplitWALRemoteProcedure.java#L104-L110]

If we receive any other IOException, we just retry the target server forever.
As in HBASE-28050, a SaslException will never lead to retrying the
SplitWALRemoteProcedure on a new server, which can leave the
ServerCrashProcedure unable to finish until the target server for the
SplitWALRemoteProcedure is restarted. The following log is seen repeatedly,
always for a call sent to the same host.
{code:java}
2024-01-31 15:59:43,616 WARN [RSProcedureDispatcher-pool-72846] procedure.SplitWALRemoteProcedure - Failed split of hdfs://<ns>/hbase/WALs/<host>,1704984571464-splitting/<host>1704984571464.1706710908543, retry...
java.io.IOException: Call to address=<host> failed on local exception: java.io.IOException: Can not send request because relogin is in progress.
    at sun.reflect.GeneratedConstructorAccessor363.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:239)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:92)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:425)
    at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:420)
    at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:114)
    at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:129)
    at org.apache.hadoop.hbase.ipc.NettyRpcConnection.lambda$sendRequest$4(NettyRpcConnection.java:365)
    at org.apache.hbase.thirdparty.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
    at org.apache.hbase.thirdparty.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
    at org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
    at org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:403)
    at org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.IOException: Can not send request because relogin is in progress.
    at org.apache.hadoop.hbase.ipc.NettyRpcConnection.sendRequest0(NettyRpcConnection.java:321)
    at org.apache.hadoop.hbase.ipc.NettyRpcConnection.lambda$sendRequest$4(NettyRpcConnection.java:363)
    ... 8 more
{code}
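
To make the retry behaviour described in this report easier to reason about, here is a minimal, self-contained toy model. It is not HBase code: {{Worker}}, {{DoNotRetryException}}, {{split}} and both limit parameters are hypothetical names invented for this sketch. With {{maxAttemptsPerWorker <= 0}} it mimics the reported behaviour, where only a do-not-retry failure moves on to another server. A positive value illustrates one possible mitigation (assumed here, not taken from any patch): give up on a server after a bounded number of ordinary IOExceptions.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model of the retry loop; none of these types are real HBase classes. */
public class SplitWalRetrySketch {

  /** Hypothetical stand-in for HBase's DoNotRetryIOException. */
  static class DoNotRetryException extends IOException {
    DoNotRetryException(String msg) { super(msg); }
  }

  /** Hypothetical worker: throws the given failure on every call, or succeeds if it is null. */
  static final class Worker {
    final String name;
    final IOException failure;

    Worker(String name, IOException failure) {
      this.name = name;
      this.failure = failure;
    }

    void splitWal(String wal) throws IOException {
      if (failure != null) {
        throw failure;
      }
    }
  }

  /**
   * Tries to split the WAL and returns the names of the workers tried, in order.
   * maxAttemptsPerWorker <= 0: only DoNotRetryException selects the next worker.
   * maxAttemptsPerWorker > 0: any IOException counts, and the next worker is
   * selected after that many consecutive failures on the current one.
   * maxTotalAttempts bounds the loop so the sketch always terminates.
   */
  static List<String> split(List<Worker> workers, String wal,
      int maxAttemptsPerWorker, int maxTotalAttempts) {
    List<String> tried = new ArrayList<>();
    int index = 0;
    int failuresOnCurrent = 0;
    while (tried.size() < maxTotalAttempts && index < workers.size()) {
      Worker worker = workers.get(index);
      tried.add(worker.name);
      try {
        worker.splitWal(wal);
        return tried; // split succeeded
      } catch (DoNotRetryException e) {
        // The one case in which the reported code path picks another server.
        index++;
        failuresOnCurrent = 0;
      } catch (IOException e) {
        // Any other IOException: same server again, unless a bound is configured.
        failuresOnCurrent++;
        if (maxAttemptsPerWorker > 0 && failuresOnCurrent >= maxAttemptsPerWorker) {
          index++;
          failuresOnCurrent = 0;
        }
      }
    }
    return tried;
  }

  public static void main(String[] args) {
    List<Worker> workers = Arrays.asList(
        new Worker("rs1", new IOException("relogin is in progress")),
        new Worker("rs2", null));
    // Reported behaviour: rs2 is never tried.
    System.out.println(split(workers, "wal-1", 0, 5)); // [rs1, rs1, rs1, rs1, rs1]
    // Bounded retries per worker: rs2 is tried after three failures on rs1.
    System.out.println(split(workers, "wal-1", 3, 5)); // [rs1, rs1, rs1, rs2]
  }
}
```

In the real master the unbounded case has no {{maxTotalAttempts}} cap, which is why the ServerCrashProcedure can stay stuck. Whether a per-server bound is the right fix, or whether specific exception types should be treated like {{DoNotRetryIOException}}, is left open here.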