[jira] [Commented] (HBASE-28422) SplitWalProcedure will attempt SplitWalRemoteProcedure on the same target RegionServer indefinitely
[ https://issues.apache.org/jira/browse/HBASE-28422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824014#comment-17824014 ]

Duo Zhang commented on HBASE-28422:
---

{quote}
Might as well be a good opportunity to refactor isSaslError() as a global static utility, available for use to anyone.
{quote}
+1.

> SplitWalProcedure will attempt SplitWalRemoteProcedure on the same target RegionServer indefinitely
> ---
>
>                 Key: HBASE-28422
>                 URL: https://issues.apache.org/jira/browse/HBASE-28422
>             Project: HBase
>          Issue Type: Bug
>          Components: master, proc-v2, wal
>    Affects Versions: 2.5.5
>            Reporter: David Manning
>            Priority: Minor
>
> Similar to HBASE-28050. If HMaster selects a RegionServer for
> SplitWalRemoteProcedure, it will retry this server as long as the server is
> alive. I believe this is because even though
> {{RSProcedureDispatcher.ExecuteProceduresRemoteCall.run}} calls
> {{remoteCallFailed}}, there is no logic after this to select a new target
> server. For {{TransitRegionStateProcedure}} there is logic to select a new
> server for opening a region, using {{forceNewPlan}}. But
> SplitWalRemoteProcedure only has logic to try another server if we receive a
> {{DoNotRetryIOException}} in SplitWALRemoteProcedure#complete:
> [https://github.com/apache/hbase/blob/780ff56b3f23e7041ef1b705b7d3d0a53fdd05ae/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/SplitWALRemoteProcedure.java#L104-L110]
> If we receive any other IOException, we will just retry the target server
> forever. Just like in HBASE-28050, if there is a SaslException, this will
> never lead to retrying a SplitWalRemoteProcedure on a new server, which can
> lead to ServerCrashProcedure never finishing until the target server for
> SplitWalRemoteProcedure is restarted. The following log is seen repeatedly,
> always sending to the same host.
> {code:java}
> 2024-01-31 15:59:43,616 WARN [RSProcedureDispatcher-pool-72846] procedure.SplitWALRemoteProcedure - Failed split of hdfs:///hbase/WALs/,1704984571464-splitting/1704984571464.1706710908543, retry...
> java.io.IOException: Call to address= failed on local exception: java.io.IOException: Can not send request because relogin is in progress.
>     at sun.reflect.GeneratedConstructorAccessor363.newInstance(Unknown Source)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>     at org.apache.hadoop.hbase.ipc.IPCUtil.wrapException(IPCUtil.java:239)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:391)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient.access$100(AbstractRpcClient.java:92)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:425)
>     at org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:420)
>     at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:114)
>     at org.apache.hadoop.hbase.ipc.Call.setException(Call.java:129)
>     at org.apache.hadoop.hbase.ipc.NettyRpcConnection.lambda$sendRequest$4(NettyRpcConnection.java:365)
>     at org.apache.hbase.thirdparty.io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:174)
>     at org.apache.hbase.thirdparty.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:167)
>     at org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
>     at org.apache.hbase.thirdparty.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:403)
>     at org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
>     at org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>     at org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>     at java.lang.Thread.run(Thread.java:750)
> Caused by: java.io.IOException: Can not send request because relogin is in progress.
>     at org.apache.hadoop.hbase.ipc.NettyRpcConnection.sendRequest0(NettyRpcConnection.java:321)
>     at org.apache.hadoop.hbase.ipc.NettyRpcConnection.lambda$sendRequest$4(NettyRpcConnection.java:363)
>     ... 8 more
> {code}

--
This message was sent by Atlassian Jira (v8.20.10#820010)
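
To make the suggested refactoring concrete, here is a minimal sketch of what a shared static isSaslError() helper might look like. It is not the HBase implementation: the class name SaslErrors, the method signature, the depth limit, and the choice to match on the "relogin is in progress" text from the log above are all assumptions made for illustration.

```java
import java.io.IOException;
import javax.security.sasl.SaslException;

/**
 * Hypothetical shared utility. The class and method names are illustrative
 * and are not taken from the HBase code base.
 */
final class SaslErrors {

  // Assumption: message fragment copied from the log in the issue description.
  static final String RELOGIN_IN_PROGRESS = "relogin is in progress";

  // Assumption: arbitrary bound so a cyclic cause chain cannot loop forever.
  private static final int MAX_CAUSE_DEPTH = 32;

  private SaslErrors() {
    // static helpers only
  }

  /**
   * Returns true if the throwable, or any throwable in its cause chain, is a
   * SaslException or carries the relogin message.
   */
  static boolean isSaslError(Throwable t) {
    Throwable current = t;
    for (int depth = 0; current != null && depth < MAX_CAUSE_DEPTH; depth++) {
      if (current instanceof SaslException) {
        return true;
      }
      String message = current.getMessage();
      if (message != null && message.contains(RELOGIN_IN_PROGRESS)) {
        return true;
      }
      current = current.getCause();
    }
    return false;
  }

  /** Builds an exception shaped like the one in the log, for demonstration only. */
  static IOException sampleReloginFailure() {
    IOException cause =
        new IOException("Can not send request because relogin is in progress.");
    return new IOException("Call failed on local exception: " + cause, cause);
  }
}
```

A caller that currently retries the same RegionServer on every IOException could consult such a helper and choose a different target when it returns true, but where that check would live in the procedure code is left open here.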
[jira] [Commented] (HBASE-28422) SplitWalProcedure will attempt SplitWalRemoteProcedure on the same target RegionServer indefinitely
[ https://issues.apache.org/jira/browse/HBASE-28422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17823811#comment-17823811 ]

Viraj Jasani commented on HBASE-28422:
---

Might as well be a good opportunity to refactor _isSaslError()_ as a global static utility, available for use to anyone.