[jira] [Resolved] (HBASE-29400) RollingBatchRestartRsAction may fail to start region server

Duo Zhang (Jira) Wed, 09 Jul 2025 08:58:08 -0700


     [ https://issues.apache.org/jira/browse/HBASE-29400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Duo Zhang resolved HBASE-29400.
-------------------------------
    Fix Version/s: 2.7.0
                   3.0.0-beta-2
                   2.5.12
                   2.6.4
     Hadoop Flags: Reviewed
     Release Note: 


         Assignee: Duo Zhang
       Resolution: Fixed

Pushed to all active branches.

Thanks [~lupeng] for reviewing!

> RollingBatchRestartRsAction may fail to start region server
> -----------------------------------------------------------
>
>                 Key: HBASE-29400
>                 URL: https://issues.apache.org/jira/browse/HBASE-29400
>             Project: HBase
>          Issue Type: Improvement
>          Components: integration tests
>            Reporter: Duo Zhang
>            Assignee: Duo Zhang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.7.0, 3.0.0-beta-2, 2.5.12, 2.6.4
>
>
> {noformat}
> 2025-06-17T02:56:52,093 INFO  [ChaosMonkey-2 {}] actions.RollingBatchRestartRsAction: Killing regionserver data04,16020,1750098538006
> 2025-06-17T02:56:52,093 INFO  [ChaosMonkey-2 {}] hbase.DistributedHBaseCluster: Aborting RS: data04,16020,1750098538006
> 2025-06-17T02:56:52,093 INFO  [ChaosMonkey-2 {}] hbase.HBaseClusterManager: Executing remote command: ps ux | grep proc_regionserver | grep -v grep | tr -s ' ' | cut -d ' ' -f2 | xargs kill -s SIGKILL, hostname:data04
> 2025-06-17T02:56:52,093 INFO  [ChaosMonkey-2 {}] util.Shell: Executing full command [timeout 30 /usr/bin/ssh -i /home/zhangduo/.ssh/id_rsa_cluster  -o ConnectTimeout=10 data04 "source /etc/profile; HBASE_CONF_DIR=/data/conf/hbase/conf setsid ps ux | grep proc_regionserver | grep -v grep | tr -s ' ' | cut -d ' ' -f2 | xargs kill -s SIGKILL"]
> 2025-06-17T02:56:52,544 INFO  [ChaosMonkey-2 {}] hbase.HBaseClusterManager: Executed remote command, exit code:0 , output:
> 2025-06-17T02:56:52,544 INFO  [ChaosMonkey-2 {}] hbase.DistributedHBaseCluster: Waiting for service: regionserver to stop: data04,16020,1750098538006
> 2025-06-17T02:56:52,544 INFO  [ChaosMonkey-2 {}] hbase.HBaseClusterManager: Executing remote command: ps ux | grep proc_regionserver | grep -v grep | tr -s ' ' | cut -d ' ' -f2, hostname:data04
> 2025-06-17T02:56:52,544 INFO  [ChaosMonkey-2 {}] util.Shell: Executing full command [timeout 30 /usr/bin/ssh -i /home/zhangduo/.ssh/id_rsa_cluster  -o ConnectTimeout=10 data04 "source /etc/profile; HBASE_CONF_DIR=/data/conf/hbase/conf setsid ps ux | grep proc_regionserver | grep -v grep | tr -s ' ' | cut -d ' ' -f2"]
> 2025-06-17T02:56:52,803 INFO  [ChaosMonkey-2 {}] hbase.HBaseClusterManager: Executed remote command, exit code:0 , output:
> 2025-06-17T02:56:52,809 INFO  [ChaosMonkey-2 {}] actions.RollingBatchRestartRsAction: Killed regionserver data04,16020,1750098538006. Reported num of rs:5
> 2025-06-17T02:56:52,809 INFO  [ChaosMonkey-2 {}] actions.RollingBatchRestartRsAction: Sleeping for:354
> 2025-06-17T02:56:53,163 INFO  [ChaosMonkey-2 {}] actions.RollingBatchRestartRsAction: Starting regionserver data04:16020
> 2025-06-17T02:56:53,163 INFO  [ChaosMonkey-2 {}] hbase.DistributedHBaseCluster: Starting RS on: data04
> 2025-06-17T02:56:53,163 INFO  [ChaosMonkey-2 {}] hbase.HBaseClusterManager: Executing remote command: /home/zhangduo/packages/hbase/hbase/bin/hbase-daemon.sh  start regionserver, hostname:data04
> 2025-06-17T02:56:53,163 INFO  [ChaosMonkey-2 {}] util.Shell: Executing full command [timeout 30 /usr/bin/ssh -i /home/zhangduo/.ssh/id_rsa_cluster  -o ConnectTimeout=10 data04 "source /etc/profile; HBASE_CONF_DIR=/data/conf/hbase/conf setsid /home/zhangduo/packages/hbase/hbase/bin/hbase-daemon.sh  start regionserver"]
> 2025-06-17T02:56:53,473 INFO  [ChaosMonkey-2 {}] hbase.HBaseClusterManager: Executed remote command, exit code:0 , output:regionserver running as process 1948033. Stop it first.
> {noformat}
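In the log above, the SIGKILL command and the follow-up `ps` check both return exit code 0 with empty output, so the action treats the region server on data04 as stopped. After sleeping 354 ms, `hbase-daemon.sh start regionserver` also returns exit code 0, but its output is `regionserver running as process 1948033. Stop it first.`, meaning no new region server was started. Since the exit code alone does not reveal the failure, the Java sketch below checks the command output instead and retries the start a bounded number of times. It is a hypothetical illustration only, not the change pushed for HBASE-29400: the class `RsStartRetrySketch`, its methods, and the attempt/sleep parameters are assumptions, and the remote start command is stubbed with a `Supplier<String>` rather than HBase's own cluster-manager classes.

```java
import java.util.function.Supplier;

/**
 * Hypothetical sketch (not the HBASE-29400 patch): treat the daemon script's
 * "Stop it first." message as a failed start even when the remote exit code
 * is 0, and retry a bounded number of times.
 */
public final class RsStartRetrySketch {

  // Text taken from the hbase-daemon.sh output shown in the log above.
  static final String ALREADY_RUNNING_MARKER = "Stop it first.";

  private RsStartRetrySketch() {
  }

  /** Returns true if the start command's output reports an already-running process. */
  static boolean refusedToStart(String output) {
    return output != null && output.contains(ALREADY_RUNNING_MARKER);
  }

  /**
   * Runs the (stubbed) start command until its output no longer reports an
   * already-running process. Returns the 1-based attempt that was not refused.
   */
  static int startWithRetry(Supplier<String> startCommand, int maxAttempts, long sleepMillis)
      throws InterruptedException {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      String output = startCommand.get();
      if (!refusedToStart(output)) {
        return attempt;
      }
      // Give the killed process time to go away before trying again.
      if (attempt < maxAttempts) {
        Thread.sleep(sleepMillis);
      }
    }
    throw new IllegalStateException(
        "region server still reported as running after " + maxAttempts + " attempts");
  }

  public static void main(String[] args) throws InterruptedException {
    // Simulated outputs: the first start is refused, the second has no refusal message.
    String[] outputs = {
      "regionserver running as process 1948033. Stop it first.",
      ""
    };
    int[] calls = {0};
    int attempt = startWithRetry(() -> outputs[calls[0]++], 3, 10);
    System.out.println("started on attempt " + attempt);
  }
}
```

Running `main` prints `started on attempt 2`: the first simulated output contains the refusal message and the second does not. Matching on the script's output text is brittle if that wording changes, but in this log the output is the only signal that the start did not happen.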



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
