Haiping lv created HBASE-28023:
----------------------------------

             Summary: ITBLL's RollingBatchSuspendResumeRsAction runs the 
"suspendRs" method to perform the action, but it inadvertently uses the 
"waitForRegionServerToStop" method to check if it was executed successfully.
                 Key: HBASE-28023
                 URL: https://issues.apache.org/jira/browse/HBASE-28023
             Project: HBase
          Issue Type: Bug
    Affects Versions: 3.0.0-alpha-1
            Reporter: Haiping lv


When running ITBLL, a problem occurs that ultimately results in all region 
servers being suspended.

The following is the ITBLL running command:
{code:java}
hbase org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList -DIntegrationTestBigLinkedList.table=itbll -m slowDeterministic Loop 10 10 10000000 /tmp/biglinkedlist 100
{code}
I have summarized the process as follows:
 # The Action RollingBatchSuspendResumeRsAction in ITBLL executes the command "sudo -u hbase ps ux | grep proc_regionserver | grep -v grep | tr -s ' ' | cut -d ' ' -f2 | xargs kill -s SIGSTOP" to suspend the RegionServer process.
 # This command pauses the RegionServer process rather than killing it, so the process still exists.
 # The Action then calls waitForRegionServerToStop, which delegates to waitForServiceToStop, to check whether the suspend succeeded. That check runs the command "sudo -u hbase ps ux | grep proc_regionserver | grep -v grep | tr -s ' ' | cut -d ' ' -f2" and waits for the service to stop.
 # Because a suspended process still appears in the ps output, this check does not match what suspendRs does: it times out and throws an IOException, so ITBLL never resumes the RegionServer process. Eventually all RegionServer processes are left suspended and ITBLL fails to run.
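
The mismatch described in the steps above can be reproduced outside HBase. The following is a minimal sketch, assuming a Linux host with /proc; a throwaway "sleep" process stands in for the RegionServer, and the variable names are illustrative only. It shows that after SIGSTOP an existence check still finds the process, while a state check reports it as stopped.

```shell
#!/bin/sh
# Stand-in for a RegionServer process (illustrative only, not HBase code).
sleep 300 &
pid=$!

# Suspend it with the same signal the chaos action sends.
kill -s STOP "$pid"
sleep 1

# Existence check: a stopped process is still present, so this stays "yes".
if kill -0 "$pid" 2>/dev/null; then still_present=yes; else still_present=no; fi

# State check (Linux-specific): field 3 of /proc/<pid>/stat is "T" once stopped.
state=$(awk '{print $3}' "/proc/$pid/stat")

echo "still_present=$still_present state=$state"

# Resume and clean up the stand-in process.
kill -s CONT "$pid"
kill "$pid"
wait "$pid" 2>/dev/null || true
```

On a typical Linux host this should report the process as both present and in state T, which is why a check that waits for the process to disappear cannot succeed after a suspend.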

{code:java}
2023-07-21 11:18:23,103 WARN  [ChaosMonkey-2] policies.Policy (DoActionsOncePolicy.java:runOneIteration(51)) - Exception occurred during performing action: java.io.IOException: Timed-out waiting for service to stop: core-1-3.c-c25e3e8da545bfd2.cn-hangzhou.emr.aliyuncs.com,16020,1689908619650
        at org.apache.hadoop.hbase.DistributedHBaseCluster.waitForServiceToStop(DistributedHBaseCluster.java:282)
        at org.apache.hadoop.hbase.DistributedHBaseCluster.waitForRegionServerToStop(DistributedHBaseCluster.java:131)
        at org.apache.hadoop.hbase.chaos.actions.Action.suspendRs(Action.java:200)
        at org.apache.hadoop.hbase.chaos.actions.RollingBatchSuspendResumeRsAction.perform(RollingBatchSuspendResumeRsAction.java:97)
        at org.apache.hadoop.hbase.chaos.policies.DoActionsOncePolicy.runOneIteration(DoActionsOncePolicy.java:48)
        at org.apache.hadoop.hbase.chaos.policies.PeriodicPolicy.run(PeriodicPolicy.java:41)
        at org.apache.hadoop.hbase.chaos.policies.CompositeSequentialPolicy.run(CompositeSequentialPolicy.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
{code}
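
The timeout in the trace comes from waiting for the service to stop. A check that matches a suspend would instead wait for the process to reach the stopped state. The following is only a sketch of that idea in shell, assuming Linux /proc; "wait_for_stopped" is a hypothetical helper name and is not part of HBase or its eventual fix.

```shell
#!/bin/sh
# wait_for_stopped PID TIMEOUT_SECONDS  (hypothetical helper, not HBase code)
# Returns 0 once the process reports state "T" (stopped),
# 1 on timeout or if the process no longer exists.
wait_for_stopped() {
  pid=$1
  remaining=$2
  while [ "$remaining" -gt 0 ]; do
    # Linux-specific: field 3 of /proc/<pid>/stat is the process state.
    # Assumes the command name contains no spaces (true for "java" and "sleep").
    state=$(awk '{print $3}' "/proc/$pid/stat" 2>/dev/null) || return 1
    if [ "$state" = "T" ]; then
      return 0
    fi
    sleep 1
    remaining=$((remaining - 1))
  done
  return 1
}

# Usage with a throwaway process standing in for a RegionServer.
sleep 300 &
target=$!
kill -s STOP "$target"
if wait_for_stopped "$target" 5; then result=suspended; else result=timeout; fi
echo "$result"

# Resume and clean up.
kill -s CONT "$target"
kill "$target"
wait "$target" 2>/dev/null || true
```

The point of the design is that success is defined by the process state rather than by its absence, so a suspended but still-listed process counts as a successful suspend.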



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
