Haiping lv created HBASE-28023:
----------------------------------
Summary: ITBLL's RollingBatchSuspendResumeRsAction runs the
"suspendRs" method to perform the action, but it inadvertently uses the
"waitForRegionServerToStop" method to check if it was executed successfully.
Key: HBASE-28023
URL: https://issues.apache.org/jira/browse/HBASE-28023
Project: HBase
Issue Type: Bug
Affects Versions: 3.0.0-alpha-1
Reporter: Haiping lv
When running ITBLL, a problem occurs that ultimately results in all region
servers being suspended.
ITBLL was run with the following command:
{code:java}
hbase org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList -DIntegrationTestBigLinkedList.table=itbll -m slowDeterministic Loop 10 10 10000000 /tmp/biglinkedlist 100
{code}
I have summarized the process as follows:
# The RollingBatchSuspendResumeRsAction action in ITBLL executes the "sudo
-u hbase ps ux | grep proc_regionserver | grep -v grep | tr -s ' ' | cut -d ' '
-f2 | xargs kill -s SIGSTOP" command to suspend the RegionServer process.
# SIGSTOP pauses the RegionServer process rather than killing it, so the
process still exists and is still listed by ps.
# The action then calls waitForRegionServerToStop, which delegates to
waitForServiceToStop, to check whether the suspension succeeded. That check
uses the "sudo -u hbase ps ux | grep proc_regionserver | grep -v grep | tr -s
' ' | cut -d ' ' -f2" command.
# waitForServiceToStop is the wrong check for suspendRs: it waits for the
process to stop, but a suspended process is still listed, so the check times
out with the IOException shown below. As a result, ITBLL never resumes the
RegionServer process, and eventually all RegionServer processes are
suspended. Therefore, ITBLL fails to run.
{code:java}
2023-07-21 11:18:23,103 WARN [ChaosMonkey-2] policies.Policy (DoActionsOncePolicy.java:runOneIteration(51)) - Exception occurred during performing action: java.io.IOException: Timed-out waiting for service to stop: core-1-3.c-c25e3e8da545bfd2.cn-hangzhou.emr.aliyuncs.com,16020,1689908619650
    at org.apache.hadoop.hbase.DistributedHBaseCluster.waitForServiceToStop(DistributedHBaseCluster.java:282)
    at org.apache.hadoop.hbase.DistributedHBaseCluster.waitForRegionServerToStop(DistributedHBaseCluster.java:131)
    at org.apache.hadoop.hbase.chaos.actions.Action.suspendRs(Action.java:200)
    at org.apache.hadoop.hbase.chaos.actions.RollingBatchSuspendResumeRsAction.perform(RollingBatchSuspendResumeRsAction.java:97)
    at org.apache.hadoop.hbase.chaos.policies.DoActionsOncePolicy.runOneIteration(DoActionsOncePolicy.java:48)
    at org.apache.hadoop.hbase.chaos.policies.PeriodicPolicy.run(PeriodicPolicy.java:41)
    at org.apache.hadoop.hbase.chaos.policies.CompositeSequentialPolicy.run(CompositeSequentialPolicy.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
{code}
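To illustrate why a PID-only check cannot confirm a suspension, here is a minimal sketch. It assumes the usual `ps ux` column layout (USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND), where a process stopped by SIGSTOP has a STAT value starting with "T". The class SuspendCheck, its methods, and the sample rows are hypothetical and are not part of HBase.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical helper for illustration only; not part of HBase. */
public class SuspendCheck {

  // Assumed `ps ux` columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
  private static final int PID_COLUMN = 1;
  private static final int STAT_COLUMN = 7;

  private static String[] fields(String psLine) {
    return psLine.trim().split("\\s+");
  }

  /** True if the row looks like a complete RegionServer row of `ps ux` output. */
  private static boolean isRegionServerRow(String psLine) {
    return psLine.contains("proc_regionserver") && fields(psLine).length > STAT_COLUMN;
  }

  /** True if the row's STAT field starts with 'T', i.e. the process is stopped by a signal. */
  public static boolean isSuspended(String psLine) {
    String[] f = fields(psLine);
    return f.length > STAT_COLUMN && f[STAT_COLUMN].startsWith("T");
  }

  /** PIDs of all listed RegionServer rows: this is all a PID-only check can see. */
  public static List<String> listedPids(List<String> psLines) {
    return psLines.stream()
        .filter(SuspendCheck::isRegionServerRow)
        .map(line -> fields(line)[PID_COLUMN])
        .collect(Collectors.toList());
  }

  /** PIDs of RegionServer rows whose process state is "stopped". */
  public static List<String> suspendedPids(List<String> psLines) {
    return psLines.stream()
        .filter(SuspendCheck::isRegionServerRow)
        .filter(SuspendCheck::isSuspended)
        .map(line -> fields(line)[PID_COLUMN])
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Made-up sample rows, not taken from the failing cluster.
    List<String> rows = Arrays.asList(
        "hbase 4242 1.0 2.5 8123456 412345 ? Tl 11:03 0:42 java -Dproc_regionserver",
        "hbase 4243 1.0 2.5 8123456 412345 ? Sl 11:03 0:42 java -Dproc_regionserver");
    System.out.println("listed:    " + listedPids(rows));
    System.out.println("suspended: " + suspendedPids(rows));
  }
}
```

In this sketch the suspended process still appears in listedPids, which is why waiting for the PID list to become empty can only time out. One possible direction for a fix, stated as an assumption rather than as the actual patch, is for suspendRs to wait until the process state is "stopped" instead of waiting for the process to disappear.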