More procedure workers (hbase.master.procedure.threads 5 => 16) can reduce the time from 8 mins to 6 mins. But branch-2.3 used the same 5 procedure workers.....
Guanghao Zhang <zghao...@gmail.com> 于2020年10月22日周四 下午2:22写道: > And I am sure that it is not the lock problem, because there is no > "Waiting on xlock for" log. > > LOG.info("Waiting on xlock for {} held by pid={}", procedure, > regionLocks[i].getExclusiveLockProcIdOwner()); > > Guanghao Zhang <zghao...@gmail.com> 于2020年10月22日周四 下午2:12写道: > >> Run the same UT TestRegionReplicaFailover on my local PC, mvn clean test >> -Dtest=TestRegionReplicaFailover, branch-2.2 takes 8 mins but branch-2.3 >> only needs 2 mins. >> >> I found the problem is related to procedure schedule. See the below log: >> 2020-10-21 13:52:28,097 INFO [PEWorker-1] >> procedure2.ProcedureExecutor(1427): Finished pid=296, ppid=45, >> state=SUCCESS; >> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure in 1.6250sec >> 2020-10-21 13:52:28,538 INFO [PEWorker-3] >> procedure2.ProcedureExecutor(1427): Finished pid=45, ppid=20, >> state=SUCCESS; TransitRegionStateProcedure table=testLotsOfRegionRepli2, >> region=50703895da3cb8c942d3197600d549bc, ASSIGN in 59.4330sec >> >> The real assign procedure only cost 1.6 seconds but >> the TransitRegionStateProcedure cost 59.4 seconds. The pid=45 procedure was >> initialized at 2020-10-21 13:51:28,666. It was added to TableQueue >> at 2020-10-21 13:51:28,789. But took xlock to run at 2020-10-21 >> 13:52:24,761. See the below log: >> 2020-10-21 13:51:28,789 DEBUG [PEWorker-4] >> procedure.MasterProcedureScheduler(352): Add >> TableQueue(testLotsOfRegionRepli2, xlock=true (20) sharedLock=0 size=25) to >> run queue because: pid=45, ppid=20, >> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; >> TransitRegionStateProcedure table=testLotsOfRegionRepli2, >> region=50703895da3cb8c942d3197600d549bc, ASSIGN has the excusive lock access >> 2020-10-21 13:52:24,761 INFO [PEWorker-2] >> procedure.MasterProcedureScheduler(737): Took xlock for pid=45, ppid=20, >> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; >> TransitRegionStateProcedure table=testLotsOfRegionRepli2, >> region=50703895da3cb8c942d3197600d549bc, ASSIGN >> >> >> But when I tried this UT on another PC, it only cost 2 mins, which is the >> same with branch-2.3. It is weird. Any ideas about this? Thanks. >> >