[ 
https://issues.apache.org/jira/browse/HBASE-21093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16588866#comment-16588866
 ] 

Duo Zhang commented on HBASE-21093:
-----------------------------------

Seems the difference is how fast we process sub procedures...

For a successful run, the PEWorker will switch procedures very quickly, but the 
failing run, we can see this

{noformat}
2018-08-22 08:21:06,668 INFO  [PEWorker-1] 
procedure.MasterProcedureScheduler(689): pid=576, ppid=75, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
TransitRegionStateProcedure table=TestMRegions, 
region=42b8c3f44612af35705d88fdacd3b9c1, ASSIGN checking lock on 
42b8c3f44612af35705d88fdacd3b9c1
2018-08-22 08:21:06,745 INFO  [PEWorker-1] 
assignment.TransitRegionStateProcedure(159): Starting pid=576, ppid=75, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
TransitRegionStateProcedure table=TestMRegions, 
region=42b8c3f44612af35705d88fdacd3b9c1, ASSIGN; rit=OFFLINE, 
location=asf916.gq1.ygridcore.net,41001,1534925965205; forceNewPlan=false, 
retain=false
2018-08-22 08:21:06,860 INFO  [PEWorker-1] 
procedure.MasterProcedureScheduler(689): pid=575, ppid=75, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
TransitRegionStateProcedure table=TestMRegions, 
region=d0f9c85da53b9000b636ca5cc613ad26, ASSIGN checking lock on 
d0f9c85da53b9000b636ca5cc613ad26
2018-08-22 08:21:06,944 INFO  [PEWorker-1] 
assignment.TransitRegionStateProcedure(159): Starting pid=575, ppid=75, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
TransitRegionStateProcedure table=TestMRegions, 
region=d0f9c85da53b9000b636ca5cc613ad26, ASSIGN; rit=OFFLINE, 
location=asf916.gq1.ygridcore.net,41001,1534925965205; forceNewPlan=false, 
retain=false
2018-08-22 08:21:07,045 INFO  [PEWorker-1] assignment.RegionStateStore(199): 
pid=576 updating hbase:meta row=42b8c3f44612af35705d88fdacd3b9c1, 
regionState=OPENING, regionLocation=asf916.gq1.ygridcore.net,41001,1534925965205
2018-08-22 08:21:07,050 INFO  [PEWorker-1] procedure2.ProcedureExecutor(1612): 
Initialized subprocedures=[{pid=577, ppid=576, state=RUNNABLE, hasLock=false; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
2018-08-22 08:21:07,362 INFO  [PEWorker-1] assignment.RegionStateStore(199): 
pid=575 updating hbase:meta row=d0f9c85da53b9000b636ca5cc613ad26, 
regionState=OPENING, regionLocation=asf916.gq1.ygridcore.net,41001,1534925965205
2018-08-22 08:21:07,366 INFO  [PEWorker-1] procedure2.ProcedureExecutor(1612): 
Initialized subprocedures=[{pid=578, ppid=575, state=RUNNABLE, hasLock=false; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
2018-08-22 08:21:07,751 INFO  [PEWorker-1] 
procedure.MasterProcedureScheduler(689): pid=574, ppid=75, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
TransitRegionStateProcedure table=TestMRegions, 
region=e23509e402adee724763cf4f48cd86a5, ASSIGN checking lock on 
e23509e402adee724763cf4f48cd86a5
2018-08-22 08:21:07,834 INFO  [PEWorker-1] 
assignment.TransitRegionStateProcedure(159): Starting pid=574, ppid=75, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
TransitRegionStateProcedure table=TestMRegions, 
region=e23509e402adee724763cf4f48cd86a5, ASSIGN; rit=OFFLINE, 
location=asf916.gq1.ygridcore.net,41001,1534925965205; forceNewPlan=false, 
retain=false
2018-08-22 08:21:07,918 INFO  [PEWorker-1] 
procedure.MasterProcedureScheduler(689): pid=573, ppid=75, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
TransitRegionStateProcedure table=TestMRegions, 
region=2c050ee558571b09362e3c63f9b75dfb, ASSIGN checking lock on 
2c050ee558571b09362e3c63f9b75dfb
2018-08-22 08:21:08,001 INFO  [PEWorker-1] 
assignment.TransitRegionStateProcedure(159): Starting pid=573, ppid=75, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; 
TransitRegionStateProcedure table=TestMRegions, 
region=2c050ee558571b09362e3c63f9b75dfb, ASSIGN; rit=OFFLINE, 
location=asf916.gq1.ygridcore.net,41001,1534925965205; forceNewPlan=false, 
retain=false
2018-08-22 08:21:08,084 INFO  [PEWorker-1] assignment.RegionStateStore(199): 
pid=574 updating hbase:meta row=e23509e402adee724763cf4f48cd86a5, 
regionState=OPENING, regionLocation=asf916.gq1.ygridcore.net,41001,1534925965205
2018-08-22 08:21:08,088 INFO  [PEWorker-1] procedure2.ProcedureExecutor(1612): 
Initialized subprocedures=[{pid=579, ppid=574, state=RUNNABLE, hasLock=false; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
2018-08-22 08:21:08,427 INFO  [PEWorker-1] assignment.RegionStateStore(199): 
pid=573 updating hbase:meta row=2c050ee558571b09362e3c63f9b75dfb, 
regionState=OPENING, regionLocation=asf916.gq1.ygridcore.net,41001,1534925965205
2018-08-22 08:21:08,431 INFO  [PEWorker-1] procedure2.ProcedureExecutor(1612): 
Initialized subprocedures=[{pid=580, ppid=573, state=RUNNABLE, hasLock=false; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
2018-08-22 08:21:08,787 INFO  [PEWorker-1] 
procedure.MasterProcedureScheduler(689): pid=572, ppid=75, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; 
TransitRegionStateProcedure table=TestMRegions, 
region=eb540609fcb5e81d2965eb05b1d9cc14, ASSIGN checking lock on 
eb540609fcb5e81d2965eb05b1d9cc14
{noformat}

Every time after the 'Initialized subprocedures' log, there will be a 300-400ms 
delay. So the problem is that, the procedure store runs too slow? This could 
also explain that why the old implementation can perform better, as we have 
less sub procedures...

Anyway, let me move this test to a separated file, with more PEWorkers... This 
is not what we want to test here, using a single thread to assign 500 regions...

> Increase the dispatch delay for testing DDL procedures
> ------------------------------------------------------
>
>                 Key: HBASE-21093
>                 URL: https://issues.apache.org/jira/browse/HBASE-21093
>             Project: HBase
>          Issue Type: Sub-task
>          Components: test
>            Reporter: Duo Zhang
>            Assignee: Duo Zhang
>            Priority: Major
>         Attachments: HBASE-21093.patch
>
>
> In TestTableDDLProcedureBase we set the procedure worker number to 1, so on a 
> slow machine, we will fail to batch the remote calls to RS with the default 
> dispatch delay, and lead to a very long time to finish the bunch of sub TRSPs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to