[ https://issues.apache.org/jira/browse/HBASE-21093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16588866#comment-16588866 ]
Duo Zhang commented on HBASE-21093: ----------------------------------- Seems the difference is how fast we process sub procedures... For a successful run, the PEWorker will switch procedures very quickly, but the failing run, we can see this {noformat} 2018-08-22 08:21:06,668 INFO [PEWorker-1] procedure.MasterProcedureScheduler(689): pid=576, ppid=75, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; TransitRegionStateProcedure table=TestMRegions, region=42b8c3f44612af35705d88fdacd3b9c1, ASSIGN checking lock on 42b8c3f44612af35705d88fdacd3b9c1 2018-08-22 08:21:06,745 INFO [PEWorker-1] assignment.TransitRegionStateProcedure(159): Starting pid=576, ppid=75, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; TransitRegionStateProcedure table=TestMRegions, region=42b8c3f44612af35705d88fdacd3b9c1, ASSIGN; rit=OFFLINE, location=asf916.gq1.ygridcore.net,41001,1534925965205; forceNewPlan=false, retain=false 2018-08-22 08:21:06,860 INFO [PEWorker-1] procedure.MasterProcedureScheduler(689): pid=575, ppid=75, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; TransitRegionStateProcedure table=TestMRegions, region=d0f9c85da53b9000b636ca5cc613ad26, ASSIGN checking lock on d0f9c85da53b9000b636ca5cc613ad26 2018-08-22 08:21:06,944 INFO [PEWorker-1] assignment.TransitRegionStateProcedure(159): Starting pid=575, ppid=75, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; TransitRegionStateProcedure table=TestMRegions, region=d0f9c85da53b9000b636ca5cc613ad26, ASSIGN; rit=OFFLINE, location=asf916.gq1.ygridcore.net,41001,1534925965205; forceNewPlan=false, retain=false 2018-08-22 08:21:07,045 INFO [PEWorker-1] assignment.RegionStateStore(199): pid=576 updating hbase:meta row=42b8c3f44612af35705d88fdacd3b9c1, regionState=OPENING, regionLocation=asf916.gq1.ygridcore.net,41001,1534925965205 2018-08-22 08:21:07,050 INFO [PEWorker-1] procedure2.ProcedureExecutor(1612): Initialized subprocedures=[{pid=577, ppid=576, state=RUNNABLE, hasLock=false; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}] 2018-08-22 08:21:07,362 INFO [PEWorker-1] assignment.RegionStateStore(199): pid=575 updating hbase:meta row=d0f9c85da53b9000b636ca5cc613ad26, regionState=OPENING, regionLocation=asf916.gq1.ygridcore.net,41001,1534925965205 2018-08-22 08:21:07,366 INFO [PEWorker-1] procedure2.ProcedureExecutor(1612): Initialized subprocedures=[{pid=578, ppid=575, state=RUNNABLE, hasLock=false; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}] 2018-08-22 08:21:07,751 INFO [PEWorker-1] procedure.MasterProcedureScheduler(689): pid=574, ppid=75, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; TransitRegionStateProcedure table=TestMRegions, region=e23509e402adee724763cf4f48cd86a5, ASSIGN checking lock on e23509e402adee724763cf4f48cd86a5 2018-08-22 08:21:07,834 INFO [PEWorker-1] assignment.TransitRegionStateProcedure(159): Starting pid=574, ppid=75, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; TransitRegionStateProcedure table=TestMRegions, region=e23509e402adee724763cf4f48cd86a5, ASSIGN; rit=OFFLINE, location=asf916.gq1.ygridcore.net,41001,1534925965205; forceNewPlan=false, retain=false 2018-08-22 08:21:07,918 INFO [PEWorker-1] procedure.MasterProcedureScheduler(689): pid=573, ppid=75, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; TransitRegionStateProcedure table=TestMRegions, region=2c050ee558571b09362e3c63f9b75dfb, ASSIGN checking lock on 2c050ee558571b09362e3c63f9b75dfb 2018-08-22 08:21:08,001 INFO [PEWorker-1] assignment.TransitRegionStateProcedure(159): Starting pid=573, ppid=75, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=true; TransitRegionStateProcedure table=TestMRegions, region=2c050ee558571b09362e3c63f9b75dfb, ASSIGN; rit=OFFLINE, location=asf916.gq1.ygridcore.net,41001,1534925965205; forceNewPlan=false, retain=false 2018-08-22 08:21:08,084 INFO [PEWorker-1] assignment.RegionStateStore(199): pid=574 updating hbase:meta row=e23509e402adee724763cf4f48cd86a5, regionState=OPENING, regionLocation=asf916.gq1.ygridcore.net,41001,1534925965205 2018-08-22 08:21:08,088 INFO [PEWorker-1] procedure2.ProcedureExecutor(1612): Initialized subprocedures=[{pid=579, ppid=574, state=RUNNABLE, hasLock=false; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}] 2018-08-22 08:21:08,427 INFO [PEWorker-1] assignment.RegionStateStore(199): pid=573 updating hbase:meta row=2c050ee558571b09362e3c63f9b75dfb, regionState=OPENING, regionLocation=asf916.gq1.ygridcore.net,41001,1534925965205 2018-08-22 08:21:08,431 INFO [PEWorker-1] procedure2.ProcedureExecutor(1612): Initialized subprocedures=[{pid=580, ppid=573, state=RUNNABLE, hasLock=false; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}] 2018-08-22 08:21:08,787 INFO [PEWorker-1] procedure.MasterProcedureScheduler(689): pid=572, ppid=75, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, hasLock=false; TransitRegionStateProcedure table=TestMRegions, region=eb540609fcb5e81d2965eb05b1d9cc14, ASSIGN checking lock on eb540609fcb5e81d2965eb05b1d9cc14 {noformat} Every time after the 'Initialized subprocedures' log, there will be a 300-400ms delay. So the problem is that, the procedure store runs too slow? This could also explain that why the old implementation can perform better, as we have less sub procedures... Anyway, let me move this test to a separated file, with more PEWorkers... This is not what we want to test here, using a single thread to assign 500 regions... > Increase the dispatch delay for testing DDL procedures > ------------------------------------------------------ > > Key: HBASE-21093 > URL: https://issues.apache.org/jira/browse/HBASE-21093 > Project: HBase > Issue Type: Sub-task > Components: test > Reporter: Duo Zhang > Assignee: Duo Zhang > Priority: Major > Attachments: HBASE-21093.patch > > > In TestTableDDLProcedureBase we set the procedure worker number to 1, so on a > slow machine, we will fail to batch the remote calls to RS with the default > dispatch delay, and lead to a very long time to finish the bunch of sub TRSPs. -- This message was sent by Atlassian JIRA (v7.6.3#76005)