Lijin Bin created HBASE-23594:
---------------------------------

             Summary: Procedure stuck due to region happen to recorded on two 
servers.
                 Key: HBASE-23594
                 URL: https://issues.apache.org/jira/browse/HBASE-23594
             Project: HBase
          Issue Type: Bug
            Reporter: Lijin Bin


Master log:
{code}
$ grep "cf9a4ec6cd890aa6806fb313d71e2ebd" 
hbase-hbaseadmin-master-100.107.176.225.log.1
2019-12-17 11:24:03,534 DEBUG [KeepAlivePEWorker-20] 
procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false 
sharedLock=34 size=1662) to run queue because: the exclusive lock is not held 
by anyone when adding pid=193706, ppid=187614, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
2019-12-17 11:24:22,851 INFO  [KeepAlivePEWorker-17] 
procedure.MasterProcedureScheduler: Took xlock for pid=193706, ppid=187614, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
2019-12-17 11:24:22,852 INFO  [KeepAlivePEWorker-17] 
assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; 
TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; rit=OPEN, location=null; 
forceNewPlan=true, retain=false
2019-12-17 11:24:22,852 DEBUG [KeepAlivePEWorker-17] 
procedure2.RootProcedureState: Add procedure pid=193706, ppid=187614, 
state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 51669th rollback step
2019-12-17 11:24:22,858 DEBUG [master/100.107.176.225:60000] 
procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false 
sharedLock=349 size=1666) to run queue because: pid=193706, ppid=187614, 
state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
2019-12-17 11:24:22,912 INFO  [PEWorker-9] 
assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614, 
state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; openRegion rit=OPEN, 
location=100.107.176.215,60020,1576552834619; 
loc=100.107.176.215,60020,1576552834619
2019-12-17 11:24:22,912 INFO  [PEWorker-9] assignment.RegionStateStore: 
pid=193706 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd, 
regionState=OPENING, regionLocation=100.107.176.215,60020,1576552834619
2019-12-17 11:24:22,912 DEBUG [PEWorker-9] procedure2.RootProcedureState: Add 
procedure pid=193706, ppid=187614, 
state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; 
TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52115th rollback step
2019-12-17 11:24:22,918 WARN  [PEWorker-8] 
assignment.RegionRemoteProcedureBase: Can not add remote operation pid=243482, 
ppid=193706, state=RUNNABLE, locked=true; 
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure for region 
{ENCODED => cf9a4ec6cd890aa6806fb313d71e2ebd, NAME => 
'table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd.', STARTKEY 
=> 'user68694', ENDKEY => 'user68703'} to server 
100.107.176.215,60020,1576552834619, this usually because the server is alread 
dead, give up and mark the procedure as complete, the parent procedure will 
take care of this.
2019-12-17 11:24:22,921 DEBUG [PEWorker-8] procedure.MasterProcedureScheduler: 
Add TableQueue(table1w_7, xlock=false sharedLock=331 size=1664) to run queue 
because: pid=193706, ppid=187614, 
state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; 
TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
2019-12-17 11:24:22,921 INFO  [PEWorker-8] procedure2.ProcedureExecutor: 
Finished subprocedure pid=243482, resume processing parent pid=193706, 
ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, 
locked=true; TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
2019-12-17 11:24:22,921 INFO  [PEWorker-9] 
assignment.TransitRegionStateProcedure: Retry=1 of max=2147483647; pid=193706, 
ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, 
locked=true; TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; rit=OPENING, 
location=100.107.176.215,60020,1576552834619
2019-12-17 11:24:22,921 DEBUG [PEWorker-9] procedure2.RootProcedureState: Add 
procedure pid=193706, ppid=187614, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; 
TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52157th rollback step
2019-12-17 11:24:22,922 INFO  [PEWorker-9] 
assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; 
TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; rit=OPENING, location=null; 
forceNewPlan=true, retain=false
2019-12-17 11:24:22,922 DEBUG [PEWorker-9] procedure2.RootProcedureState: Add 
procedure pid=193706, ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, 
locked=true; TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52164th rollback step
2019-12-17 11:24:22,930 DEBUG [master/100.107.176.225:60000] 
procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false 
sharedLock=331 size=1661) to run queue because: pid=193706, ppid=187614, 
state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
2019-12-17 11:24:22,943 INFO  [PEWorker-14] 
assignment.TransitRegionStateProcedure: Starting pid=193706, ppid=187614, 
state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; openRegion rit=OPENING, 
location=100.107.164.90,60020,1576553001648; 
loc=100.107.164.90,60020,1576553001648
2019-12-17 11:24:22,943 INFO  [PEWorker-14] assignment.RegionStateStore: 
pid=193706 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd, 
regionState=OPENING, regionLocation=100.107.164.90,60020,1576553001648
2019-12-17 11:24:22,944 DEBUG [PEWorker-14] procedure2.RootProcedureState: Add 
procedure pid=193706, ppid=187614, 
state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; 
TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52284th rollback step
2019-12-17 11:24:23,000 INFO  [PEWorker-11] assignment.RegionStateStore: 
pid=193706 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd, 
regionState=OPEN, openSeqNum=17, 
regionLocation=100.107.164.90,60020,1576553001648
2019-12-17 11:24:23,003 DEBUG [PEWorker-11] procedure.MasterProcedureScheduler: 
Add TableQueue(table1w_7, xlock=false sharedLock=336 size=1635) to run queue 
because: pid=193706, ppid=187614, 
state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; 
TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
2019-12-17 11:24:23,003 INFO  [PEWorker-11] procedure2.ProcedureExecutor: 
Finished subprocedure pid=243523, resume processing parent pid=193706, 
ppid=187614, state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, 
locked=true; TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
2019-12-17 11:24:23,003 DEBUG [KeepAlivePEWorker-17] 
procedure2.RootProcedureState: Add procedure pid=193706, ppid=187614, 
state=SUCCESS, locked=true; TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 52624th rollback step
2019-12-17 11:24:23,005 INFO  [KeepAlivePEWorker-17] 
procedure2.ProcedureExecutor: Finished pid=193706, ppid=187614, state=SUCCESS; 
TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN in 19.7230sec
2019-12-17 11:24:28,902 DEBUG [PEWorker-7] procedure.MasterProcedureScheduler: 
Add TableQueue(table1w_7, xlock=false sharedLock=27 size=1856) to run queue 
because: the exclusive lock is not held by anyone when adding pid=258071, 
ppid=243483, state=RUNNABLE:REGION_STATE_TRANSITION_GE




2019-12-17 11:24:22,914 DEBUG [RegionServerTracker-0] 
procedure2.ProcedureExecutor: Stored pid=243483, 
state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
server=100.107.176.215,60020,1576552834619, splitWal=true, meta=false





2019-12-17 11:24:39,485 INFO  [PEWorker-10] procedure.MasterProcedureScheduler: 
Took xlock for pid=258071, ppid=243483, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE; 
TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN
2019-12-17 11:24:39,486 INFO  [PEWorker-10] 
assignment.TransitRegionStateProcedure: Starting pid=258071, ppid=243483, 
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; 
TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; rit=OPEN, location=null; 
forceNewPlan=true, retain=false
2019-12-17 11:24:39,486 DEBUG [PEWorker-10] procedure2.RootProcedureState: Add 
procedure pid=258071, ppid=243483, state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, 
locked=true; TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 68029th rollback step
2019-12-17 11:24:39,571 DEBUG [master/100.107.176.225:60000] 
procedure.MasterProcedureScheduler: Add TableQueue(table1w_7, xlock=false 
sharedLock=47 size=1847) to run queue because: pid=258071, ppid=243483, 
state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN has lock
2019-12-17 11:24:39,625 INFO  [PEWorker-16] 
assignment.TransitRegionStateProcedure: Starting pid=258071, ppid=243483, 
state=RUNNABLE:REGION_STATE_TRANSITION_OPEN, locked=true; 
TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN; openRegion rit=OPEN, 
location=100.107.164.90,60020,1576553001648; 
loc=100.107.164.90,60020,1576553001648
2019-12-17 11:24:39,625 INFO  [PEWorker-16] assignment.RegionStateStore: 
pid=258071 updating hbase:meta row=cf9a4ec6cd890aa6806fb313d71e2ebd, 
regionState=OPENING, regionLocation=100.107.164.90,60020,1576553001648
2019-12-17 11:24:39,626 DEBUG [PEWorker-16] procedure2.RootProcedureState: Add 
procedure pid=258071, ppid=243483, 
state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; 
TransitRegionStateProcedure table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd, ASSIGN as the 68773th rollback step
2019-12-17 11:26:23,217 WARN  [ProcExecTimeout] assignment.AssignmentManager: 
STUCK Region-In-Transition rit=OPENING, 
location=100.107.164.90,60020,1576553001648, table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd
2019-12-17 11:27:23,217 WARN  [ProcExecTimeout] assignment.AssignmentManager: 
STUCK Region-In-Transition rit=OPENING, 
location=100.107.164.90,60020,1576553001648, table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd
2019-12-17 11:28:06,532 INFO  [master/100.107.176.225:60000.Chore.1] 
master.HMaster: Not running balancer because 1 region(s) in transition: 
[rit=OPENING, location=100.107.164.90,60020,1576553001648, table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd]
2019-12-17 11:28:23,218 WARN  [ProcExecTimeout] assignment.AssignmentManager: 
STUCK Region-In-Transition rit=OPENING, 
location=100.107.164.90,60020,1576553001648, table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd
2019-12-17 11:29:23,218 WARN  [ProcExecTimeout] assignment.AssignmentManager: 
STUCK Region-In-Transition rit=OPENING, 
location=100.107.164.90,60020,1576553001648, table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd
2019-12-17 11:30:23,219 WARN  [ProcExecTimeout] assignment.AssignmentManager: 
STUCK Region-In-Transition rit=OPENING, 
location=100.107.164.90,60020,1576553001648, table=table1w_7, 
region=cf9a4ec6cd890aa6806fb313d71e2ebd
{code}

RegionServer log:
{code}
$ grep "cf9a4ec6cd890aa6806fb313d71e2ebd" 
hbase-hbaseadmin-regionserver-100.107.164.90.log
2019-12-17 11:24:22,972 INFO  
[RS_OPEN_REGION-regionserver/100.107.164.90:60020-1] 
handler.AssignRegionHandler: Open 
table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd.
2019-12-17 11:24:22,972 INFO  
[RS_OPEN_REGION-regionserver/100.107.164.90:60020-1] regionserver.HRegion: 
filter rows disabled with empty filtering pattern for 
region=table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd..
2019-12-17 11:24:22,983 INFO  [StoreOpener-cf9a4ec6cd890aa6806fb313d71e2ebd-1] 
hfile.CacheConfig: Created cacheConfig: cacheDataOnRead=true, 
cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, 
cacheEvictOnClose=true, cacheDataCompressed=false, prefetchOnOpen=false for 
family {NAME => 'cf', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', 
NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', 
CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'DIFF', TTL => 
'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', 
CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 
'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'SNAPPY', 
BLOCKCACHE => 'true', BLOCKSIZE => '65536'} with 
blockCache=org.apache.hadoop.hbase.io.hfile.CombinedBlockCache@1376a714
2019-12-17 11:24:22,983 INFO  [StoreOpener-cf9a4ec6cd890aa6806fb313d71e2ebd-1] 
compactions.CompactionConfiguration: size [128 MB, 2 GB, 2 GB); files [7, 25); 
ratio 1.200000; off-peak ratio 5.000000; throttle point 4294967296; major 
period 0, major jitter 0.500000, min locality to compact 0.000000; tiered 
compaction: max_age 9223372036854775807, incoming window min 6, compaction 
policy for tiered window 
org.apache.hadoop.hbase.regionserver.compactions.ExploringCompactionPolicy, 
single output for minor true, compaction window factory 
org.apache.hadoop.hbase.regionserver.compactions.ExponentialCompactionWindowFactory
2019-12-17 11:24:22,984 INFO  [StoreOpener-cf9a4ec6cd890aa6806fb313d71e2ebd-1] 
regionserver.HStore: Store=cf,  memstore type=DefaultMemStore, 
storagePolicy=HOT, verifyBulkLoads=false, parallelPutCountPrintThreshold=50, 
encoding=DIFF, compression=SNAPPY
2019-12-17 11:24:22,987 INFO  
[RS_OPEN_REGION-regionserver/100.107.164.90:60020-1] regionserver.HRegion: 
Opened cf9a4ec6cd890aa6806fb313d71e2ebd; next sequenceid=17
2019-12-17 11:24:22,987 INFO  
[RS_OPEN_REGION-regionserver/100.107.164.90:60020-1] 
regionserver.HRegionServer: Post open deploy tasks for 
table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd., 
openProcId=243523, masterSystemTime=1576553062965
2019-12-17 11:24:22,990 INFO  
[RS_OPEN_REGION-regionserver/100.107.164.90:60020-1] 
handler.AssignRegionHandler: Opened 
table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd.
2019-12-17 11:24:39,657 WARN  
[RS_OPEN_REGION-regionserver/100.107.164.90:60020-1] 
handler.AssignRegionHandler: Received OPEN for the 
region:table1w_7,user68694,1576484498244.cf9a4ec6cd890aa6806fb313d71e2ebd., 
which is already online
{code}


2019-12-17 11:23:49,598 DEBUG [RegionServerTracker-0] 
procedure2.ProcedureExecutor: Stored pid=187614, 
state=RUNNABLE:SERVER_CRASH_START; ServerCrashProcedure 
server=100.107.165.41,60020,1576552792328, splitWal=true, meta=true

Process ServerCrashProcedure(100.107.165.41,60020,1576552792328) assign region 
cf9a4ec6cd890aa6806fb313d71e2ebd,
region assign to 100.107.176.215,60020,1576552834619, but failed, so retry and 
assign to 100.107.164.90,60020,1576553001648 and open on 
100.107.164.90,60020,1576553001648 success.
Later process ServerCrashProcedure(100.107.176.215,60020,1576552834619) and 
find region cf9a4ec6cd890aa6806fb313d71e2ebd on this server also, than try 
assign this region, and assign the region to 100.107.164.90,60020,1576553001648 
also, but 100.107.164.90,60020,1576553001648 find it has open this region 
already, than the assign stuck.





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to