[ https://issues.apache.org/jira/browse/HBASE-24526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17149736#comment-17149736 ]

Nick Dimiduk commented on HBASE-24526:
--------------------------------------

One curiosity in the log: it looks like the master believes log splitting is 
complete before the worker has finished. The master first sees that the split 
log dir is empty, but when it tries to delete the directory, HDFS refuses 
because there are files present. From the master, I see:

{noformat}
2020-07-01 06:17:27,237 INFO 
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Start 
pid=116621, state=RUNNABLE:SERVER_CRASH_START, locked=true; 
ServerCrashProcedure server=hbasedn189.example.com,16020,1593583050478, 
splitWal=true, meta=true
...
2020-07-01 06:17:27,241 INFO 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Initialized 
subprocedures=[{pid=116622, ppid=116621, 
state=RUNNABLE:ACQUIRE_SPLIT_WAL_WORKER; 
org.apache.hadoop.hbase.master.procedure.SplitWALProcedure}, {pid=116623, 
ppid=116621, state=RUNNABLE:ACQUIRE_SPLIT_WAL_WORKER; 
org.apache.hadoop.hbase.master.procedure.SplitWALProcedure}]
2020-07-01 06:17:31,449 INFO 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished subprocedure 
pid=116623, resume processing parent pid=116621, 
state=RUNNABLE:SERVER_CRASH_DELETE_SPLIT_META_WALS_DIR, locked=true; 
ServerCrashProcedure server=hbasedn189.example.com,16020,1593583050478, 
splitWal=true, meta=true
2020-07-01 06:17:31,449 INFO 
org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Finished pid=116623, 
ppid=116621, state=SUCCESS; 
org.apache.hadoop.hbase.master.procedure.SplitWALProcedure in 4.2070 sec
2020-07-01 06:17:31,452 INFO org.apache.hadoop.hbase.master.SplitLogManager: 
hdfs://nick-itbll/hbase-itbll/WALs/hbasedn189.example.com,16020,1593583050478-splitting
 dir is empty, no logs to split.
2020-07-01 06:17:31,452 INFO org.apache.hadoop.hbase.master.SplitWALManager: 
size of WALs of hbasedn189.example.com,16020,1593583050478 is 0, isMeta: true
2020-07-01 06:17:31,454 WARN 
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: Remove WAL 
directory of server hbasedn189.example.com,16020,1593583050478 failed, ignore...
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.fs.PathIsNotEmptyDirectoryException):
 `/hbase-itbll/WALs/hbasedn189.example.com,16020,1593583050478-splitting is non 
empty': Directory is not empty
...
        at 
org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:887)
        at 
org.apache.hadoop.hbase.master.SplitWALManager.deleteWALDir(SplitWALManager.java:125)
        at 
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.cleanupSplitDir(ServerCrashProcedure.java:259)
        at 
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:168)
        at 
org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:65)
{noformat}
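
To illustrate (a sketch, not the HBase implementation): the stack trace points at 
a plain {{FileSystem#delete}} of the {{-splitting}} directory, and HDFS only raises 
{{PathIsNotEmptyDirectoryException}} for a non-recursive delete, so presumably the 
cleanup looks roughly like the following; the path and configuration here are 
placeholders.

{code:java}
// Sketch only, not HBase code. Assumes fs.defaultFS points at an HDFS cluster;
// the path below stands in for the <server>-splitting WAL directory.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.ipc.RemoteException;

public class SplittingDirDeleteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();                      // placeholder config
    FileSystem fs = FileSystem.get(conf);
    Path splittingDir = new Path("/hbase/WALs/example-splitting"); // placeholder path

    try {
      // recursive=false: the NameNode refuses to remove a non-empty directory
      // and the client sees RemoteException(PathIsNotEmptyDirectoryException),
      // matching the WARN above.
      fs.delete(splittingDir, false);
    } catch (RemoteException e) {
      // We land here if a split worker is still writing (or re-creating) files
      // under the directory, i.e. the master saw an empty dir moments earlier
      // but the delete races with the worker finishing up.
      System.out.println("delete refused: " + e.getMessage());
    }
  }
}
{code}

Which would line up with the "dir is empty, no logs to split" INFO at 
06:17:31,452 followed two milliseconds later by the failed delete.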

> Deadlock executing assign meta procedure
> ----------------------------------------
>
>                 Key: HBASE-24526
>                 URL: https://issues.apache.org/jira/browse/HBASE-24526
>             Project: HBase
>          Issue Type: Bug
>          Components: proc-v2, Region Assignment
>    Affects Versions: 2.3.0
>            Reporter: Nick Dimiduk
>            Priority: Critical
>
> I have what appears to be a deadlock while assigning meta. During recovery, 
> the master creates the assign procedure for meta and immediately marks meta as 
> assigned in ZooKeeper. It then creates the subprocedure to open meta on the 
> target region server. However, the PEWorker pool is full of procedures that are 
> stuck, I think because their calls to update meta are going nowhere (sketched 
> below). For what it's worth, the balancer is running concurrently and has 
> calculated a plan of size 41.
> From the master log,
> {noformat}
> 2020-06-06 00:34:07,314 INFO 
> org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure: 
> Starting pid=17802, ppid=17801, 
> state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; 
> TransitRegionStateProcedure table=hbase:meta, region=1588230740, ASSIGN; 
> state=OPEN, location=null; forceNewPlan=true, retain=false
> 2020-06-06 00:34:07,465 INFO 
> org.apache.hadoop.hbase.zookeeper.MetaTableLocator: Setting hbase:meta 
> (replicaId=0) location in ZooKeeper as 
> hbasedn139.example.com,16020,1591403576247
> 2020-06-06 00:34:07,466 INFO 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Initialized 
> subprocedures=[{pid=17803, ppid=17802, state=RUNNABLE; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
> {noformat}
> {{pid=17803}} is not mentioned again. hbasedn139 never receives an 
> {{openRegion}} RPC.
> Meanwhile, additional procedures are scheduled and picked up by workers, each 
> getting "stuck". I see log lines from all 16 PEWorker threads reporting that 
> they are stuck.
> {noformat}
> 2020-06-06 00:34:07,961 INFO 
> org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock 
> for pid=17804, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; 
> TransitRegionStateProcedure table=IntegrationTestBigLinkedList, 
> region=54f4f6c0e921e6d25e6043cba79c09aa, REOPEN/MOVE
> 2020-06-06 00:34:07,961 INFO 
> org.apache.hadoop.hbase.master.assignment.RegionStateStore: pid=17804 
> updating hbase:meta row=54f4f6c0e921e6d25e6043cba79c09aa, 
> regionState=CLOSING, regionLocation=hbasedn046.example.com,16020,1591402383956
> ...
> 2020-06-06 00:34:22,295 WARN 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker stuck 
> PEWorker-16(pid=17804), run time 14.3340 sec
> ...
> 2020-06-06 00:34:27,295 WARN 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Worker stuck 
> PEWorker-16(pid=17804), run time 19.3340 sec
> ...
> {noformat}
> The cluster stays in this state, with PEWorker threads stuck, for upwards of 15 
> minutes. Eventually the master starts logging
> {noformat}
> 2020-06-06 00:50:18,033 INFO 
> org.apache.hadoop.hbase.client.RpcRetryingCallerImpl: Call exception, 
> tries=30, retries=31, started=970072 ms ago, cancelled=false, msg=Call queue 
> is full on hbasedn139.example.com,16020,1591403576247, too many items queued 
> ?, details=row 
> 'IntegrationTestBigLinkedList,,1591398987965.54f4f6c0e921e6d25e6043cba79c09aa.'
>  on table 'hbase:meta' at region=hbase:meta,,1.
> 1588230740, hostname=hbasedn139.example.com,16020,1591403576247, seqNum=-1, 
> see https://s.apache.org/timeout
> {noformat}
> The master never recovers on its own.
> I'm not sure how common this condition might be. This popped up after about 20 
> total hours of running ITBLL with ServerKillingMonkey.
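
To make the reported state concrete, here is an analogy, not HBase code: every 
one of the 16 PEWorkers is occupied by a procedure that cannot finish until 
hbase:meta is writable again, while the one piece of work that would bring meta 
online never gets to run. The same shape reproduces with a plain bounded thread 
pool; the worker count comes from the log above, everything else is made up.

{code:java}
// Analogy only, not HBase code: a fixed pool where every worker blocks on a
// condition that only the one queued-but-never-run task would satisfy.
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PoolStarvationSketch {
  public static void main(String[] args) {
    final int workers = 16;                  // PEWorker count from the log
    ExecutorService pool = Executors.newFixedThreadPool(workers);
    CountDownLatch metaOnline = new CountDownLatch(1);

    // Every worker picks up a "procedure" that cannot complete until meta is
    // back online (standing in for the blocked hbase:meta updates above).
    for (int i = 0; i < workers; i++) {
      pool.submit(() -> {
        try {
          metaOnline.await();
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
    }

    // The one task that would bring meta online is queued behind them and
    // never runs, so nothing completes and every worker reports "stuck".
    pool.submit(metaOnline::countDown);
  }
}
{code}

I'm not claiming that's the exact mechanism here (pid=17803 is a dispatch that 
simply never happens), only that once every PEWorker is wedged behind meta, 
nothing the master schedules afterwards can unwedge it.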



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
