[ 
https://issues.apache.org/jira/browse/HBASE-23593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17000215#comment-17000215
 ] 

Josh Elser commented on HBASE-23593:
------------------------------------

{quote}Server we are trying to open a region on, is struggling. Syncs taking a 
second and more, WALs backed-up ~ 100s. It is throwing CallQueueTooBig 
exceptions.
{quote}
Retyping because Jira ate my last comment; sorry for being briefer than I was 
originally.

Wellington and I saw something similar during HBASE-23076, but the issue there 
was ultimately that an RS was blocked trying to read the TableDescriptor for 
the Region it was trying to open. Because the assignments were going out on 
priority RPCs, they saturated all of the priority handlers, which caused lots 
of problems when the RS was also hosting meta (whose updates from the master 
are also sent at priority). For us, the root cause was over-aggressive locking 
in HBOSS.
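
For concreteness: the priority pool on the RS is a fixed-size set of handler 
threads, so once all of them are parked waiting on that TableDescriptor read, 
meta updates from the master just queue up behind them. A minimal sketch of 
reading the pool size from config; the key name and the 2.x default of 20 are 
from memory, so adjust for your version:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class PriorityHandlerCheck {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Size of the RS priority handler pool that serves priority RPCs
    // (open/close region, meta edits). 20 is the default I recall for 2.x.
    int priorityHandlers = conf.getInt("hbase.regionserver.metahandler.count", 20);
    System.out.println("hbase.regionserver.metahandler.count = " + priorityHandlers);
  }
}
{code}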

Mentioning it because, before we landed on the HBOSS locking, I had thoughts 
that maybe TRSP was spinning too fast. I would guess your situation is wholly 
different, but alarm bells started ringing when I read your comment, so I 
thought I'd mention it. Maybe what you are seeing manifests similarly to how 
we saw things in HBASE-23076.
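
If it helps while you are digging, a quick way I eyeball a wedged procedure is 
to dump the master's procedure and lock lists and grep for the pid. A minimal 
client-side sketch, assuming the 2.x Admin API where getProcedures() and 
getLocks() return string dumps (the shell's list_procedures / list_locks show 
the same thing):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class DumpProcedures {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      // String dump of all master procedures; grep it for the stuck
      // OpenRegionProcedure pid to see what state it reports.
      System.out.println(admin.getProcedures());
      // Held/queued procedure locks, useful when a proc never gets scheduled.
      System.out.println(admin.getLocks());
    }
  }
}
{code}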

> Stalled SCP Assigns
> -------------------
>
>                 Key: HBASE-23593
>                 URL: https://issues.apache.org/jira/browse/HBASE-23593
>             Project: HBase
>          Issue Type: Bug
>          Components: proc-v2
>    Affects Versions: 2.2.3
>            Reporter: Michael Stack
>            Priority: Major
>
> I'm stuck on this one so doing a write up here in case anyone else has ideas.
> Heavily loaded cluster. Server crashes. SCP cuts in and usually there's no 
> problem, but from time to time I'll see the SCP stuck waiting on an Assign to 
> finish. The assign seems stuck at the queuing of the OpenRegionProcedure: we've 
> stored the procedure but then not a peek thereafter. Later we'll see a 
> complaint that the region is STUCK. Doesn't recover. Doesn't run.
> Basic story is as follows:
> Server dies:
> {code}
>  2019-12-17 11:10:42,002 INFO 
> org.apache.hadoop.hbase.master.RegionServerTracker: RegionServer ephemeral 
> node deleted, processing expiration [s011.example.org,16020,1576561318119]
>  2019-12-17 11:10:42,002 DEBUG org.apache.hadoop.hbase.master.DeadServer: 
> Added s011.example.org,16020,1576561318119; numProcessing=1
> ...
>  2019-12-17 11:10:42,110 DEBUG org.apache.hadoop.hbase.master.DeadServer: 
> Started processing s011.example.org,16020,1576561318119; numProcessing=1
> {code}
> The dead server restarts, which purges the old server from the dead-server and 
> processing lists:
> {code}
>  2019-12-17 11:10:58,145 DEBUG org.apache.hadoop.hbase.master.DeadServer: 
> Removed s011.example.org,16020,1576561318119, processing=true, numProcessing=0
>  2019-12-17 11:10:58,145 DEBUG org.apache.hadoop.hbase.master.ServerManager: 
> STARTUP: Server s011.example.org,16020,1576581054424 came back up, removed it 
> from the dead servers list
> {code}
>  
> ....even though we are still processing logs in the SCP of the old server...
> {code}
>  2019-12-17 11:10:58,392 INFO org.apache.hadoop.hbase.wal.WALSplitUtil: 
> Archived processed log 
> hdfs://nameservice1/hbase/WALs/s011.example.org,16020,1576561318119-splitting/s011.example.org%2C16020%2C1576561318119.s011.example.org%2C16020%2C1576561318119.regiongroup-0.1576580737491
>  to hdfs://nameservice1/hbase/oldWALs/s011.example.org%2C16020%2C1576561318119.s011.example.org%2C16020%2C1576561318119.regiongroup-0.1576580737491
> {code}
> I thought the early purge of the dead server was a problem, but after study I 
> don't think so.
> The WAL split took two minutes and the server was removed from dead 
> servers...  three minutes earlier...
> {code}
>  2019-12-17 11:13:05,356 INFO org.apache.hadoop.hbase.master.SplitLogManager: 
> Finished splitting (more than or equal to) 30.6G (32908464448 bytes) in 228 
> log files in 
> [hdfs://nameservice1/hbase/WALs/s011.example.org,16020,1576561318119-splitting]
>  in 143236ms
> {code}
>  Almost immediately we get this:
> {code}
>  2019-12-17 11:14:08,649 WARN 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: STUCK 
> Region-In-Transition state=OPEN, 
> location=s011.example.org,16020,1576561318119, table=t1, 
> region=9d6d6d5f261a0cbe7c9e85091f2c2bd4
> {code}
> For this region, I see the SCP making an Assign for it, which then creates an 
> OpenRegionProcedure subtask. This is where it gets stuck. No progress after 
> this. The procedure does not come alive to run.
> Here are logs for the ORP pid=421761:
> {code}
> 2019-12-17 11:38:34,761 INFO 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: Initialized 
> subprocedures=[{pid=421761, ppid=402475, state=RUNNABLE; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure}]
> 2019-12-17 11:38:34,765 DEBUG 
> org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Add 
> TableQueue(t1, xlock=false sharedLock=3144 size=427) to run queue because: 
> the exclusive lock is not held by anyone when adding pid=421761, ppid=402475, 
> state=RUNNABLE; org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
> 2019-12-17 11:38:34,770 DEBUG 
> org.apache.hadoop.hbase.procedure2.RootProcedureState: Add procedure 
> pid=421761, ppid=402475, state=RUNNABLE, locked=true; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure as the 3193th 
> rollback step
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
