[jira] [Commented] (HBASE-21742) master can create bad procedures during abort, making entire cluster unusable

Sergey Shelukhin (JIRA) Mon, 11 Feb 2019 18:43:43 -0800


    [ 
https://issues.apache.org/jira/browse/HBASE-21742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765608#comment-16765608
 ]


Sergey Shelukhin commented on HBASE-21742:
------------------------------------------

The same test fails for me. I finally got around to looking at it, but I'm not 
sure yet why. Will look further soon. We have been running this patch on a 
cluster for some time and it seems to be fine.

> master can create bad procedures during abort, making entire cluster unusable
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-21742
>                 URL: https://issues.apache.org/jira/browse/HBASE-21742
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2, meta, Region Assignment
>    Affects Versions: 3.0.0
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Critical
>         Attachments: HBASE-21742.patch
>
>
> Some small HDFS hiccup causes master and meta RS to fail together. Master 
> goes first:
> {noformat}
> 2019-01-18 08:09:46,790 INFO  [KeepAlivePEWorker-311] 
> zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
> ZooKeeper as meta-rs,17020,1547824792484
> ...
> 2019-01-18 10:01:16,904 ERROR [PEWorker-11] master.HMaster: ***** ABORTING 
> master master,17000,1547604554447: FAILED [blah] *****
> ...
> 2019-01-18 10:01:17,087 INFO  [master/master:17000] 
> assignment.AssignmentManager: Stopping assignment manager
> {noformat}
> A bunch of stuff keeps happening after that, including procedure retries, 
> which are also suspect, but not the point here:
> {noformat}
> 2019-01-18 10:01:21,598 INFO  [PEWorker-3] procedure2.TimeoutExecutorThread: 
> ADDED pid=104031, state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, ... 
> {noformat}
> Then the meta RS decides it's time to go:
> {noformat}
> 2019-01-18 10:01:25,319 INFO  [RegionServerTracker-0] 
> master.RegionServerTracker: RegionServer ephemeral node deleted, processing 
> expiration [meta-rs,17020,1547824792484]
> ...
> 2019-01-18 10:01:25,463 INFO  [RegionServerTracker-0] 
> assignment.AssignmentManager: Added meta-rs,17020,1547824792484 to dead 
> servers which carryingMeta=false, submitted ServerCrashProcedure pid=104313
> {noformat}
> Note that the SCP for this server has carryingMeta=false, even though the 
> server is holding meta. That is because, as per "Stopping assignment manager" 
> above, the AM state, including the region map, had already been cleared.
> This SCP gets persisted, so when the next master starts, it waits forever for 
> meta to be onlined, while there is no SCP with carryingMeta=true to online it.
> The only way around this is to delete the procv2 WAL. The master has all the 
> information it needs here, as it often does in bugs I've found recently, but 
> some split-brain procedures cause it to get stuck one way or another.
> I will file a separate bug about that.
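
To make the sequence in the description above concrete, here is a toy model of 
the race. It is not HBase code: ToyAssignmentManager, ToyCrashProcedure, 
expireUnguarded and expireGuarded are hypothetical names made up for this 
sketch, and the guard shown is only one conceivable mitigation, not necessarily 
what the attached patch does.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Toy model of the race described above; none of these types are real HBase classes. */
public class CarryingMetaSketch {
    static final String META_REGION = "hbase:meta";

    /** Hypothetical stand-in for the assignment manager's in-memory region map. */
    static class ToyAssignmentManager {
        private final Map<String, String> regionToServer = new HashMap<>();
        private boolean running = true;

        void assign(String region, String server) {
            regionToServer.put(region, server);
        }

        /** Models "Stopping assignment manager": the in-memory state is cleared. */
        void stop() {
            running = false;
            regionToServer.clear();
        }

        boolean isRunning() {
            return running;
        }

        /** After stop() this is false for every server, including the one holding meta. */
        boolean isCarryingMeta(String server) {
            return server.equals(regionToServer.get(META_REGION));
        }
    }

    /** Hypothetical stand-in for a persisted crash procedure. */
    static class ToyCrashProcedure {
        final String server;
        final boolean carryingMeta;

        ToyCrashProcedure(String server, boolean carryingMeta) {
            this.server = server;
            this.carryingMeta = carryingMeta;
        }
    }

    /** Problematic path: trusts the region map even if it has already been cleared. */
    static ToyCrashProcedure expireUnguarded(ToyAssignmentManager am, String server) {
        return new ToyCrashProcedure(server, am.isCarryingMeta(server));
    }

    /** Illustrative guard (an assumption): submit nothing once the manager has stopped. */
    static Optional<ToyCrashProcedure> expireGuarded(ToyAssignmentManager am, String server) {
        if (!am.isRunning()) {
            return Optional.empty();
        }
        return Optional.of(new ToyCrashProcedure(server, am.isCarryingMeta(server)));
    }

    public static void main(String[] args) {
        String metaRs = "meta-rs,17020,1547824792484";
        ToyAssignmentManager am = new ToyAssignmentManager();
        am.assign(META_REGION, metaRs);

        am.stop(); // master is aborting; region map is now empty

        ToyCrashProcedure bad = expireUnguarded(am, metaRs);
        System.out.println("unguarded carryingMeta=" + bad.carryingMeta);
        System.out.println("guarded submitted=" + expireGuarded(am, metaRs).isPresent());
    }
}
```

In the unguarded path the procedure is built from an empty map, which matches 
the carryingMeta=false seen in the log for the server that was in fact holding 
meta. With the guard, an aborting master submits nothing, so the decision is 
left to the next master.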



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
