[ https://issues.apache.org/jira/browse/HBASE-21742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16765608#comment-16765608 ]
Sergey Shelukhin commented on HBASE-21742:
------------------------------------------

The same test fails for me. I finally got around to looking at it, but I'm not sure yet why it fails. Will look soon.
We were running this patch on a cluster for some time and it seems to be fine.

> master can create bad procedures during abort, making entire cluster unusable
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-21742
>                 URL: https://issues.apache.org/jira/browse/HBASE-21742
>             Project: HBase
>          Issue Type: Bug
>          Components: amv2, meta, Region Assignment
>    Affects Versions: 3.0.0
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Critical
>         Attachments: HBASE-21742.patch
>
>
> Some small HDFS hiccup causes master and meta RS to fail together. Master goes first:
> {noformat}
> 2019-01-18 08:09:46,790 INFO [KeepAlivePEWorker-311] zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in ZooKeeper as meta-rs,17020,1547824792484
> ...
> 2019-01-18 10:01:16,904 ERROR [PEWorker-11] master.HMaster: ***** ABORTING master master,17000,1547604554447: FAILED [blah] *****
> ...
> 2019-01-18 10:01:17,087 INFO [master/master:17000] assignment.AssignmentManager: Stopping assignment manager
> {noformat}
> A bunch of stuff keeps happening, including procedure retries, which is also suspect, but not the point here:
> {noformat}
> 2019-01-18 10:01:21,598 INFO [PEWorker-3] procedure2.TimeoutExecutorThread: ADDED pid=104031, state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, ...
> {noformat}
> Then the meta RS decides it's time to go:
> {noformat}
> 2019-01-18 10:01:25,319 INFO [RegionServerTracker-0] master.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [meta-rs,17020,1547824792484]
> ...
> 2019-01-18 10:01:25,463 INFO [RegionServerTracker-0] assignment.AssignmentManager: Added meta-rs,17020,1547824792484 to dead servers which carryingMeta=false, submitted ServerCrashProcedure pid=104313
> {noformat}
> Note that the SCP for this server has meta=false, even though the server is holding the meta. That is because, as per the "Stopping assignment manager" line above, the AM state, including the region map, got cleared.
> This SCP gets persisted, so when the next master starts, it waits forever for meta to be onlined, while there's no SCP with meta=true to online it.
> The only way around this is to delete the procv2 WAL - master has all the information here, as it often does in bugs I've found recently, but some split brain procedures cause it to get stuck one way or another.
> I will file a separate bug about that.
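
The carryingMeta=false outcome described above can be illustrated with a toy model. This is only a sketch under stated assumptions: every class and method name below (CarryingMetaSketch, ToyAssignmentState, ToyCrashProcedure, submitUnguarded, submitGuarded) is hypothetical and is not HBase's real API, and the guard shown is one possible shape of a fix, not necessarily what the attached HBASE-21742.patch does.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model of the race; none of these names are HBase classes.
public class CarryingMetaSketch {
    public static final String META_REGION = "hbase:meta";

    // Toy stand-in for the assignment manager's in-memory region map.
    public static class ToyAssignmentState {
        private final Map<String, String> regionToServer = new ConcurrentHashMap<>();
        private volatile boolean running = true;

        public void assign(String region, String server) {
            regionToServer.put(region, server);
        }

        // Models the "Stopping assignment manager" step: in-memory state is dropped.
        public void stop() {
            running = false;
            regionToServer.clear();
        }

        public boolean isRunning() {
            return running;
        }

        // After stop() the map is empty, so this returns false for every server.
        public boolean isCarryingMeta(String server) {
            return server.equals(regionToServer.get(META_REGION));
        }
    }

    // Toy stand-in for a crash procedure that would be persisted.
    public static class ToyCrashProcedure {
        public final String server;
        public final boolean carryingMeta;

        public ToyCrashProcedure(String server, boolean carryingMeta) {
            this.server = server;
            this.carryingMeta = carryingMeta;
        }
    }

    // Problematic shape: trusts the in-memory map even after it has been cleared.
    public static ToyCrashProcedure submitUnguarded(ToyAssignmentState am, String server) {
        return new ToyCrashProcedure(server, am.isCarryingMeta(server));
    }

    // One possible guard (an assumption): create nothing once the state is gone,
    // leaving crash handling to the next master, which rebuilds state first.
    public static Optional<ToyCrashProcedure> submitGuarded(ToyAssignmentState am, String server) {
        if (!am.isRunning()) {
            return Optional.empty();
        }
        return Optional.of(new ToyCrashProcedure(server, am.isCarryingMeta(server)));
    }

    public static void main(String[] args) {
        ToyAssignmentState am = new ToyAssignmentState();
        am.assign(META_REGION, "meta-rs");
        am.stop(); // master is aborting
        System.out.println("unguarded carryingMeta=" + submitUnguarded(am, "meta-rs").carryingMeta);
        System.out.println("guarded submitted=" + submitGuarded(am, "meta-rs").isPresent());
    }
}
```

In this model the unguarded path records a procedure with the wrong flag as soon as stop() has run, which matches the persisted SCP in the log; the guarded path simply declines to create a procedure from state it can no longer trust.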