[jira] [Comment Edited] (HBASE-21742) master can create bad procedures during abort, making entire cluster unusable

Sergey Shelukhin (JIRA) Tue, 22 Jan 2019 15:01:08 -0800


    [ https://issues.apache.org/jira/browse/HBASE-21742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16749246#comment-16749246 ]


Sergey Shelukhin edited comment on HBASE-21742 at 1/22/19 11:00 PM:
--------------------------------------------------------------------

This is an attempt at a simple fix: shut down the procedure store first, so 
that procedures cannot be saved during shutdown. 
I'm not sure this is the best approach, but I suspect that a proper fix would 
require massive refactoring: all the procedures are currently independent, and 
each would have to check that it is not relying on incorrect state in any class 
during shutdown. For now, it should be enough to at least prevent the master 
from saving any state that could be incorrect. The master is already supposed 
to be able to recover if e.g. kill -9 is run against it, or a machine 
physically dies, so not saving state should be OK.
[~allan163] does this make sense to you?
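
To make the ordering concrete, here is a minimal sketch of the idea of stopping 
the procedure store before the rest of the master state is torn down. It is not 
the attached patch and does not use the real HBase classes: SketchProcedureStore, 
SketchMaster and all of their methods are hypothetical names made up for 
illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a procedure store; NOT the HBase ProcedureStore API. */
class SketchProcedureStore {
  private boolean running = true;
  private final List<String> persisted = new ArrayList<>();

  /** Persists a procedure only while the store is running; returns false if dropped. */
  synchronized boolean insert(String procedure) {
    if (!running) {
      return false; // abort has started: refuse to save possibly-bad state
    }
    persisted.add(procedure);
    return true;
  }

  /** After this call, no further procedures can be persisted. */
  synchronized void stop() {
    running = false;
  }

  /** Returns a copy of everything persisted so far. */
  synchronized List<String> persisted() {
    return new ArrayList<>(persisted);
  }
}

/** Hypothetical master that stops the store before clearing in-memory state. */
class SketchMaster {
  final SketchProcedureStore store = new SketchProcedureStore();
  private boolean stateValid = true;

  void abort() {
    store.stop();       // 1. stop the store first
    stateValid = false; // 2. only then tear down in-memory state
  }

  /** Submits a procedure built from whatever in-memory state is currently visible. */
  boolean submit(String name) {
    return store.insert(name + " builtFromValidState=" + stateValid);
  }
}
```

In this sketch, a procedure submitted after abort() is simply dropped instead of 
being written with builtFromValidState=false, which is the same outcome as a 
kill -9 at that moment.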



> master can create bad procedures during abort, making entire cluster unusable
> -----------------------------------------------------------------------------
>
>                 Key: HBASE-21742
>                 URL: https://issues.apache.org/jira/browse/HBASE-21742
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Critical
>         Attachments: HBASE-21742.patch
>
>
> A small HDFS hiccup causes the master and the meta RS to fail together. The 
> master goes first:
> {noformat}
> 2019-01-18 08:09:46,790 INFO  [KeepAlivePEWorker-311] 
> zookeeper.MetaTableLocator: Setting hbase:meta (replicaId=0) location in 
> ZooKeeper as meta-rs,17020,1547824792484
> ...
> 2019-01-18 10:01:16,904 ERROR [PEWorker-11] master.HMaster: ***** ABORTING 
> master master,17000,1547604554447: FAILED [blah] *****
> ...
> 2019-01-18 10:01:17,087 INFO  [master/master:17000] 
> assignment.AssignmentManager: Stopping assignment manager
> {noformat}
> A lot keeps happening after that, including procedure retries, which are 
> also suspect but not the point here:
> {noformat}
> 2019-01-18 10:01:21,598 INFO  [PEWorker-3] procedure2.TimeoutExecutorThread: 
> ADDED pid=104031, state=WAITING_TIMEOUT:REGION_STATE_TRANSITION_CLOSE, ... 
> {noformat}
> Then the meta RS decides it's time to go:
> {noformat}
> 2019-01-18 10:01:25,319 INFO  [RegionServerTracker-0] 
> master.RegionServerTracker: RegionServer ephemeral node deleted, processing 
> expiration [meta-rs,17020,1547824792484]
> ...
> 2019-01-18 10:01:25,463 INFO  [RegionServerTracker-0] 
> assignment.AssignmentManager: Added meta-rs,17020,1547824792484 to dead 
> servers which carryingMeta=false, submitted ServerCrashProcedure pid=104313
> {noformat}
> Note that the SCP for this server has carryingMeta=false, even though the 
> server is holding meta. That is because, per the "Stopping assignment 
> manager" line above, the AM state, including the region map, was cleared.
> This SCP gets persisted, so when the next master starts, it waits forever for 
> meta to come online, because there is no SCP with carryingMeta=true to 
> online it.
> The only way around this is to delete the procv2 WAL. The master has all the 
> information it needs here, as it often does in the bugs I've found recently, 
> but some split-brain procedures cause it to get stuck one way or another.
> I will file a separate bug about that.
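
The carryingMeta=false result in the quoted description can be reproduced in 
miniature. The sketch below assumes, as the description states, that stopping 
the assignment manager clears its region map. SketchAssignmentState and its 
methods are hypothetical and much simpler than the real AssignmentManager.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified model of assignment state; NOT the HBase AssignmentManager. */
class SketchAssignmentState {
  // region name -> server currently hosting it
  private final Map<String, String> regionLocations = new HashMap<>();

  void assign(String region, String server) {
    regionLocations.put(region, server);
  }

  /** Models "Stopping assignment manager": in-memory region state is dropped. */
  void stop() {
    regionLocations.clear();
  }

  /** Answers from the in-memory map only, so it is wrong once the map is cleared. */
  boolean isCarryingMeta(String server) {
    return server.equals(regionLocations.get("hbase:meta"));
  }
}
```

Calling isCarryingMeta for the meta server returns true before stop() and false 
afterwards, even though nothing about the server itself has changed. A crash 
procedure created from the second answer would carry the wrong flag.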



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
