[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979442#comment-16979442 ]

Michael Stack commented on HBASE-21035:
---

To address [~allan163]'s original description more directly: "Since no one will bring meta region online."

We added a 'fix' to hbck2 to address the above. 'hbck2 --skip assigns 1588230740' will force an assign of hbase:meta if the Master is stuck in startup waiting on hbase:meta to come online AND there is no Procedure to assign hbase:meta (because, for example, the MasterProcWALs have all been removed or are corrupt). See the help usage from hbck2 for the assigns command. It will also work for the namespace region assign (HBASE-21156).

{quote}
assigns [OPTIONS] ...
Options:
-o,--override override ownership by another procedure
A 'raw' assign that can be used even during Master initialization (if the -skip flag is specified). Skirts Coprocessors. Pass one or more encoded region names. 1588230740 is the hard-coded name for the hbase:meta region and de00010733901a05f5a2a3a382e27dd4 is an example of what a user-space encoded region name looks like. For example:
$ HBCK2 assign 1588230740 de00010733901a05f5a2a3a382e27dd4
Returns the pid(s) of the created AssignProcedure(s) or -1 if none.
{quote}

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
> Issue Type: Sub-task
> Affects Versions: 2.1.0
> Reporter: Allan Yang
> Assignee: Allan Yang
> Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch, HBASE-21035.branch-2.1.001.patch
>
> After HBASE-20708, we changed the way we init after master starts. It will only check WAL dirs and compare them to the Zookeeper RS nodes to decide which servers need to expire. For servers whose dir ends with 'SPLITTING', we assume that there will be a SCP for it.
> But if the server with the meta region crashed before master restarts, and if all the procedure WALs are lost (due to a bug, or deleted manually, whatever), the newly restarted master will be stuck when initing. Since no one will bring meta region online.
> Although it is an anomalous case, I think no matter what happens, we need to online the meta region. Otherwise, we are sitting ducks; nothing can be done.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
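As an illustration of the command form described in the comment above, here is a minimal sketch that assembles the argument list for an 'assigns' call. The helper name `build_assigns_command`, the `launcher` parameter, and the rule that user-space encoded names are 32 lowercase hex characters (inferred only from the example name in the help text) are assumptions made for this sketch; they are not part of hbck2 itself.

```python
import re

# Hard-coded encoded name of the hbase:meta region, per the help text above.
META_ENCODED_NAME = "1588230740"

# Assumption: user-space encoded names look like the example
# de00010733901a05f5a2a3a382e27dd4, i.e. 32 lowercase hex characters.
_USER_REGION = re.compile(r"[0-9a-f]{32}")


def build_assigns_command(encoded_names, skip=False, override=False, launcher="hbck2"):
    """Return the argv list for an 'assigns' call (hypothetical helper)."""
    if not encoded_names:
        raise ValueError("pass one or more encoded region names")
    for name in encoded_names:
        if name != META_ENCODED_NAME and not _USER_REGION.fullmatch(name):
            raise ValueError(f"not an encoded region name: {name!r}")
    argv = [launcher]
    if skip:
        # The help text says the skip flag allows use during Master initialization.
        argv.append("--skip")
    argv.append("assigns")
    if override:
        # Override ownership by another procedure.
        argv.append("--override")
    argv.extend(encoded_names)
    return argv


print(" ".join(build_assigns_command([META_ENCODED_NAME], skip=True)))
```

The printed line has the same shape as the command quoted in the comment. Rejecting anything that is not an encoded name is a guard against passing a full region name or a table name by mistake.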
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672115#comment-16672115 ]

stack commented on HBASE-21035:
---

I tried it [~allan163] HBASE-21425

What versions did you try?
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672484#comment-16672484 ]

Allan Yang commented on HBASE-21035:
---

{quote}
I tried it Allan Yang HBASE-21425
{quote}
It's our internal version, based on branch-2.0, with some patches backported from branch-2.1.
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673369#comment-16673369 ]

stack commented on HBASE-21035:
---

Thanks [~allan163]. Yours was not the issue over in HBASE-21425? You could not bring the meta online? hbck2 is not an option on your hbase2 version? Not yet perhaps?
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673987#comment-16673987 ]

Allan Yang commented on HBASE-21035:
---

[~stack], no, it is not. The reason is that I stopped the RS first, leaving the splitting WAL dir there, and now the master 'thinks' there is always a SCP for the splitting WAL dir, so no one will process the dirs of these down RSes, and no one will bring any regions online.
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605123#comment-16605123 ]

stack commented on HBASE-21035:
---

Late to the party.

I've been playing with removing the procedure WALs and doing other damage to the cluster to see how well we recover. I ran into the issue that [~allan163] describes here, where there is no assign for the meta region on startup. In my case, I'd removed the procedure WAL dirs as Allan does in his test but, unlike Allan's case, there was no WAL dir for the server that had been carrying meta. I think it had been removed across restarts (I didn't check), but I was able to repro by manually removing the empty WAL dir (empty because there had been a clean shutdown).

After reading the healthy back and forth above: while the system seems pretty robust as is -- I had trouble breaking it by removing stuff -- and Allan's patch would catch a particular case not covered now, I agree that we need the "assign meta" fix-it in our hbck2 vocabulary. Let me add scheduling of a meta assign (and log search and recovery) as an hbck2 option to the list of fix-its we need in hbck2, as we discussed in person a few weeks back.

In my investigations, it seems we need something similar for the hbase:namespace table. It can get banjaxed similarly, and if it is not online the cluster is a mess. Master initialization gets stuck trying to read from meta to populate the TableStates. It later gets stuck trying to initialize the TableNamespaceManager if the namespace table is not online. I filed HBASE-21156.
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16608185#comment-16608185 ]

stack commented on HBASE-21035:
---

I'm back. My master is aborting after spending hours reconstructing the assignment state by reading 300+ WALs. The Master then becomes active. In the background there are procedures running and finishing... mostly SCPs. Then my master dies with:

2018-09-07 22:21:58,968 ERROR org.apache.hadoop.hbase.master.HMaster: * ABORTING master vc0207.halxg.cloudera.com,22001,1536380265734: Unhandled exception. Starting shutdown. *
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=31, exceptions:
Fri Sep 07 22:21:58 PDT 2018, null, java.net.SocketTimeoutException: callTimeout=6, callDuration=69864: org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not online on vd0412.halxg.cloudera.com,22101,1536380043533

i.e. meta is not online even though the above server's SCP has completed. This is a dirty install, so what was up in zk for the meta location could be long stale, but here we have a state where no SCPs are running and meta is not online. I need to write the tool to insert a meta assign. It just takes 4 or 5 hours before I know whether it is the fix for this problem. And then there is the scan of the hbase:namespace table next.

I'm thinking of waiting on all SCPs to finish before we do our first meta scan and, if meta is still not online, auto-scheduling the restore meta procedure ... splitting meta logs inline and then assigning meta. Would this violate your principle, [~Apache9]? In other words, I need to write the restore meta procedure -- it would split meta logs and then do the assign of meta -- but I think we should auto-schedule it in the case above.
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16608254#comment-16608254 ]

Duo Zhang commented on HBASE-21035:
---

If the cluster is not in a good state and several RSes keep crashing, the master will hang there forever if you need to wait until all SCPs finish... And how do you determine programmatically that meta is on a stale server? My concern is that there will be races: a server crash can happen at any time, so including this logic in the startup code path will make the code flaky...

We can have a tool to do something like this, but when to use it should be decided by a human. Maybe we could do more checks and print something in the log saying that the state seems incorrect, please check XXX and XXX to see if something is wrong, and try the XXX tool?
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16608285#comment-16608285 ]

stack commented on HBASE-21035:
---

Let me do what you suggest first.
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16608297#comment-16608297 ]

Duo Zhang commented on HBASE-21035:
---

I think one thing we could do is collect the WAL directories that end with the splitting suffix and check whether there is a SCP associated with each of them, just like what [~allan163] has done here. But instead of scheduling a SCP directly, we just log it and tell the operators that something may have gone wrong.
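The check proposed in the comment above can be sketched roughly as follows. This is only an illustration over plain strings, not HBase code: the function names are hypothetical, the `-splitting` suffix is an assumption about how such directories are named, and the inputs stand in for a WAL directory listing and the set of servers that currently have a SCP.

```python
import logging

log = logging.getLogger("master.startup")

# Assumption: WAL dirs that are being split carry this suffix.
SPLITTING_SUFFIX = "-splitting"


def splitting_dirs_without_scp(wal_dirs, servers_with_scp):
    """Return server names whose WAL dir is marked splitting but has no SCP."""
    orphans = []
    for dir_name in wal_dirs:
        if not dir_name.endswith(SPLITTING_SUFFIX):
            continue
        server = dir_name[: -len(SPLITTING_SUFFIX)]
        if server not in servers_with_scp:
            orphans.append(server)
    return orphans


def warn_about_orphans(wal_dirs, servers_with_scp):
    """Log each suspicious dir instead of scheduling a SCP for it."""
    orphans = splitting_dirs_without_scp(wal_dirs, servers_with_scp)
    for server in orphans:
        log.warning(
            "WAL dir for %s ends with %s but no SCP is associated with it; "
            "an operator should check the cluster state",
            server,
            SPLITTING_SUFFIX,
        )
    return orphans
```

Keeping detection separate from any repair matches the position taken in the thread: the master only reports, and a human decides whether to run a fix-it tool.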
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610238#comment-16610238 ]

Allan Yang commented on HBASE-21035:
---

{quote}
i.e. meta is not online even though the above server's SCP has completed.
{quote}
[~stack], sir, I don't get it, how can SCP finish without meta online? No region to assign?
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610503#comment-16610503 ]

stack commented on HBASE-21035:
---

bq. stack, sir, I don't get it, how can SCP finish without meta online? No region to assign?

It's hard to trace why on my big cluster, but we somehow lose accounting of meta after a bunch of crashing and restarting of the master. I've also been playing with the startup sequence, which probably messed things up (Masters do not progress beyond waitForMasterActive -- they don't get to the run method, it seems). Anyway, I can get into a state where all SCPs are done but meta is not online (last night meta was in the OPENING state after all SCPs were done). If meta is not online and we can't scan it to loadMeta, then the Master shuts down after a minute.

I'm working on having the Master hold before the first meta scan if it can't find an online meta, so the operator can at least insert an assign (I think we might just auto-assign if all SCPs are done and there is still no meta online). We need the 'hold' at least. Replaying all the WALs can take a while. It would be frustrating for an operator to watch hundreds of backed-up WALs replay and then have the Master exit when done.
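The 'hold' described in the comment above could look something like the following sketch. It is not the Master's actual startup code: `hold_until_meta_online`, its parameters, and the injected `is_meta_online` probe are all hypothetical, and the hint in the log message simply repeats the hbck2 command quoted elsewhere in this thread.

```python
import time


def hold_until_meta_online(is_meta_online, poll_seconds=10.0, log_every=6,
                           sleep=time.sleep, log=print):
    """Block until is_meta_online() returns True.

    Returns the number of polls that found meta offline. The probe, the sleep
    function and the logger are injected so the loop can be exercised without
    a cluster.
    """
    failed = 0
    while not is_meta_online():
        failed += 1
        # Log on the first failed poll and then once every log_every polls.
        if (failed - 1) % log_every == 0:
            log("hbase:meta is not online; holding startup. "
                "An operator may need to schedule an assign, for example "
                "'hbck2 --skip assigns 1588230740'")
        sleep(poll_seconds)
    return failed


if __name__ == "__main__":
    answers = iter([False, False, True])
    polls = hold_until_meta_online(lambda: next(answers), sleep=lambda s: None)
    print("meta came online after", polls, "failed polls")
```

The loop deliberately never exits on its own, so the hours spent replaying WALs are not thrown away; whether to auto-assign instead is the open question in the thread and is left out of the sketch.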
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610520#comment-16610520 ]

Duo Zhang commented on HBASE-21035:
---

I think [~allan163] has a valid point. For SCP, if meta is not online, i.e., we haven't finished the loadMeta for the AssignmentManager yet, then the SCP will be suspended. So it is a bit strange that all the SCPs are finished but meta is still not online...
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610534#comment-16610534 ]

stack commented on HBASE-21035:
---

bq. I think Allan Yang has a valid point.

I suppose they have no regions to assign. Let me try and do a better accounting. Thanks boys.
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610550#comment-16610550 ]

stack commented on HBASE-21035:
---

Oh, I should have mentioned, I removed all master proc WALs at one stage (too many, no tooling yet to fix STUCK assigns, each start up accumulated more WALs). That would explain "nothing to assign" and why hbase:meta had no assign. Now I'm in the state where Master exits because zk has a location for meta, we keep trying to go there but it will never be open at the zk location.
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16611514#comment-16611514 ]

Allan Yang commented on HBASE-21035:
---

{quote}
Oh, I should have mentioned, I removed all master proc WALs at one stage (too many, no tooling yet to fix STUCK assigns, each start up accumulated more WALs). That would explain "nothing to assign" and why hbase:meta had no assign. Now I'm in the state where Master exits because zk has a location for meta, we keep trying to go there but it will never be open at the zk location.
{quote}
That explains it, then. You encountered the same case I did -- 'no one brings up the meta table if all procedures are lost'.
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16611629#comment-16611629 ] stack commented on HBASE-21035: --- .002 is a hack on top of [~allan163] 's patch. I used it doing fixup on a cluster here, one where I had to remove procedure WALs because too many too process and when done, a bunch of state needed repair. It includes some miscellaneous but main thing is assign of meta and of namespace so master startup can continue (without these, master exits because it can't scan a meta, and later a namespace table). I've also been playing around with a FixMetaProcedure that does force online. Tricky part is finding all the meta WALs and then doing the inline split. Trying to see if I should remove meta files when done and trying to figure how to fence off the meta if it open already. Will be back later. > Meta Table should be able to online even if all procedures are lost > --- > > Key: HBASE-21035 > URL: https://issues.apache.org/jira/browse/HBASE-21035 > Project: HBase > Issue Type: Sub-task >Affects Versions: 2.1.0 >Reporter: Allan Yang >Assignee: Allan Yang >Priority: Major > Attachments: HBASE-21035.branch-2.0.001.patch, > HBASE-21035.branch-2.1.001.patch > > > After HBASE-20708, we changed the way we init after master starts. It will > only check WAL dirs and compare to Zookeeper RS nodes to decide which server > need to expire. For servers which's dir is ending with 'SPLITTING', we assure > that there will be a SCP for it. > But, if the server with the meta region crashed before master restarts, and > if all the procedure wals are lost (due to bug, or deleted manually, > whatever), the new restarted master will be stuck when initing. Since no one > will bring meta region online. > Although it is an anomaly case, but I think no matter what happens, we need > to online meta region. Otherwise, we are sitting ducks, noting can be done. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
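The startup check described in the issue can be sketched roughly as below. This is a simplified illustration, not the Master's actual code: the function names, the "-splitting" suffix constant, and the in-memory inputs are all assumptions made for the example. It shows why a splitting dir with no surviving SCP is left unrecovered once the procedure WALs are gone.

```python
# Hypothetical model of the post-HBASE-20708 startup check (names are illustrative).
SPLITTING_SUFFIX = "-splitting"  # assumed suffix of a WAL dir that is being split


def classify_wal_dirs(wal_dirs, live_servers):
    """Split WAL dir names into servers to expire and servers assumed to have an SCP.

    wal_dirs: iterable of WAL directory names, one per region server.
    live_servers: set of server names that currently have a ZooKeeper RS node.
    """
    expire, splitting = set(), set()
    for d in wal_dirs:
        if d.endswith(SPLITTING_SUFFIX):
            # A crash procedure is assumed to exist already for this server.
            splitting.add(d[: -len(SPLITTING_SUFFIX)])
        elif d not in live_servers:
            # WAL dir without a live RS node: the server must be expired.
            expire.add(d)
    return expire, splitting


def unrecovered_servers(splitting, scp_targets):
    """Servers whose dir is splitting but for which no SCP actually exists."""
    return set(splitting) - set(scp_targets)
```

With every procedure lost, `scp_targets` is empty, so each splitting server (possibly the one that hosted hbase:meta) shows up in `unrecovered_servers` and nothing brings its regions back.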
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16611647#comment-16611647 ]

Allan Yang commented on HBASE-21035:

{quote}
Tricky part is finding all the meta WALs and then doing the inline split
{quote}
Shouldn't we just trust the meta location in the ZK node? Just splitting the meta log on that server should be enough (we don't have DLR now, so the meta region won't suffer multiple crashes).
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16611658#comment-16611658 ]

Duo Zhang commented on HBASE-21035:
---

I think the most difficult thing here is how we determine that something is wrong and that we should force meta online to allow hbck2 to do something, as it would be easy to introduce races if we deploy a FixingMetaProcedure when there is actually no problem...

[~stack] So the problem here is that, if we cannot loadMeta, the master will exit and we can do nothing from outside? IIRC the rpc service should have been started already, otherwise we could not assign meta. So maybe the fix here is that we should make the master retry for a longer time before exiting, and add a new method in the rpc service for hbck2 to schedule some recovery procedures?
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612261#comment-16612261 ]

stack commented on HBASE-21035:
---

bq. Shouldn't we just trust the meta location in the ZK node?

Maybe I could look there first. If I find something, go with it. Otherwise, hunt all WAL dirs for .meta WALs. That'd be best. I'll have to take the lease on any WALs I find, if only to read them. Maybe I should delete them when done, though I think there should be no damage done if they are replayed. I need to figure out how to split inline rather than ask our splitting service to do it for us.

bq. So maybe the problem here is that, we should make master retrying for a longer time before exiting, and add a new method in the rpc service, which is for hbck2 to schedule some recovery procedures?

Was thinking on this. Yeah, let me change the isMeta method here so that, rather than schedule the assign-of-meta procedure, I instead drop logs asking the operator to schedule a meta assign and perhaps a procedure that does the recovery of the meta WALs. Then do the same for namespace (namespace needs to go away). Let me try it.
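The lookup order discussed above (trust the ZK meta location first, fall back to hunting every WAL dir) might look like the following sketch. It is only an illustration: `layout` is a hypothetical in-memory stand-in for a listing of the WAL root (dir name mapped to file names), and the ".meta" and "-splitting" suffixes are assumptions about naming. Lease recovery and the inline split are not modeled.

```python
# Hypothetical sketch: find candidate meta WALs, preferring the ZK-recorded server.
META_SUFFIX = ".meta"            # assumed suffix marking a meta WAL file
SPLITTING_SUFFIX = "-splitting"  # assumed suffix of a WAL dir that is being split


def _meta_wals_in(layout, dirs):
    """Return 'dir/file' paths of meta WALs found in the given dirs, in sorted order."""
    found = []
    for d in sorted(dirs):
        for f in sorted(layout.get(d, [])):
            if f.endswith(META_SUFFIX):
                found.append(d + "/" + f)
    return found


def find_meta_wals(layout, zk_meta_server=None):
    """Look under the server named in ZK first; otherwise scan all WAL dirs."""
    if zk_meta_server is not None:
        trusted = [
            d for d in layout
            if d in (zk_meta_server, zk_meta_server + SPLITTING_SUFFIX)
        ]
        found = _meta_wals_in(layout, trusted)
        if found:
            return found  # found something at the ZK location: go with it
    return _meta_wals_in(layout, layout.keys())
```

Passing `zk_meta_server=None` models the case where the ZK node is missing, and a stale ZK location falls through to the full scan.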
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612645#comment-16612645 ]

stack commented on HBASE-21035:
---

I put up a patch on HBASE-21191 that steals [~allan163]'s patch from here. It has the master go into a holding pattern, complaining that hbase:meta is not online and asking for operator intervention. Ditto for the namespace table. I will work on doc on what operators need to do to effect repair in a while, after I've backfilled more of the hbck2 stuff. A test over in HBASE-21191 shows that scheduling an hbase:meta assign seems to do the job of getting the cluster up again (missing is how to do this from the outside, from a client... working on that next).
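A holding pattern of the kind described in the comment above could be sketched as follows. This is not the HBASE-21191 patch; the function name, parameters, and log text are invented for illustration, and `is_online` stands in for whatever check the Master actually performs.

```python
# Hypothetical sketch of a "holding pattern": keep complaining instead of exiting.
import time


def wait_for_region_online(is_online, region, poll_seconds=1.0, max_polls=None,
                           log=print, sleep=time.sleep):
    """Poll until is_online() returns True, logging a request for operator help.

    Returns True once the region is online, or False if max_polls is reached.
    With max_polls=None the loop waits indefinitely.
    """
    polls = 0
    while not is_online():
        log(region + " is not online; operator intervention needed "
            "(for example, schedule an assign)")
        polls += 1
        if max_polls is not None and polls >= max_polls:
            return False
        sleep(poll_seconds)
    return True
```

The point of the design is that the process stays up while it waits, so an external tool still has something to talk to when it schedules the assign.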
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16613009#comment-16613009 ]

Allan Yang commented on HBASE-21035:

Since [~stack] is running into this problem too, I want to mention that I still think scheduling an SCP for a splitting server is doable (as in my patch)... I have already committed this patch in our internal version. We want to put HBase 2.0 into production; we can't afford to resolve this dilemma using external tools like HBCK2 (which is not available yet...). It won't cause any trouble if multiple SCPs are submitted for a single server (if you are concerned about safe fencing, [~Apache9]). And I still can't think of any data loss or meta inconsistency case resulting from this. And actually, when the master is starting, [~Apache9] also uses a similar way to check if there is already an InitMetaProcedure, scheduling one if not.
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16613263#comment-16613263 ]

Duo Zhang commented on HBASE-21035:
---

But for InitMetaProcedure, it could happen in the normal case, if the master crashes after storing the InitMetaProcedure but before actually executing it. For SCP, it could not happen in the normal case. This is the main difference.

We cannot program for unknown issues. Removing all the procedure wals is only one possible way to produce this problem, and you do not know whether it can also be caused by other problems. Instead, we should provide tools for operators to manually recover from the disaster.

We also have the same concern for hbase 2.x: HBCK1 is broken and cannot be used, but HBCK2 is still not available. If there are bugs in the code which cause critical problems for a cluster, we have no way to get the cluster back. It is not a good idea to tell users 'if a procedure is stuck, then please just clean up your cluster and set up a new one'... So let's start helping on HBCK2?
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16614231#comment-16614231 ]

Allan Yang commented on HBASE-21035:

{quote}
So let's start helping on HBCK2?
{quote}
Sure!
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16614382#comment-16614382 ]

stack commented on HBASE-21035:
---

I've been trying to make basic progress on HBCK2. I pushed up an HBCK2 tool that can call our only Hbck method over in the hbase-operator-tools project: https://github.com/apache/hbase-operator-tools/commit/0cf0e0ecf2d4a33522e0e273f9310f11aa2eaee6. It is missing so much -- tests, how to package, how to pass in a pointer to the cluster to fix, doc, etc. -- but I'm working on it.

Next is adding assign and bulk assign to the Hbck Service. This Hbck assign will differ from the Admin assign in that it should work even though the Master is 'initializing' (Admin assign fails because we check master state before we do anything -- which makes it so we can't schedule a meta assign if meta is offline). The hbck assign also bypasses stuff like calling CPs.

I also want bulk assign -- i.e. passing a thousand regions at a time to assign -- because when doing repairs, clusters will probably be big with lots of regions in odd states. I've been running a fixup job on a cluster where I have thousands of regions in OPENING state (I removed the Master Proc WALs after crashing it...). Doing assigns one at a time on the command-line doesn't cut it... It takes 10-40 seconds per assign.
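To put the numbers above in perspective: at 10 seconds per assign, 3,000 regions (an illustrative figure, not one from the comment) would need 30,000 seconds, a bit over eight hours, if done one at a time. A bulk assign amortizes that by sending many encoded region names per call. The sketch below shows only the batching; `assign_batch` is a hypothetical callable standing in for whatever the Hbck service ends up exposing, and the batch size of 1000 just mirrors "a thousand regions at a time".

```python
# Hypothetical sketch of client-side batching for a bulk assign.
def batches(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def bulk_assign(encoded_regions, assign_batch, batch_size=1000):
    """Submit regions in batches and collect the procedure ids that come back.

    assign_batch: callable taking a list of encoded region names and returning
    a list of pids (an assumed interface, for illustration only).
    """
    pids = []
    for batch in batches(list(encoded_regions), batch_size):
        pids.extend(assign_batch(batch))
    return pids
```

Collecting the returned pids matters for repair work, since the operator needs something to track while thousands of assigns run.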
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16614558#comment-16614558 ]

Duo Zhang commented on HBASE-21035:
---

We could also help on HBCK2. What is the workflow for it? GitHub based?
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16615142#comment-16615142 ]

stack commented on HBASE-21035:
---

[~Apache9] Yes. GitHub. Am taking the opportunity to do some experiments. Let's play with GitHub'ing it. HBASE-21169 is the covering issue. I'll fill in some outline in the attached doc for the hbck2 'form'. Would love commentary/push-back. There is little there at the moment.
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616711#comment-16616711 ]

Duo Zhang commented on HBASE-21035:
---

OK, I met the same problem as you, sir [~stack]. When testing sync replication, the table had not been created (or maybe was deleted?) at the STANDBY cluster, and when transiting the peer to DA the procedure kept retrying and generated a bunch of procedure wals. I created the table and restarted the master, and then the master got stuck loading the procedures. I see that we call recoverLease on every wal file, sequentially, in one thread; this is really slow... Maybe we need to find a way to speed it up...
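One possible direction for the speed-up raised above is to recover leases concurrently rather than in a single thread. The sketch below only illustrates that idea under stated assumptions: `recover_lease` is a hypothetical callable, the worker count is arbitrary, and whether concurrent lease recovery is safe for procedure WALs is not established in this thread. Replay of the recovered files would still have to happen in order.

```python
# Hypothetical sketch: run lease recovery for many WAL files on a thread pool.
from concurrent.futures import ThreadPoolExecutor


def recover_all(wal_files, recover_lease, max_workers=8):
    """Call recover_lease(path) for every file concurrently.

    Returns a dict mapping each path to the callable's result. Results are
    paired with inputs in submission order, so the mapping is deterministic.
    """
    wal_files = list(wal_files)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(recover_lease, wal_files))
    return dict(zip(wal_files, results))
```

A thread pool fits here because each call is expected to spend its time waiting on the filesystem, not computing.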
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616991#comment-16616991 ]

Allan Yang commented on HBASE-21035:

Could we just not update the procedure wal if the status has not changed (if it is possible to do so)? Then there wouldn't be huge procedure wals with useless content to replay.
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618590#comment-16618590 ]

Duo Zhang commented on HBASE-21035:
---

Theoretically there are changes, since the state has been changed to WAITING_TIMEOUT and then back to RUNNABLE, and also the retry count, the timeout value, etc. But is it necessary to persist this stuff? For me, I do not think it is necessary to restore the WAITING_TIMEOUT state; if there is a crash and restart, then just start as if there had been no failure before. For example, when we assign a region and fail, we change the procedure to WAITING_TIMEOUT state and want to sleep for 30 seconds. Then the master crashes and restarts; I think it is OK that we reschedule the TRSP immediately after restarting? What do you guys think?
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618934#comment-16618934 ]

Allan Yang commented on HBASE-21035:

{quote}
Then the master crashes and restarts, I think it is OK that we reschedule the TRSP immediately after restarting? What do you guys think?
{quote}
Totally agree.
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619097#comment-16619097 ]

stack commented on HBASE-21035:
---

So, a filter on state changes and we only persist critical info? Worth investigating, yes.
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621413#comment-16621413 ]

Duo Zhang commented on HBASE-21035:
---

Maybe a flag to indicate whether to persist? And reset it every time.
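The flag idea above can be illustrated with a toy model. Everything in it is hypothetical (class names, the `skip_persistence` method, the list standing in for the procedure WAL); it only shows the shape of "skip this one write, then reset the flag" so that a retry loop bouncing between WAITING_TIMEOUT and RUNNABLE stops growing the WAL.

```python
# Toy model of a one-shot "skip persistence" flag on a procedure.
class Procedure:
    def __init__(self, name):
        self.name = name
        self.state = "RUNNABLE"
        self.attempts = 0
        self.skip_persist = False

    def skip_persistence(self):
        """Ask the executor not to write the next state update to the WAL."""
        self.skip_persist = True


class Executor:
    def __init__(self):
        self.wal = []  # stand-in for the procedure WAL

    def update(self, proc):
        """Persist the procedure unless it asked to skip; the flag resets every time."""
        if proc.skip_persist:
            proc.skip_persist = False
            return False
        self.wal.append((proc.name, proc.state, proc.attempts))
        return True
```

In this model a restart after a skipped update would reload the last persisted state and reschedule the procedure immediately, which matches the behaviour the thread considers acceptable for a retrying TRSP.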
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671049#comment-16671049 ]

Allan Yang commented on HBASE-21035:

The problem mentioned in this issue needs to be discussed again. I tried to upgrade a 1.x cluster to 2.x. First, I stopped all servers (which may leave some splitting WAL dirs). Then I started 2.x in place, but since we now rely entirely on the procedures when the master starts and don't care about the splitting WAL dirs, some data may be lost. And there are no SCPs to bring meta online. This would be a problem for people who want to upgrade their clusters from 1.x to 2.x. [~stack], [~Apache9]
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671101#comment-16671101 ]

stack commented on HBASE-21035:
---

This has been haunting me. [~elserj] raised an issue on the dev list where namespace would not deploy. Then there is the issue over in the heatmaps JIRA. There is also a messy internal issue, where we are still trying to get more details, that hints it could be a problem in update. Let me try it in the morning.
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576074#comment-16576074 ]

Duo Zhang commented on HBASE-21035:
---

I think this should be a tool in HBCK? In the normal code path, if all the procedures are lost, we do not know the best way to address the problem, and forcing meta online may introduce more problems...
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576089#comment-16576089 ]

Allan Yang commented on HBASE-21035:

{quote}
I think this should be a tool in HBCK? In the normal code path, if all the procedures are lost, we do not know the best way to address the problem, and forcing meta online may introduce more problems...
{quote}
IIRC, tools like HBCK depend on meta being online to do their scans...
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576096#comment-16576096 ]

Duo Zhang commented on HBASE-21035:
---

Anyway, I'm -1 on doing anything automatically to try to recover when the core systems are broken. Here you are removing all the procedures, so the solution may be fine, as the error cannot be recovered any more. But what if it is just a permission problem or something else? You bring meta online, mess up all the data, and cause unrecoverable data loss... For me, a crash or hang is much, much better than doing dangerous operations in code...
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576189#comment-16576189 ]

Allan Yang commented on HBASE-21035:

{quote}
You bring meta online and mess up all the data and cause unrecoverable data loss...
{quote}
I can't think of any condition where bringing meta online would cause data loss or any other unrecoverable situation.

In HBase 1.x, if something went wrong with the AssignmentManager, restarting the master fixed the problem in most cases. In some catastrophic scenarios, we could even delete all the RIT znodes and assign the regions again. As long as HDFS/ZK were healthy and the RSes could work normally, the meta region could at least come online no matter what.

But this is not the case with AMv2. All the states and procedures are persisted, so restarting the master results in the same state as before the restart (we are trying hard to ensure that...). Restarting the master won't help like it did before, and it is also hard to interfere with procedures. That means it is not easy to recover the system if there are any bugs in AMv2 (which is very likely...). In some cases we do need to delete all procedures to get a clean state for recovery, as described in the doc 'Fixing regions stuck in transition in HBase 2.0' in HBASE-19121. But if a clean start still causes the system to hang, it is hard for other fix tools like HBCK to kick in.

{quote}
For me, crash or hang is much much better than doing dangerous operations in code...
{quote}
From my point of view, it is essential that a production-ready system can recover without changing any code.
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576204#comment-16576204 ]

Allan Yang commented on HBASE-21035:

Or in short, my opinion is that we have to try our best to make sure the meta region can come online when the master is initializing; otherwise we can do nothing without the meta region. But after HBASE-20708 (which is indeed a great patch that simplifies the startup process), we obviously do not. You can try my test case: without HBASE-20708, it works.
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576229#comment-16576229 ]

Duo Zhang commented on HBASE-21035:
---

What if you lose some edits of the meta region and then bring the meta region online?

There are some assumptions in the code: the renaming of the WAL directory is done in the SCP, so we can be sure that there is an SCP if the WAL directory already ends with '-splitting'. If this is not the case, we do not know what to do. In your case you deleted all the procedures, but this is not the only possible case, right? The damage comes from outside the system, or from unexpected behavior, i.e. a serious bug in the code, so the decision should also be made outside the system.

I think we can add an admin method to allow operators to submit an SCP for a specific RS, and also add an option to HBCK that does the same thing as the patch here: scan the WAL directory and re-submit SCPs for all the RSes whose WAL directories end with '-splitting'. But I'm strongly against adding this piece of code to the normal master startup path. It is really dangerous.
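As an illustration of the HBCK option proposed in the comment above (scan the WAL directory and pick out the RSes whose directory ends with '-splitting'), here is a minimal sketch. It is not HBase code: `servers_needing_scp` is a hypothetical helper, the directory names are made-up examples in an assumed `host,port,startcode` form, and actually submitting an SCP is left out because the thread does not define that API.

```python
from typing import Iterable, List

# Suffix named in the comment above; everything else here is illustrative.
SPLITTING_SUFFIX = "-splitting"


def servers_needing_scp(wal_dir_names: Iterable[str]) -> List[str]:
    """Return the server names whose WAL directory ends with '-splitting'.

    Hypothetical helper: takes plain directory names (not filesystem paths)
    and strips the suffix to recover the server name.
    """
    servers = []
    for name in wal_dir_names:
        if name.endswith(SPLITTING_SUFFIX):
            servers.append(name[: -len(SPLITTING_SUFFIX)])
    return sorted(servers)


if __name__ == "__main__":
    # Made-up directory listing for demonstration only.
    listing = [
        "rs1.example.com,16020,1533800000000-splitting",
        "rs2.example.com,16020,1533800000001",
    ]
    for server in servers_needing_scp(listing):
        print("operator would re-submit an SCP for:", server)
```

Keeping the scan a pure function over directory names means the decision to act stays with the operator, which is the point argued in the comment.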
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576233#comment-16576233 ]

Hadoop QA commented on HBASE-21035:
---

| (x) *-1 overall* |

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 13s | Docker mode activated. |
|| || || || Prechecks ||
| +1 | hbaseanti | 0m 0s | Patch does not have any anti-patterns. |
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| || || || branch-2.0 Compile Tests ||
| +1 | mvninstall | 3m 36s | branch-2.0 passed |
| +1 | compile | 1m 43s | branch-2.0 passed |
| +1 | checkstyle | 1m 12s | branch-2.0 passed |
| +1 | shadedjars | 4m 5s | branch has no errors when building our shaded downstream artifacts. |
| +1 | findbugs | 2m 2s | branch-2.0 passed |
| +1 | javadoc | 0m 29s | branch-2.0 passed |
|| || || || Patch Compile Tests ||
| +1 | mvninstall | 3m 34s | the patch passed |
| +1 | compile | 1m 42s | the patch passed |
| +1 | javac | 1m 42s | the patch passed |
| -1 | checkstyle | 1m 11s | hbase-server: The patch generated 5 new + 176 unchanged - 0 fixed = 181 total (was 176) |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| -1 | shadedjars | 2m 57s | patch has 10 errors when building our shaded downstream artifacts. |
| +1 | hadoopcheck | 11m 10s | Patch does not cause any errors with Hadoop 2.6.5 2.7.4 or 3.0.0. |
| +1 | findbugs | 2m 11s | the patch passed |
| -1 | javadoc | 0m 28s | hbase-server generated 1 new + 1 unchanged - 0 fixed = 2 total (was 1) |
|| || || || Other Tests ||
| +1 | unit | 104m 40s | hbase-server in the patch passed. |
| -1 | asflicense | 0m 18s | The patch generated 1 ASF License warnings. |
| | | 142m 0s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:6f01af0 |
| JIRA Issue | HBASE-21035 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12935119/HBASE-21035.branch-2.0.001.patch |
| Optional Tests | asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile |
| uname | Linux 0a4799b4bac1 3.13.0-143-generic #192-Ubuntu SMP Tue Feb 27 10:45:36 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh |
| git revision | branch-2.0 / 7ee4aa459c |
| maven | version: Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_171 |
| findbugs | v3.1.0-RC3 |
| checkstyle | https://builds.apache.org/job/PreCommit-HBASE-Build/14001/artifact/patchprocess/diff-checkstyle-hbase-server.txt |
| shadedjars | https://builds.apache.org/job/PreCommit-HBASE-Build/14001/artifact/patchprocess/patch-shadedjars.txt |
| javadoc | https://builds.apac
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576240#comment-16576240 ]

Allan Yang commented on HBASE-21035:

If scheduling an SCP for servers with '-splitting' dirs is not a good idea, then we can work around it. Before HBASE-20708, there was a method called processofflineServersWithOnlineRegions which scheduled an assign procedure for any regions on a dead server. After HBASE-20708 it is gone (replaced by processOfflineRegions). Can we just bring the logic in processofflineServersWithOnlineRegions back? What I want is the same behavior with or without HBASE-20708.
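The pre-HBASE-20708 behavior described in the comment above (schedule an assign for any region whose last known server is dead) can be sketched roughly as follows. This is an illustration only, not the logic of the HBase method named above: `regions_to_assign`, the region-to-server map, and the server names are all hypothetical, and where that map would come from is left as an assumption.

```python
from typing import Dict, List, Set


def regions_to_assign(
    last_known_server: Dict[str, str], live_servers: Set[str]
) -> List[str]:
    """Return encoded region names whose last known server is not live.

    Hypothetical helper: `last_known_server` maps an encoded region name to
    the server that last hosted it; `live_servers` is the set of servers
    currently considered alive.
    """
    return sorted(
        region
        for region, server in last_known_server.items()
        if server not in live_servers
    )


if __name__ == "__main__":
    # 1588230740 is the encoded name of the hbase:meta region; the other
    # name and both server names are made-up examples.
    locations = {
        "1588230740": "rs1.example.com,16020,1533800000000",
        "de00010733901a05f5a2a3a382e27dd4": "rs2.example.com,16020,1533800000001",
    }
    live = {"rs2.example.com,16020,1533800000001"}
    print(regions_to_assign(locations, live))
```

In this made-up run only the meta region sits on a dead server, so it is the only region that would get an assign scheduled.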
[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost
[ https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576279#comment-16576279 ]

Duo Zhang commented on HBASE-21035:
---

The goal of HBASE-20708 is to remove the unnecessary scheduling of SCPs and also remove the usage of RecoverMetaProcedure. If you want them back, then HBASE-20708 is useless...

I still stand by my point: this is not the normal case, as it breaks our assumptions. We can provide tools for operators to override these errors, and the operators will take their own risk, but we should not try to address them in the normal code path.

And if you think the procedure WAL is not stable and may lead to corrupted files, please start making it stable. We may also introduce something like a backup for the procedure WALs to protect against manual damage.