[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2019-11-21 Thread Michael Stack (Jira)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979442#comment-16979442
 ] 

Michael Stack commented on HBASE-21035:
---

To address more directly [~allan163]'s original description:

 "Since no one will bring meta region online."

We added 'fix' to hbck2 to address above.

'hbck2 --skip assigns 1588230740' will force assign of hbase:meta if Master 
stuck in startup waiting on hbase:meta to online AND no Procedure to assign 
hbase:meta (because for example, MasterProcWALs have all been removed or 
are corrupt). See the help usage from hbck2 for the assigns command. Will also 
work for namespace region assign (HBASE-21156).



{quote} assigns [OPTIONS] ...
   Options:
-o,--override  override ownership by another procedure
   A 'raw' assign that can be used even during Master initialization (if
   the -skip flag is specified). Skirts Coprocessors. Pass one or more
   encoded region names. 1588230740 is the hard-coded name for the
   hbase:meta region and de00010733901a05f5a2a3a382e27dd4 is an example of
   what a user-space encoded region name looks like. For example:
 $ HBCK2 assign 1588230740 de00010733901a05f5a2a3a382e27dd4
   Returns the pid(s) of the created AssignProcedure(s) or -1 if none.{quote}

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch, 
> HBASE-21035.branch-2.1.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-11-01 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672115#comment-16672115
 ] 

stack commented on HBASE-21035:
---

I tried it [~allan163]  HBASE-21425

What versions did you try?

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch, 
> HBASE-21035.branch-2.1.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-11-01 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672484#comment-16672484
 ] 

Allan Yang commented on HBASE-21035:


{quote}
I tried it Allan Yang HBASE-21425
{quote}
It's our internal version, based on branch-2.0, with some patch back ported 
from branch-2.1

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch, 
> HBASE-21035.branch-2.1.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-11-02 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673369#comment-16673369
 ] 

stack commented on HBASE-21035:
---

Thanks [~allan163]. Yours was not the issue over in HBASE-21425? You could not 
bring the meta online? hbck2 is not an option on your hbase2 version? Not yet 
perhaps?

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch, 
> HBASE-21035.branch-2.1.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-11-03 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16673987#comment-16673987
 ] 

Allan Yang commented on HBASE-21035:


[~stack], no, it is not. The reason is that I stopped RS first, leaving the 
splitting WAL dir there, andnow master 'thinks' there always a SCP for the 
splitting WAL, so no one will process these down RS dirs, and no one will bring 
any regions online.

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch, 
> HBASE-21035.branch-2.1.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-09-05 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605123#comment-16605123
 ] 

stack commented on HBASE-21035:
---

Late to the party

I've been playing with removing the procedure WALs and doing other damage to 
the cluster to see how well we recover. I ran into this issue here that 
[~allan163] talks of where there is no assign for meta region on startup; in my 
case, I'd removed the procedure WAL dirs as Allan does in his test but 
different from Allan, there was no WAL dir for the server that had been 
carrying meta; I think it'd been removed across restarts (I didn't check) but I 
was able to repro by removing the empty WAL dir manually (empty because there'd 
been a clean shutdown).

After reading the above healthy back and forth, while the system seems pretty 
robust as is -- I had trouble breaking it removing stuff -- and Allan's patch 
would catch a particular case not covered now, I agree that we need the "assign 
meta" fix-it in our hbck2 vocabulary. Let me add scheduling of a meta assign 
(and log search and recovery) as a hbck2 option to the list of fix-its we need 
in hbck2 as we discussed in person a few weeks back. 

In my investigations, it seems like we need similar for hbase:namespace table. 
It can get banjaxed similarly and if not online the cluster is a mess.

Master initialization gets stuck trying to read from meta to populate the 
TableStates. It later gets stuck trying to initialize the TableNamespaceManager 
if the namespace table is not online.

I filed HBASE-21156

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-09-08 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16608185#comment-16608185
 ] 

stack commented on HBASE-21035:
---

I'm back. 

My master is aborting after spending fours reconstructing the assignment state 
from the reading of 300+ WALs. Master then becomes active. In background there 
are procedures running and finishing... mostly SCPs.

Then my master is dying with:

2018-09-07 22:21:58,968 ERROR org.apache.hadoop.hbase.master.HMaster: * 
ABORTING master vc0207.halxg.cloudera.com,22001,1536380265734: Unhandled 
exception. Starting shutdown. *
org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after 
attempts=31, exceptions:
Fri Sep 07 22:21:58 PDT 2018, null, java.net.SocketTimeoutException: 
callTimeout=6, callDuration=69864: 
org.apache.hadoop.hbase.NotServingRegionException: hbase:meta,,1 is not online 
on vd0412.halxg.cloudera.com,22101,1536380043533

i.e. meta is not online even though the above server's SCP has completed.

This is a dirty install so what was up in zk for meta location could be long 
stale but here we have a state where no SCP's running and meta is not 
online.

I need to write the tool to insert a meta assign. It just takes 4 or 5 hours 
before I know if it is the fix for this problem. And then there is the scan of 
the hbase:namespace table next.

Thinking of waiting on all SCPs to finish before we do our first meta scan  
and if meta is still not online, then, auto-schedule the restore meta procedure 
... splitting meta logs inline and then assigning meta. Would this violate your 
principal [~Apache9]?

In other words, I need to write the restore meta procedure -- it would split 
meta logs and then do the assign of meta -- but I think we should auto-schedule 
it in the case above.

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-09-08 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16608254#comment-16608254
 ] 

Duo Zhang commented on HBASE-21035:
---

If the cluster is not in a good state and several RSes keep crashing, the 
master will hang there forever, if you need to wait until all SCPs to 
finish...And how do you determine the meta is on a stale server programmingly? 

And my concern is that, there will be races, as server crash can happen at any 
time, include the logic in the start up code path will make the code flaky...

We can have a tool to do something like this, But when to use it should be 
decided by human. Maybe we could do more checks and print something in log that 
the state seems not correct, please check XXX and XXX to see if there are 
something wrong and try XXX tool?

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-09-08 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16608285#comment-16608285
 ] 

stack commented on HBASE-21035:
---

Let me do what you suggest first. 

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-09-08 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16608297#comment-16608297
 ] 

Duo Zhang commented on HBASE-21035:
---

I think one thing we could do is that, collect the WAL directory which ends 
with the splitting suffix, and check whether there’s a SCP associated with it, 
just like what [~allan163] has done here. But instead of scheduling a SCP 
directly, we just log it and tell the operators that something may go wrong.

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-09-11 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610238#comment-16610238
 ] 

Allan Yang commented on HBASE-21035:


{quote}
i.e. meta is not online even though the above server's SCP has completed.
{quote}
[~stack], sir, I don't get it, how can SCP finish without meta online? No 
region to assign?

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-09-11 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610503#comment-16610503
 ] 

stack commented on HBASE-21035:
---

bq. stack, sir, I don't get it, how can SCP finish without meta online? No 
region to assign?

Its hard to trace why on my big cluster, but we somehow lose accounting of meta 
after a bunch of crashing and restarting of master. I've also been playing with 
the startup sequence which probably messed things up (Masters do not progress 
beyond waitForMasterActive -- they don't get to the run method it seems). 
Anyways, I can get into a state where all SCPs are done but meta is not online 
(Last night meta was in the OPENING state after all SCPs were done).

If meta is not online and we can't scan it to loadMeta, then the Master 
shutsdown after a minute. I'm working on having the Master hold before the 
first meta scan if it can't find an online meta for the operator to insert an 
assign at least (I think we might just auto-assign if all SCPs are done and 
there is still no meta online). We need the 'hold' at least. Replaying all the 
WALs can take a while. It would be frustrating to the operator watching 
hundreds of backed-up WALs replaying and then the Master exits when done.



> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-09-11 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610520#comment-16610520
 ] 

Duo Zhang commented on HBASE-21035:
---

I think [~allan163] has a valid point. For SCP, if meta is not online, i.e, we 
haven't finished the loadMeta yet for AssignmentManager, then SCP will be 
suspended. So it is a bit strange that all the SCPs are finished but the meta 
is still not online...

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-09-11 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610534#comment-16610534
 ] 

stack commented on HBASE-21035:
---

bq. I think Allan Yang has a valid point. 

I suppose they have not regions to assign. Let me try and do a better 
accounting.

Thanks boys.

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-09-11 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610550#comment-16610550
 ] 

stack commented on HBASE-21035:
---

Oh, I should have mentioned, I removed all master proc WALs at one stage (too 
many, no tooling yet to fix STUCK assigns, each start up accumulated more 
WALs). That would explain "nothing to assign" and why hbase:meta had no assign. 
Now I'm in the state where Master exits because zk has a location for meta, we 
keep trying to go there but it will never be open at the zk location.

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-09-11 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16611514#comment-16611514
 ] 

Allan Yang commented on HBASE-21035:


{quote}
Oh, I should have mentioned, I removed all master proc WALs at one stage (too 
many, no tooling yet to fix STUCK assigns, each start up accumulated more 
WALs). That would explain "nothing to assign" and why hbase:meta had no assign. 
Now I'm in the state where Master exits because zk has a location for meta, we 
keep trying to go there but it will never be open at the zk location.
{quote}
It can explain, then. You encounter the same case I have -- 'no one bring up 
the meta table if all procedure lost'.

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-09-11 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16611629#comment-16611629
 ] 

stack commented on HBASE-21035:
---

.002 is a hack on top of [~allan163] 's patch. I used it doing fixup on a 
cluster here, one where I had to remove procedure WALs because too many too 
process and when done, a bunch of state needed repair. It includes some 
miscellaneous but main thing is assign of meta and of namespace so master 
startup can continue (without these, master exits because it can't scan a meta, 
and later a namespace table).

I've also been playing around with a FixMetaProcedure that does force online. 
Tricky part is finding all the meta WALs and then doing the inline split. 
Trying to see if I should remove meta files when done and trying to figure how 
to fence off the meta if it open already.

Will be back later.

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch, 
> HBASE-21035.branch-2.1.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-09-11 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16611647#comment-16611647
 ] 

Allan Yang commented on HBASE-21035:


{quote}
 Tricky part is finding all the meta WALs and then doing the inline split
{quote}
Shouldn't we just trust the meta location in the ZK node? Just split meta log 
in that server should be enough(We don't have DLR now, so meta region won't 
suffering multiple crash).

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch, 
> HBASE-21035.branch-2.1.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-09-11 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16611658#comment-16611658
 ] 

Duo Zhang commented on HBASE-21035:
---

I think the most difficult thing here is that, how do we determine that there 
is something wrong and we should force meta online to allow hbck2 to do 
something, as it will be easily to introduce races if we deploy a 
FixingMetaProcedure but actually there is no problem...

[~stack] So the problem here is that, if we can not loadMeta, then the master 
will exit, and we can do nothing from outside? IIRC the rpc service should have 
been started, otherwise we can not assign meta. So maybe the problem here is 
that, we should make master retrying for a longer time before exiting, and add 
a new method in the rpc service, which is for hbck2 to schedule some recovery 
procedures?

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch, 
> HBASE-21035.branch-2.1.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-09-12 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612261#comment-16612261
 ] 

stack commented on HBASE-21035:
---

bq. Shouldn't we just trust the meta location in the ZK node?

Maybe I could look there first. If I find something, go with it. Otherwise, 
hunt all WAL dirs for .meta WALs. That'd be best.

I'll have to take the lease on any WALs I find if only to read them. Maybe I 
should delete them when done though I think there should be no damage done if 
they are replayed. I need to figure how to split inline rather than ask our 
splitting service to do it for us.

bq. So maybe the problem here is that, we should make master retrying for a 
longer time before exiting, and add a new method in the rpc service, which is 
for hbck2 to schedule some recovery procedures?

Was thinking on this. Yeah, let me change the isMeta method here so that rather 
than schedule the assign of meta procedure, instead I drop logs asking the 
operator to schedule a meta assign and perhaps a procedure that does the 
recover of meta WALs.  Then do the same for namespace (Namespace needs to go 
away).

Let me try it.

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch, 
> HBASE-21035.branch-2.1.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-09-12 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16612645#comment-16612645
 ] 

stack commented on HBASE-21035:
---

I put up a patch on HBASE-21191 that steals [~allan163]'s patch from here. It 
has master go into a holding pattern complaining that hbase:meta is not online 
asking for operator intervention. Ditto for namespace table. Will work on doc 
on what operators need to do to effect repair in a while after I've backfilled 
more of the hbck2 stuff. A test over in HBASE-21191 shows that scheduling an 
hbase:meta assign seems to do the job getting the cluster up again (missing is 
how to do this from the outside, from a client... working on that next).

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch, 
> HBASE-21035.branch-2.1.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-09-12 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16613009#comment-16613009
 ] 

Allan Yang commented on HBASE-21035:


Since [~stack] is running into this problem too, I still want to mention that I 
still think scheduling a SCP for a splitting server is doable(as in my 
patch)... I have already commit this patch in our internal version. We want to 
make HBase2.0 in to production, we can't afford to resolve this dilemma using 
external tools like HBCK2(which is not available yet...)  
It won't cause any trouble if multiple SCP was submitted for a single server(If 
you are concern about safe fence [~Apache9] ).  And I still can't think of any 
data loss case or meta inconsistency case resulting of this. And actually, when 
master starting, [~Apache9] also use a similar way to check if there is already 
a InitMetaProcedure, scheduling one if not.


> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch, 
> HBASE-21035.branch-2.1.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-09-13 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16613263#comment-16613263
 ] 

Duo Zhang commented on HBASE-21035:
---

But for InitMetaProcedure, it could happen in normal case, if master crashes 
after storing the InitMetaProcedure but before actually executing it. But for 
SCP, it could not happen in normal case. This is the main difference. We can 
not program for unknown issues. Removing all the procedure wals is only one 
possible way to produce this problem, but you do not know if this can also be 
caused by other problems. Instead, we should provide tools for operator to 
manually recover the disaster.

We also have the same concern for hbase 2.x, that HBCK1 is broken and can not 
be used, but HBCK2 is still not available. If there are bugs in code which 
causes critical problems for a cluster, we have no way to get the cluster back. 
It is not a good idea to tell users that 'if a procedure is stuck, then please 
just cleanup your cluster and setup a new one'...

So let's start helping on HBCK2? 

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch, 
> HBASE-21035.branch-2.1.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-09-13 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16614231#comment-16614231
 ] 

Allan Yang commented on HBASE-21035:


{quote}
So let's start helping on HBCK2?
{quote}
Sure!

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch, 
> HBASE-21035.branch-2.1.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-09-13 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16614382#comment-16614382
 ] 

stack commented on HBASE-21035:
---

I've been trying to make basic progress on HBCK2. I pushed up an HBCK2 tool 
that can call our only Hbck method over in the hbase-operator-tools project: 
https://github.com/apache/hbase-operator-tools/commit/0cf0e0ecf2d4a33522e0e273f9310f11aa2eaee6.
 It is missing so much -- test, how to package, how to pass in pointer to the 
cluster to fix, doc., etc., but I'm working on it.

Next is adding assign and bulk assign to Hbck Service. This Hbck assign will be 
different to Admin Assign in that it should work even though the Master is 
'initializing' (Admin assign fails because we check master state before we do 
anything -- which makes it so can't schedule meta assign if it offlined). The 
hbck assign bypass stuff like calling CPs too. I also want bulk assign -- i.e. 
passing a thousand regions at a time to assign -- because when doing repairs, 
clusters will probably be big with lots of regions in odd states. I've been 
running a fixup job on a cluster where I have thousands of regions in OPENING 
state (I removed the Master WAL Procs after crashing it... ). Doing assigns one 
at a time on the command-line doesn't cut it... It takes from 10-40 seconds per 
assign.

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch, 
> HBASE-21035.branch-2.1.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-09-14 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16614558#comment-16614558
 ] 

Duo Zhang commented on HBASE-21035:
---

We could also help on HBCK2. What is the working flow for it? Github based?

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch, 
> HBASE-21035.branch-2.1.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-09-14 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16615142#comment-16615142
 ] 

stack commented on HBASE-21035:
---

[~Apache9] Yes. Github. Am taking the opportunity to do some experiments. Lets 
play with githhub'ing it. HBASE-21169 is the covering issue. I'll fill in some 
outline in attached doc for the hbck2 'form'. Would love commentary/push-back. 
There is little there at the moment.

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch, 
> HBASE-21035.branch-2.1.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-09-16 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616711#comment-16616711
 ] 

Duo Zhang commented on HBASE-21035:
---

OK I met the same problem with you sir [~stack]. When testing sync replication, 
the table has not been created(or may be deleted?) at the STANDBY cluster and 
when transiting the peer to DA the procedure keeps retrying and generated bunch 
of procedure wals. I created the table and restart the master, and then the 
master is stuck in loading the procedures. I see that, we call recoverLease on 
every wal file, sequentially, by one thread, this is really slow... Maybe we 
need to find a way to speed up it...

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch, 
> HBASE-21035.branch-2.1.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-09-16 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16616991#comment-16616991
 ] 

Allan Yang commented on HBASE-21035:


Could we just do not update procedure wal if the status is not changed(if 
possible to do so)? So that there wouldn't be huge procedure wals with useless 
contents to replay.

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch, 
> HBASE-21035.branch-2.1.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-09-18 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618590#comment-16618590
 ] 

Duo Zhang commented on HBASE-21035:
---

Theoretically there are changes, since the state has been changed to 
WAITING_TIMEOUT, and then back to RUNNABLE, and also the retrying count, the 
timeout value, etc. But is it necessary to persist these stuffs? For me, I do 
not think it is necessary to restore the WAITING_TIMEOUT state, if there is a 
crash and restart, then just starts like there is no failure before. For 
example, when we assign a region and fail, we change the procedure to 
WAITING_TIMEOUT state and want to sleep for 30 seconds. Then the master crashes 
and restarts, I think it is OK that we reschedule the TRSP immediately after 
restarting? What do you guys think?

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch, 
> HBASE-21035.branch-2.1.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-09-18 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16618934#comment-16618934
 ] 

Allan Yang commented on HBASE-21035:


{quote}
Then the master crashes and restarts, I think it is OK that we reschedule the 
TRSP immediately after restarting? What do you guys think?
{quote}
Totally agree。

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch, 
> HBASE-21035.branch-2.1.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-09-18 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16619097#comment-16619097
 ] 

stack commented on HBASE-21035:
---

So, a filter on state changes and we only persist critical info? Worth 
investigating, yes.

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch, 
> HBASE-21035.branch-2.1.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-09-19 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621413#comment-16621413
 ] 

Duo Zhang commented on HBASE-21035:
---

Maybe a flag to indicate whether to persist? And reset it every time.

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch, 
> HBASE-21035.branch-2.1.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-10-31 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671049#comment-16671049
 ] 

Allan Yang commented on HBASE-21035:


The problem mentioned in this issue need to be discussed again. 
I tried to update a 1.x cluster to 2.x. Firstly, I stopped all servers(may 
leave some splitting wal dirs). Then I started 2.x in place, but since now when 
master starting,  we totally count on the procedures, and won't care about the 
splitting WAL dirs, some data may loss. And there is no SCPs to bring meta 
online.
This would be a problem for people who want to upgrade their clusters from 1.x 
to 2.x. [~stack],[~Apache9]

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch, 
> HBASE-21035.branch-2.1.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-10-31 Thread stack (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16671101#comment-16671101
 ] 

stack commented on HBASE-21035:
---

This has been haunting me. [~elserj] had an issue he hoisted up on to the dev 
list where namespace would not deploy.  Then there is the issue over in the 
heatmaps JIRA. There is a messy internal issue where we are trying to get more 
details that hints that it could be a prob. in update. Let me try it in morning.

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch, 
> HBASE-21035.branch-2.1.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-08-10 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576074#comment-16576074
 ] 

Duo Zhang commented on HBASE-21035:
---

I think this should a tool in HBCK? As in the normal code path, if all the 
procedures are lost, we do not know which is the best way to address this 
problem, force online the meta may introduce more problems...

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-08-10 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576089#comment-16576089
 ] 

Allan Yang commented on HBASE-21035:


{quote}
I think this should a tool in HBCK? As in the normal code path, if all the 
procedures are lost, we do not know which is the best way to address this 
problem, force online the meta may introduce more problems...
{quote}
IIRC, tools like HBCK are depending on meta online to do scan...

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-08-10 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576096#comment-16576096
 ] 

Duo Zhang commented on HBASE-21035:
---

Anyway I'm -1 on doing anything automatically to try to recover if the core 
systems are broken. Here you are removing all the procedures so the solution 
maybe fine, as the error can not be recovered any more. But what if it is just 
a permission problem or something else? You bring meta online and mess up all 
the data and cause unrecoverable data loss...

For me, crash or hang is much much better than doing dangerous operations in 
code...

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-08-10 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576189#comment-16576189
 ] 

Allan Yang commented on HBASE-21035:


{quote}
You bring meta online and mess up all the data and cause unrecoverable data 
loss...
{quote}
I can't think of any condition that bringing meta online will cause data loss 
or any other unrecoverable cases…

In HBase-1.x, if something wrong with the AssignmentManager, restarting the 
master can fix the dilemma in most cases. In some catastrophic scenario, we can 
even delete all the RIT Znodes and assign them again. Since if only HDFS/ZK is 
normal and RS can work normally, the meta region can online at least no matter 
what.

But, this is not the case with AMv2,  All the states and procedures are 
persisted, restarting master will result in the same state before restarting 
(we are trying hard to ensure it...). Restarting master won't help like before, 
and also it is hard to interfere with procedures. That means we are not easy to 
recover the system if there is any bugs in AMv2(which is very likely...).

In some cases, we indeed need to delete all procedures making it clean for 
recovering. As it addressed in a doc('Fixing regions stuck in transition in 
HBase 2.0
') in HBASE-19121.
But, if a clean start still causing the system to hang, it is hard to let other 
fix tools like HBCK to kick in.

{quote}
For me, crash or hang is much much better than doing dangerous operations in 
code...
{quote}
>From my point of view, It is very essential that a production ready system 
>that can recover without change any code.  


> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-08-10 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576204#comment-16576204
 ] 

Allan Yang commented on HBASE-21035:


Or in short, my opinion is that, we have to try our best to make sure meta 
region can online when master initing, otherwise we can do nothing without meta 
region. But after HBASE-20708(which is indeed a great patch that simplify the 
start up process ), we are obviously not. You can try my test case, without 
HBASE-20708, it can work.

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-08-10 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576229#comment-16576229
 ] 

Duo Zhang commented on HBASE-21035:
---

What if you lose some edits of meta region and then bring the meta region 
online?

There are some assumption in the code, the renaming of wal directory is done in 
SCP so we can make sure that there is a SCP if the wal directory is already 
ended with '-splitting'. If this is not the case, we do not know what to do. In 
your case it is that you deleted all the procedures, but this is not the only 
possible case right? The damage is made from outside the system, or from an 
expected behavior, i.e, a serious bug in the code, so the decision should also 
be done outside the system.

I think we can add an admin method to allow operators to submit a SCP for a 
special RS, and also add an option to HBCK, which does the same thing in the 
patch here, scan the wal directory, re-submitting SCP for all the RSes which 
wal directories are ended with '-splitting'. But I'm strongly against adding 
this piece of code in normal the master startup path. It is really dangerous.

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-08-10 Thread Hadoop QA (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576233#comment-16576233
 ] 

Hadoop QA commented on HBASE-21035:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m 
13s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  
0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  
0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green}  0m 
 0s{color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
|| || || || {color:brown} branch-2.0 Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  3m 
36s{color} | {color:green} branch-2.0 passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
43s{color} | {color:green} branch-2.0 passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 
12s{color} | {color:green} branch-2.0 passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 
 5s{color} | {color:green} branch has no errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m  
2s{color} | {color:green} branch-2.0 passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 
29s{color} | {color:green} branch-2.0 passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  3m 
34s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  1m 
42s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  1m 
42s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red}  1m 
11s{color} | {color:red} hbase-server: The patch generated 5 new + 176 
unchanged - 0 fixed = 181 total (was 176) {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m 
 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:red}-1{color} | {color:red} shadedjars {color} | {color:red}  2m 
57s{color} | {color:red} patch has 10 errors when building our shaded 
downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 
11m 10s{color} | {color:green} Patch does not cause any errors with Hadoop 
2.6.5 2.7.4 or 3.0.0. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green}  2m 
11s{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} javadoc {color} | {color:red}  0m 
28s{color} | {color:red} hbase-server generated 1 new + 1 unchanged - 0 fixed = 
2 total (was 1) {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green}104m 
40s{color} | {color:green} hbase-server in the patch passed. {color} |
| {color:red}-1{color} | {color:red} asflicense {color} | {color:red}  0m 
18s{color} | {color:red} The patch generated 1 ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}142m  0s{color} | 
{color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:6f01af0 |
| JIRA Issue | HBASE-21035 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12935119/HBASE-21035.branch-2.0.001.patch
 |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  shadedjars  
hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux 0a4799b4bac1 3.13.0-143-generic #192-Ubuntu SMP Tue Feb 27 
10:45:36 UTC 2018 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/component/dev-support/hbase-personality.sh
 |
| git revision | branch-2.0 / 7ee4aa459c |
| maven | version: Apache Maven 3.5.4 
(1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T18:33:14Z) |
| Default Java | 1.8.0_171 |
| findbugs | v3.1.0-RC3 |
| checkstyle | 
https://builds.apache.org/job/PreCommit-HBASE-Build/14001/artifact/patchprocess/diff-checkstyle-hbase-server.txt
 |
| shadedjars | 
https://builds.apache.org/job/PreCommit-HBASE-Build/14001/artifact/patchprocess/patch-shadedjars.txt
 |
| javadoc | 
https://builds.apac

[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-08-10 Thread Allan Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576240#comment-16576240
 ] 

Allan Yang commented on HBASE-21035:


If scheduling a SCP for servers with '-splitting' is not a good idea, then we 
can go around. Before HBASE-20708, there is a method called 
processofflineServersWithOnlineRegions which will schedule a assign procedure 
for any regions on a dead server. But after HBASE-20708 there isn't( replaced 
by processOfflineRegions). Can we just bring the logic in 
processofflineServersWithOnlineRegions back? What I want is the same behave w/ 
or wo/ HBASE-20708.

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (HBASE-21035) Meta Table should be able to online even if all procedures are lost

2018-08-10 Thread Duo Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/HBASE-21035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576279#comment-16576279
 ] 

Duo Zhang commented on HBASE-21035:
---

The goal for HBASE-20708 is to remove the unnecessary scheduling for SCPs, and 
also remove the usage of RecoverMetaProcedure, if you want to them back then 
HBASE-20708 is useless...

I still stand my point, this is not the normal case as it breaks our 
assumptions, we can provide tools for operators to override these errors, the 
operators will take their own risk, but we should not try to address them in 
the normal code path.

And if you think the procedure wal is not stable which may lead to corruption 
files, please start to make it stable. And also we may introduce something like 
a backup for the procedure wals to prevent manually damages.

> Meta Table should be able to online even if all procedures are lost
> ---
>
> Key: HBASE-21035
> URL: https://issues.apache.org/jira/browse/HBASE-21035
> Project: HBase
>  Issue Type: Sub-task
>Affects Versions: 2.1.0
>Reporter: Allan Yang
>Assignee: Allan Yang
>Priority: Major
> Attachments: HBASE-21035.branch-2.0.001.patch
>
>
> After HBASE-20708, we changed the way we init after master starts. It will 
> only check WAL dirs and compare to Zookeeper RS nodes to decide which server 
> need to expire. For servers which's dir is ending with 'SPLITTING', we assure 
> that there will be a SCP for it.
> But, if the server with the meta region crashed before master restarts, and 
> if all the procedure wals are lost (due to bug, or deleted manually, 
> whatever), the new restarted master will be stuck when initing. Since no one 
> will bring meta region online.
> Although it is an anomaly case, but I think no matter what happens, we need 
> to online meta region. Otherwise, we are sitting ducks, noting can be done.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)