[jira] [Commented] (HBASE-19457) Debugging flaky TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits

2017-12-26 Thread Appy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16303992#comment-16303992
 ] 

Appy commented on HBASE-19457:
--

Fixed now. But quoting the comment from the other JIRA 
(https://issues.apache.org/jira/browse/HBASE-19530?focusedCommentId=16293663=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16293663).

{quote}
Btw, I predict that this patch might "fix" TestTruncateTableProcedure.
I put "fix" in quotes because those failures are the result of two assumptions 
that together cause a failure (region state = null --> assume OFFLINE, table 
state = null --> assume ENABLED).
This patch will break the first assumption, so the test might start passing.
But we still need to address the second one, and that will be done in 
HBASE-19529.
{quote}

Closing this one.

> Debugging flaky 
> TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits
> ---
>
> Key: HBASE-19457
> URL: https://issues.apache.org/jira/browse/HBASE-19457
> Project: HBase
>  Issue Type: Bug
>Reporter: Appy
>Assignee: Appy
> Attachments: HBASE-19457.master.001.patch, patch1, test-output.txt
>
>
> Trying to explain the bug in a more general way, so that an understanding of 
> ProcedureV2 is not required.
> Truncate table operation:
> - delete region states from meta
> - delete table state from meta
> - add new regions to meta with state null
> - crash
> - recovery: TableStateManager treats a table with null state as ENABLED. AM 
> treats regions with null state as OFFLINE. Combined result: AM starts 
> assigning the new regions from the incomplete truncate operation.
> Fix: Mark the table as disabled instead of deleting its state.
> 
> *patch1*
> Just added some logging to help with debugging:
> - 60s was too short a timeout, so increased it
> - Added some useful log statements



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (HBASE-19457) Debugging flaky TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits

2017-12-18 Thread Appy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295907#comment-16295907
 ] 

Appy commented on HBASE-19457:
--

Still failing, less frequently on apache infra, but much more frequently on GCE 
infra. Will take a look.



[jira] [Commented] (HBASE-19457) Debugging flaky TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits

2017-12-15 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293435#comment-16293435
 ] 

stack commented on HBASE-19457:
---

I can't find a case of three tiers of proc. Would need to try it. I don't see 
why it wouldn't work.

Yeah, it's turning up some greens now when it used to be solid red: 
https://builds.apache.org/view/H-L/view/HBase/job/HBase-Find-Flaky-Tests/lastSuccessfulBuild/artifact/dashboard.html

The new failure types -- 2 out of 5 -- seem different. As you say, let's keep 
an eye on it.

Thanks for the new JIRAs. Yeah, let's sort out state in meta.



[jira] [Commented] (HBASE-19457) Debugging flaky TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits

2017-12-15 Thread Appy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293380#comment-16293380
 ] 

Appy commented on HBASE-19457:
--

We discussed a few things; here's the summary:
- We have procs spawning subprocs, but we're not sure if there's an example 
where this tree's depth is > 2. If there is, we can change the truncate proc to 
just delete proc + create proc.

bq. As a step in truncate before we create the new? Wonder why this needs it 
and CreateTable doesnt (I think you ask this above).
Both have an ADD_TO_META step where they add regions to meta. But when we fail 
after that:
- in the case of the truncate proc, there's a table row in meta with state 
null --> it gets assumed to be enabled --> AM starts interfering
- in the case of the create proc, there's no table row at all --> AM ignores 
those new regions

New stuff:
Stack recently committed HBASE-18946, which fixes issues around the balancer 
and assignment. After it went in, we see more greens for 
TestTruncateTableProcedure in the flaky dashboard.
A word on that:
When AM interfered on recovery (see "...recovery: TableStateManager treats 
table with null state as ENABLED. AM treats regions with null state as offline. 
Combined result - AM starts assigning the new " in the description), it started 
Assign procs. But they got stuck for some reason (which I didn't debug as part 
of this test fix, since it's unrelated). His patch makes that case better.
But the real fix here should be to correctly handle state in TTP so that AM 
doesn't interfere.

We'll keep an eye on the dashboard, see the new failures, and then decide the 
verdict on this patch.

In the meantime, opened these new JIRAs to discuss the other questions: 
HBASE-19529, HBASE-19530



[jira] [Commented] (HBASE-19457) Debugging flaky TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits

2017-12-15 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292901#comment-16292901
 ] 

stack commented on HBASE-19457:
---

Loads of our Procedures are written as Procedures spawning subprocedures so we 
know it 'works'.


bq.  But AM starts its own assign procs first (in recovery phase itself before 
joining cluster, ie. other procs can run), and they too get stuck somehow.

One note is that it is OK if two assigns are scheduled. The second will notice 
the successful first one and then finish.
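
The point about duplicate assigns can be illustrated with a minimal Java sketch. This is a toy model and not the AssignmentManager: the class, the `assign` and `locationOf` methods, and the region/server names are all hypothetical, chosen only to show a second assign noticing that the first one already succeeded.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: not HBase code. All names here are hypothetical.
final class DuplicateAssignSketch {
  // region name -> server currently hosting it
  private final Map<String, String> openRegions = new HashMap<>();

  // Returns true if this call performed the assignment, false if it found the
  // region already open and finished without doing anything.
  boolean assign(String region, String server) {
    if (openRegions.containsKey(region)) {
      return false;
    }
    openRegions.put(region, server);
    return true;
  }

  String locationOf(String region) {
    return openRegions.get(region);
  }

  public static void main(String[] args) {
    DuplicateAssignSketch am = new DuplicateAssignSketch();
    System.out.println(am.assign("region-a", "server-1")); // first assign does the work
    System.out.println(am.assign("region-a", "server-2")); // second assign is a no-op
    System.out.println(am.locationOf("region-a"));         // still on server-1
  }
}
```

In this model the second call is harmless because it checks the current location before acting, which is the property described above.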

bq. But AM starts its own assign procs first (in recovery phase itself before 
joining cluster, ie. other procs can run), and they too get stuck somehow.

We could look at a log together?

bq. AM only assigns offline regions if table is marked enabled.

You are right...

  } else if (regionNode.getState() == State.OFFLINE) {
    if (isTableEnabled(regionNode.getTable())) {
      offlineRegionsToAssign.add(regionNode.getRegionInfo());
...

bq. We can easily solve the issue here by marking table as disabled.

As a step in truncate before we create the new? Wonder why this needs it and 
CreateTable doesn't (I think you ask this above).

bq. We should probably change TSM to assume tables with empty state as disabled.

Hmm. Will complicate rolling upgrade.

I like your questions at the end. They are questions about how the state 
machine should work. There should be no fuzziness around states. Plainly there 
is, going by your work here. Let's fix. New issue?






[jira] [Commented] (HBASE-19457) Debugging flaky TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits

2017-12-15 Thread Appy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292170#comment-16292170
 ] 

Appy commented on HBASE-19457:
--

bq. Dang. Why is this Truncate Table not calling DeleteTable then CreateTable 
as subprocedures? Why is it dup'ing procedure body?
Had the same thought during debugging. Maybe the answer lies in whether Pv2 is 
able to handle trees of proc-subprocs. Given the design around rootProcid, I 
think that was the goal, but I'm not sure of its status.
At this point, digging into the Pv2 design to figure that out seemed like a 
waste of time, since:
- if it's complete, we still probably shouldn't change things close to release
- if not complete, we can't invest time to finish it before the release
- it's internal stuff, and can be done in 2.1
- there are more important things than this :)

bq. If a crash puts us into a whack state such that on resumption we do the 
wrong thing, then the Procedure is not written properly.
It was not managing state correctly. I want to try this one-line patch 
because it should fix it.

bq. What is wrong about when it goes to assign? Is it that we have not finished 
editing/adding all regions to hbase:meta?
All regions are added to meta. But AM starts its own assign procs first (in 
recovery phase itself before joining cluster, ie. other procs can run), and 
they too get stuck somehow.

AM only assigns offline regions if the table is marked enabled.
It's two assumptions together that lead to the wrong behavior here.
AM assumes regions with empty state are offline. TableStateManager (TSM) 
assumes a table with empty state is enabled.
So when AM recovers, it starts assigning.
We can easily solve the issue here by marking the table as disabled.

In the end it's these three things:
- We should probably change TSM to assume tables with empty state are disabled.
- Always add new regions as CLOSED.
- And to tie up the last loose end, decide whether a region's empty (null) 
state means offline or closed.
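
As a rough illustration of why either of the first two changes would be enough on its own, here is a small Java sketch that enumerates the combinations. It is a toy model under stated assumptions: the class, enums, and `assignedOnRecovery` method are hypothetical, and the only rule encoded is the one from this thread, that AM assigns a region on recovery only if the table is considered enabled and the region is considered offline.

```java
// Toy model only: not HBase code. All names here are hypothetical.
final class NullStateDefaultsSketch {
  enum TableState { ENABLED, DISABLED }
  enum RegionState { OFFLINE, CLOSED }

  // True if recovery would assign a region of a half-truncated table, given
  // how a null table state is read and what state the new region is seen in.
  static boolean assignedOnRecovery(TableState nullTableMeans, RegionState regionSeenAs) {
    return nullTableMeans == TableState.ENABLED && regionSeenAs == RegionState.OFFLINE;
  }

  public static void main(String[] args) {
    // Print the outcome for every combination of the two defaults.
    for (TableState t : TableState.values()) {
      for (RegionState r : RegionState.values()) {
        System.out.printf("null table => %s, new region => %s: assigned=%b%n",
            t, r, assignedOnRecovery(t, r));
      }
    }
  }
}
```

Only the ENABLED plus OFFLINE combination leads to an assignment in this model, so changing either default breaks the chain.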





[jira] [Commented] (HBASE-19457) Debugging flaky TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits

2017-12-14 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292097#comment-16292097
 ] 

stack commented on HBASE-19457:
---

Good one Appy. There are pieces that still need paving over. Looks like you 
found one (I'm currently working on another).

When we truncate, do we delete the table and its regions from hbase:meta, or 
do we just edit state? (Looks like we delete the regions... good).

Dang. Why is this Truncate Table not calling DeleteTable then CreateTable as 
subprocedures? Why is it dup'ing procedure body?

If a crash puts us into a whack state such that on resumption we do the wrong 
thing, then the Procedure is not written properly. 

What is wrong about when it goes to assign? Is it that we have not finished 
editing/adding all regions to hbase:meta?

I've been working on Master startup. It reads meta and if it finds regions in 
OPEN state, it will reassign them, trying to retain their old locations. It 
will also assign regions that are OFFLINE which, thinking about it now, is NOT 
what we want.

Who is doing the assign of regions with empty state?

(Can talk tomorrow boss)



[jira] [Commented] (HBASE-19457) Debugging flaky TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits

2017-12-14 Thread Appy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291441#comment-16291441
 ] 

Appy commented on HBASE-19457:
--

Let me put it in a more general way, so that an understanding of ProcedureV2 
is not required.

Truncating table:
- delete region states from meta
- delete table state from meta
- add new regions to meta with state null
- crash
- recovery: TableStateManager treats a table with null state as ENABLED. AM 
treats regions with null state as OFFLINE. Combined result: AM starts 
assigning the new regions from the incomplete truncate operation.

Fix: Mark the table as disabled instead of deleting its state.
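
To make the interaction concrete, here is a minimal Java sketch of the sequence above. It is a toy model, not HBase code: the class, enums, and method names are all made up for illustration. The only things taken from this thread are the two defaults (null table state read as ENABLED, null region state read as OFFLINE) and the rule that recovery assigns offline regions of enabled tables.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: these types are NOT HBase classes.
final class TruncateRecoverySketch {
  enum TableState { ENABLED, DISABLED }
  enum RegionState { OFFLINE, CLOSED, OPEN }

  // Assumption 1 from the thread: a missing table state is read as ENABLED.
  static TableState effectiveTableState(TableState stored) {
    return stored == null ? TableState.ENABLED : stored;
  }

  // Assumption 2 from the thread: a missing region state is read as OFFLINE.
  static RegionState effectiveRegionState(RegionState stored) {
    return stored == null ? RegionState.OFFLINE : stored;
  }

  // Recovery: assign OFFLINE regions whose table is considered enabled.
  static List<String> regionsToAssign(TableState storedTableState,
      Map<String, RegionState> storedRegionStates) {
    List<String> toAssign = new ArrayList<>();
    if (effectiveTableState(storedTableState) != TableState.ENABLED) {
      return toAssign;
    }
    for (Map.Entry<String, RegionState> e : storedRegionStates.entrySet()) {
      if (effectiveRegionState(e.getValue()) == RegionState.OFFLINE) {
        toAssign.add(e.getKey());
      }
    }
    return toAssign;
  }

  // Regions written by the truncate just before the crash, with no state set.
  static Map<String, RegionState> newRegionsWithNullState() {
    Map<String, RegionState> regions = new LinkedHashMap<>();
    regions.put("region-a", null);
    regions.put("region-b", null);
    return regions;
  }

  public static void main(String[] args) {
    // Table state deleted (null): recovery wants to assign both new regions.
    System.out.println(regionsToAssign(null, newRegionsWithNullState()));
    // Table state kept as DISABLED: recovery leaves the new regions alone.
    System.out.println(regionsToAssign(TableState.DISABLED, newRegionsWithNullState()));
  }
}
```

The second call corresponds to the proposed fix: with the table state left as DISABLED, the null region states no longer trigger any assignment in this model.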




[jira] [Commented] (HBASE-19457) Debugging flaky TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits

2017-12-14 Thread Appy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291183#comment-16291183
 ] 

Appy commented on HBASE-19457:
--

Sure thing Duo. Thanks nevertheless!



[jira] [Commented] (HBASE-19457) Debugging flaky TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits

2017-12-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290793#comment-16290793
 ] 

Hadoop QA commented on HBASE-19457:
---

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue}  0m  8s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue}  0m  0s{color} | {color:blue} Findbugs executables are not available. {color} |
| {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green}  0m  0s{color} | {color:green} Patch does not have any anti-patterns. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green}  0m  0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:red}-1{color} | {color:red} test4tests {color} | {color:red}  0m  0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} |
|| || || || {color:brown} master Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  5m 37s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 53s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m  8s{color} | {color:green} master passed {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  6m  9s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 36s{color} | {color:green} master passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green}  5m 26s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green}  0m 48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green}  0m 48s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green}  1m 10s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green}  0m  0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedjars {color} | {color:green}  4m 53s{color} | {color:green} patch has no errors when building our shaded downstream artifacts. {color} |
| {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 22m 21s{color} | {color:green} Patch does not cause any errors with Hadoop 2.6.5 2.7.4 or 3.0.0-beta1. {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green}  0m 35s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 98m 27s{color} | {color:green} hbase-server in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green}  0m 17s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black}142m 42s{color} | {color:black} {color} |
\\
\\
|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:eee3b01 |
| JIRA Issue | HBASE-19457 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12902038/HBASE-19457.master.001.patch |
| Optional Tests |  asflicense  javac  javadoc  unit  findbugs  shadedjars  hadoopcheck  hbaseanti  checkstyle  compile  |
| uname | Linux 7bde35b740b5 3.13.0-133-generic #182-Ubuntu SMP Tue Sep 19 15:49:21 UTC 2017 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build@2/component/dev-support/hbase-personality.sh |
| git revision | master / 7466e64abb |
| maven | version: Apache Maven 3.5.2 (138edd61fd100ec658bfa2d307c43b76940a5d7d; 2017-10-18T07:58:13Z) |
| Default Java | 1.8.0_151 |
|  Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/10442/testReport/ |
| modules | C: hbase-server U: hbase-server |
| Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/10442/console |
| Powered by | Apache Yetus 0.6.0   http://yetus.apache.org |


This message was automatically generated.




[jira] [Commented] (HBASE-19457) Debugging flaky TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits

2017-12-14 Thread Duo Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290698#comment-16290698
 ] 

Duo Zhang commented on HBASE-19457:
---

I'm not very familiar with the current AM yet, but I know from reading the 
procedure2-related code that it is complicated...

So let's wait for the boss [~stack].



[jira] [Commented] (HBASE-19457) Debugging flaky TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits

2017-12-14 Thread Appy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290681#comment-16290681
 ] 

Appy commented on HBASE-19457:
--

Ping [~stack], [~Apache9].



[jira] [Commented] (HBASE-19457) Debugging flaky TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits

2017-12-14 Thread Appy (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-19457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290632#comment-16290632
 ] 

Appy commented on HBASE-19457:
--

After more debugging, I think I finally have a fix (sorry for being slow, just 
beginning to understand AM).

So the issue is:
We delete table's state from meta (in step [TRUNCATE_TABLE_REMOVE_FROM_META 
|https://github.com/apache/hbase/blob/7466e64abb2c68c8a0f40f6051e4b5bf550e69bd/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/TruncateTableProcedure.java#L102])
On recovery, TableStateManager#fixTableStates assumes that a missing state 
means the table is enabled. 
([here|https://github.com/apache/hbase/blob/7466e64abb2c68c8a0f40f6051e4b5bf550e69bd/hbase-server/src/main/java/org/apache/hadoop/hbase/master/TableStateManager.java#L218])
 
Later we add regions to meta and crash after that. On recovery, AM sees these 
regions, looks up the table state, finds it enabled, starts assigning them, 
and screws things up.

Simple fix here would be: don't delete the table state from meta; just let it 
remain DISABLED.
---

But CreateTableProcedure can also crash after adding regions to meta. Why 
don't we see the same issue there?
It adds region rows to meta, but does not add any row for the table. 
On recovery, when AM looks for the table state corresponding to those regions, 
TSM#getTableState() throws TableNotFoundException, which gets caught 
[here|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/TableStateManager.java#L135], etc.
The end result is that it ignores those regions.
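
The difference between the two procedures can be sketched in a few lines of Java. This is a toy model under stated assumptions, not the real TableStateManager: the class, the map standing in for meta, and the local `TableNotFoundException` are hypothetical, and treating the caught exception as "not enabled" is an assumption made to match the outcome described above.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: a stand-in for the table rows in meta, not HBase code.
final class CreateVsTruncateSketch {
  // Hypothetical local exception, named after the one mentioned in the thread.
  static final class TableNotFoundException extends Exception {
    TableNotFoundException(String table) {
      super(table);
    }
  }

  // Table name -> stored state. A present key with a null value models a
  // table row whose state was deleted.
  private final Map<String, String> tableRows = new HashMap<>();

  void putTableRow(String table, String stateOrNull) {
    tableRows.put(table, stateOrNull);
  }

  String getTableState(String table) throws TableNotFoundException {
    if (!tableRows.containsKey(table)) {
      throw new TableNotFoundException(table); // no row at all
    }
    String stored = tableRows.get(table);
    return stored == null ? "ENABLED" : stored; // missing state read as ENABLED
  }

  boolean isTableEnabled(String table) {
    try {
      return "ENABLED".equals(getTableState(table));
    } catch (TableNotFoundException e) {
      return false; // assumption: unknown table, so its regions are ignored
    }
  }

  public static void main(String[] args) {
    CreateVsTruncateSketch meta = new CreateVsTruncateSketch();
    meta.putTableRow("truncated", null); // truncate: row exists, state deleted
    // "created" never gets a row: create crashed before writing table state.
    System.out.println(meta.isTableEnabled("truncated")); // regions get assigned
    System.out.println(meta.isTableEnabled("created"));   // regions are ignored
  }
}
```

In this model the truncate case looks enabled only because the row survives with an empty state, while the create case has no row to misread.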



Some bigger questions to ponder:
1) Should we really treat a missing state column as enabled? Assuming disabled 
is probably the more conservative and better choice, since it won't screw up 
the cluster. (The only other place that deletes the state column is hbck.)
2) Shouldn't new regions always be added with state closed? (dev thread: 
http://mail-archives.apache.org/mod_mbox/hbase-dev/201712.mbox/browser)
