[jira] [Commented] (HBASE-19457) Debugging flaky TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits
[ https://issues.apache.org/jira/browse/HBASE-19457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16303992#comment-16303992 ] Appy commented on HBASE-19457: -- Fixed now. But quoting the comment from other jira (https://issues.apache.org/jira/browse/HBASE-19530?focusedCommentId=16293663=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16293663). {quote} Btw, i predict that this patch might make "fix" TestTruncateTableProcedure. I quote fix because, those failures are result of two assumption collectively resulting in a failure (region state = null --> assume OFFLINE, table state = null --> assume ENABLED). This will break the first one and test might start passing. But we still need to address the second one, and that will be done in HBASE-19529. {quote} Closing this one. > Debugging flaky > TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits > --- > > Key: HBASE-19457 > URL: https://issues.apache.org/jira/browse/HBASE-19457 > Project: HBase > Issue Type: Bug >Reporter: Appy >Assignee: Appy > Attachments: HBASE-19457.master.001.patch, patch1, test-output.txt > > > Trying to explain the bug in a more general way where understanding of > ProcedureV2 is not required. > Truncating table operation: > > delete region states from meta > delete table state from meta > > add new regions to meta with state null. > crash > recovery: TableStateManager treats table with null state as ENABLED. AM > treats regions with null state as offline. Combined result - AM starts > assigning the new regions from incomplete truncate operation. > Fix: Mark table as disabled instead of deleting it's state. > > *patch1* > Just added some logging to help with debugging: > - 60s was too less time, increased timeout > - Added some useful log statements -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-19457) Debugging flaky TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits
[ https://issues.apache.org/jira/browse/HBASE-19457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295907#comment-16295907 ] Appy commented on HBASE-19457: -- Still failing, less frequently on apache infra, but much more frequently on GCE infra. Will take a look. > Debugging flaky > TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits > --- > > Key: HBASE-19457 > URL: https://issues.apache.org/jira/browse/HBASE-19457 > Project: HBase > Issue Type: Bug >Reporter: Appy >Assignee: Appy > Attachments: HBASE-19457.master.001.patch, patch1, test-output.txt > > > Trying to explain the bug in a more general way where understanding of > ProcedureV2 is not required. > Truncating table operation: > > delete region states from meta > delete table state from meta > > add new regions to meta with state null. > crash > recovery: TableStateManager treats table with null state as ENABLED. AM > treats regions with null state as offline. Combined result - AM starts > assigning the new regions from incomplete truncate operation. > Fix: Mark table as disabled instead of deleting it's state. > > *patch1* > Just added some logging to help with debugging: > - 60s was too less time, increased timeout > - Added some useful log statements -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-19457) Debugging flaky TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits
[ https://issues.apache.org/jira/browse/HBASE-19457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293435#comment-16293435 ] stack commented on HBASE-19457: --- I can't find a case of three tiers of proc. Would need to try it. I don't see why not. Yeah, its turning up some greens now when it used to be solid red: https://builds.apache.org/view/H-L/view/HBase/job/HBase-Find-Flaky-Tests/lastSuccessfulBuild/artifact/dashboard.html The new failure types -- 2 out 5 -- seem different. As you say, lets keep an eye on it. Thanks for new JIRAs. Yeah, lets sort out state in meta. > Debugging flaky > TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits > --- > > Key: HBASE-19457 > URL: https://issues.apache.org/jira/browse/HBASE-19457 > Project: HBase > Issue Type: Bug >Reporter: Appy >Assignee: Appy > Attachments: HBASE-19457.master.001.patch, patch1, test-output.txt > > > Trying to explain the bug in a more general way where understanding of > ProcedureV2 is not required. > Truncating table operation: > > delete region states from meta > delete table state from meta > > add new regions to meta with state null. > crash > recovery: TableStateManager treats table with null state as ENABLED. AM > treats regions with null state as offline. Combined result - AM starts > assigning the new regions from incomplete truncate operation. > Fix: Mark table as disabled instead of deleting it's state. > > *patch1* > Just added some logging to help with debugging: > - 60s was too less time, increased timeout > - Added some useful log statements -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-19457) Debugging flaky TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits
[ https://issues.apache.org/jira/browse/HBASE-19457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16293380#comment-16293380 ] Appy commented on HBASE-19457: -- We discussed few things, here's the summary: - we have procs spawing subprocs, but not sure if there's an example where this tree's depth > 2. If yes, we can change truncate proc to just delete proc + create proc. bq. As a step in truncate before we create the new? Wonder why this needs it and CreateTable doesnt (I think you ask this above). Both have ADD_TO_META step where they add regions to meta. But when we fail after that: in case of truncate proc, there's a table row in meta with state null --> gets assumed as enabled --> AM starts interfering in case of create proc, there's no table row at all --> AM ignores those new regions New stuff: Stack recently committed HBASE-18946 which fixes issues around balancer and assigning. After it went in, we see more greens for TestTruncateTableProcedure in flaky dashboard. A word on that: When AM interfered on recovery (see "...recovery: TableStateManager treats table with null state as ENABLED. AM treats regions with null state as offline. Combined result - AM starts assigning the new " in description), it started Assign procs. But they got stuck for some reason (which i didn't care to debug as part of this test fix since it's unrelated). His patch makes that case better. But the real fix here should be to correctly handle state in TTP so that AM doesn't interfere. We'll keep an eye on dashboard, see the new failures, and then decide verdict on this patch. In meantime opened this new jira to discuss other questions HBASE-19529, HBASE-19530 > Debugging flaky > TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits > --- > > Key: HBASE-19457 > URL: https://issues.apache.org/jira/browse/HBASE-19457 > Project: HBase > Issue Type: Bug >Reporter: Appy >Assignee: Appy > Attachments: HBASE-19457.master.001.patch, patch1, test-output.txt > > > Trying to explain the bug in a more general way where understanding of > ProcedureV2 is not required. > Truncating table operation: > > delete region states from meta > delete table state from meta > > add new regions to meta with state null. > crash > recovery: TableStateManager treats table with null state as ENABLED. AM > treats regions with null state as offline. Combined result - AM starts > assigning the new regions from incomplete truncate operation. > Fix: Mark table as disabled instead of deleting it's state. > > *patch1* > Just added some logging to help with debugging: > - 60s was too less time, increased timeout > - Added some useful log statements -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-19457) Debugging flaky TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits
[ https://issues.apache.org/jira/browse/HBASE-19457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292901#comment-16292901 ] stack commented on HBASE-19457: --- Loads of our Procedures are written as Procedures spawning subprocedures so we know it 'works'. bq. But AM starts it's own assign procs first (in recovery phase itself before joining cluster, ie. other procs can run), and they too get stuck somehow. One note is that it is ok if two assigns scheduled. The second will notice the successful first one and then finish. bq. But AM starts it's own assign procs first (in recovery phase itself before joining cluster, ie. other procs can run), and they too get stuck somehow. We could look at a log together? bq. AM only assigns offline regions if table is marked enabled. You are right... } else if (regionNode.getState() == State.OFFLINE) { if (isTableEnabled(regionNode.getTable())) { offlineRegionsToAssign.add(regionNode.getRegionInfo()); ... bq. We can easily solve the issue here by marking table as disabled. As a step in truncate before we create the new? Wonder why this needs it and CreateTable doesnt (I think you ask this above). bq. We should probably change TSM to assume tables with empty state as disabled. Hmm. Will complicate rolling upgrade. I like your questions on the end. They are questions about how the state machine should work. There should be no fuzzyness around states. Plainly there is going by your work here. Lets fix. New issue? > Debugging flaky > TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits > --- > > Key: HBASE-19457 > URL: https://issues.apache.org/jira/browse/HBASE-19457 > Project: HBase > Issue Type: Bug >Reporter: Appy >Assignee: Appy > Attachments: HBASE-19457.master.001.patch, patch1, test-output.txt > > > Trying to explain the bug in a more general way where understanding of > ProcedureV2 is not required. > Truncating table operation: > > delete region states from meta > delete table state from meta > > add new regions to meta with state null. > crash > recovery: TableStateManager treats table with null state as ENABLED. AM > treats regions with null state as offline. Combined result - AM starts > assigning the new regions from incomplete truncate operation. > Fix: Mark table as disabled instead of deleting it's state. > > *patch1* > Just added some logging to help with debugging: > - 60s was too less time, increased timeout > - Added some useful log statements -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-19457) Debugging flaky TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits
[ https://issues.apache.org/jira/browse/HBASE-19457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292170#comment-16292170 ] Appy commented on HBASE-19457: -- bq. Dang. Why is this Truncate Table not calling DeleteTable then CreateTable as subprocedures? Why is it dup'ing procedure body? Had same thought during debugging. Maybe the answer lies in Pv2 being able to handle trees of proc-subprocs. Given design around rootProcid, i think that was the goal, but not sure of it's status. At this point, instead of digging into Pv2 design to figure that out seemed waste of time since - if it's complete, we still probably shouldn't change things close to release - if not complete, we can't invest time to finish it before - Internal stuff, can be done in 2.1 - More important things are there than this :) bq. If a crash puts us into a whack state such that on resumption we do the wrong thing, then the Procedure is not written properly. It's was not managing state correctly. I want to try this one line patch because it should fix it. bq. What is wrong about when it goes to assign? Is it that we have not finished editing/adding all regions to hbase:meta? All regions are added to meta. But AM starts it's own assign procs first (in recovery phase itself before joining cluster, ie. other procs can run), and they too get stuck somehow. AM only assigns offline regions if table is marked enabled. It's two assumptions together which leads to wrong behavior here. AM assumes regions with empty state are offline. TableStateManager (TSM) assumes table with empty state is enabled. When AM recovers, it starts assigning. We can easily solve the issue here by marking table as disabled. In the end it's these three things: We should probably change TSM to assume tables with empty state as disabled. Always add new regions as CLOSED. And to tie last loose end, decide if region empty null means offline or closed. > Debugging flaky > TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits > --- > > Key: HBASE-19457 > URL: https://issues.apache.org/jira/browse/HBASE-19457 > Project: HBase > Issue Type: Bug >Reporter: Appy >Assignee: Appy > Attachments: HBASE-19457.master.001.patch, patch1, test-output.txt > > > Trying to explain the bug in a more general way where understanding of > ProcedureV2 is not required. > Truncating table operation: > > delete region states from meta > delete table state from meta > > add new regions to meta with state null. > crash > recovery: TableStateManager treats table with null state as ENABLED. AM > treats regions with null state as offline. Combined result - AM starts > assigning the new regions from incomplete truncate operation. > Fix: Mark table as disabled instead of deleting it's state. > > *patch1* > Just added some logging to help with debugging: > - 60s was too less time, increased timeout > - Added some useful log statements -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-19457) Debugging flaky TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits
[ https://issues.apache.org/jira/browse/HBASE-19457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16292097#comment-16292097 ] stack commented on HBASE-19457: --- Good one Appy. There are pieces that still need paving over. Looks like you found one (I'm currently working on another). When we truncate, we delete the table and its regions from hbase:meta or do we just edit state? (Looks like we delete the regions... good). Dang. Why is this Truncate Table not calling DeleteTable then CreateTable as subprocedures? Why is it dup'ing procedure body? If a crash puts us into a whack state such that on resumption we do the wrong thing, then the Procedure is not written properly. What is wrong about when it goes to assign? Is it that we have not finished editing/adding all regions to hbase:meta? I've been working on Master startup. It reads meta and if it finds regions in OPEN state, it will reassign them trying to retain their old locations. It will also assign regions that are OFFLINE which thinking about it now is NOT what we want. Who is doing the assign of regions with empty state? (Can talk tomorrow boss) > Debugging flaky > TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits > --- > > Key: HBASE-19457 > URL: https://issues.apache.org/jira/browse/HBASE-19457 > Project: HBase > Issue Type: Bug >Reporter: Appy >Assignee: Appy > Attachments: HBASE-19457.master.001.patch, patch1, test-output.txt > > > Trying to explain the bug in a more general way where understanding of > ProcedureV2 is not required. > Truncating table operation: > > delete region states from meta > delete table state from meta > > add new regions to meta with state null. > crash > recovery: TableStateManager treats table with null state as ENABLED. AM > treats regions with null state as offline. Combined result - AM starts > assigning the new regions from incomplete truncate operation. > Fix: Mark table as disabled instead of deleting it's state. > > *patch1* > Just added some logging to help with debugging: > - 60s was too less time, increased timeout > - Added some useful log statements -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-19457) Debugging flaky TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits
[ https://issues.apache.org/jira/browse/HBASE-19457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291441#comment-16291441 ] Appy commented on HBASE-19457: -- Let me put it in a more general way where understanding of ProcedureV2 is not required. Truncating table: delete region states from meta delete table state from meta add new regions to meta with state null. crash recovery: TableStateManager treats table with null state as ENABLED. AM treats regions with null state as offline. Combined result - AM starts assigning the new regions from incomplete truncate operation. Fix: Mark table as disabled instead of deleting it's state. > Debugging flaky > TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits > --- > > Key: HBASE-19457 > URL: https://issues.apache.org/jira/browse/HBASE-19457 > Project: HBase > Issue Type: Bug >Reporter: Appy >Assignee: Appy > Attachments: HBASE-19457.master.001.patch, patch1, test-output.txt > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-19457) Debugging flaky TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits
[ https://issues.apache.org/jira/browse/HBASE-19457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16291183#comment-16291183 ] Appy commented on HBASE-19457: -- Sure thing Duo. Thanks nevertheless! > Debugging flaky > TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits > --- > > Key: HBASE-19457 > URL: https://issues.apache.org/jira/browse/HBASE-19457 > Project: HBase > Issue Type: Bug >Reporter: Appy >Assignee: Appy > Attachments: HBASE-19457.master.001.patch, patch1, test-output.txt > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-19457) Debugging flaky TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits
[ https://issues.apache.org/jira/browse/HBASE-19457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290793#comment-16290793 ] Hadoop QA commented on HBASE-19457: --- | (x) *{color:red}-1 overall{color}* | \\ \\ || Vote || Subsystem || Runtime || Comment || | {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 8s{color} | {color:blue} Docker mode activated. {color} | || || || || {color:brown} Prechecks {color} || | {color:blue}0{color} | {color:blue} findbugs {color} | {color:blue} 0m 0s{color} | {color:blue} Findbugs executables are not available. {color} | | {color:green}+1{color} | {color:green} hbaseanti {color} | {color:green} 0m 0s{color} | {color:green} Patch does not have any anti-patterns. {color} | | {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} | | {color:red}-1{color} | {color:red} test4tests {color} | {color:red} 0m 0s{color} | {color:red} The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color} | || || || || {color:brown} master Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 5m 37s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 53s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 8s{color} | {color:green} master passed {color} | | {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 6m 9s{color} | {color:green} branch has no errors when building our shaded downstream artifacts. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 36s{color} | {color:green} master passed {color} | || || || || {color:brown} Patch Compile Tests {color} || | {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 5m 26s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 48s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 48s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 1m 10s{color} | {color:green} the patch passed {color} | | {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} | | {color:green}+1{color} | {color:green} shadedjars {color} | {color:green} 4m 53s{color} | {color:green} patch has no errors when building our shaded downstream artifacts. {color} | | {color:green}+1{color} | {color:green} hadoopcheck {color} | {color:green} 22m 21s{color} | {color:green} Patch does not cause any errors with Hadoop 2.6.5 2.7.4 or 3.0.0-beta1. {color} | | {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 35s{color} | {color:green} the patch passed {color} | || || || || {color:brown} Other Tests {color} || | {color:green}+1{color} | {color:green} unit {color} | {color:green} 98m 27s{color} | {color:green} hbase-server in the patch passed. {color} | | {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 17s{color} | {color:green} The patch does not generate ASF License warnings. {color} | | {color:black}{color} | {color:black} {color} | {color:black}142m 42s{color} | {color:black} {color} | \\ \\ || Subsystem || Report/Notes || | Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hbase:eee3b01 | | JIRA Issue | HBASE-19457 | | JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12902038/HBASE-19457.master.001.patch | | Optional Tests | asflicense javac javadoc unit findbugs shadedjars hadoopcheck hbaseanti checkstyle compile | | uname | Linux 7bde35b740b5 3.13.0-133-generic #182-Ubuntu SMP Tue Sep 19 15:49:21 UTC 2017 x86_64 GNU/Linux | | Build tool | maven | | Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build@2/component/dev-support/hbase-personality.sh | | git revision | master / 7466e64abb | | maven | version: Apache Maven 3.5.2 (138edd61fd100ec658bfa2d307c43b76940a5d7d; 2017-10-18T07:58:13Z) | | Default Java | 1.8.0_151 | | Test Results | https://builds.apache.org/job/PreCommit-HBASE-Build/10442/testReport/ | | modules | C: hbase-server U: hbase-server | | Console output | https://builds.apache.org/job/PreCommit-HBASE-Build/10442/console | | Powered by | Apache Yetus 0.6.0 http://yetus.apache.org | This message was automatically generated. > Debugging flaky >
[jira] [Commented] (HBASE-19457) Debugging flaky TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits
[ https://issues.apache.org/jira/browse/HBASE-19457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290698#comment-16290698 ] Duo Zhang commented on HBASE-19457: --- I'm not very familiar with the current AM yet, but I know it is complicated when reading the procedure2 related code... So let's wait for the boss [~stack]. > Debugging flaky > TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits > --- > > Key: HBASE-19457 > URL: https://issues.apache.org/jira/browse/HBASE-19457 > Project: HBase > Issue Type: Bug >Reporter: Appy >Assignee: Appy > Attachments: HBASE-19457.master.001.patch, patch1, test-output.txt > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-19457) Debugging flaky TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits
[ https://issues.apache.org/jira/browse/HBASE-19457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290681#comment-16290681 ] Appy commented on HBASE-19457: -- Ping [~stack], [~Apache9]. > Debugging flaky > TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits > --- > > Key: HBASE-19457 > URL: https://issues.apache.org/jira/browse/HBASE-19457 > Project: HBase > Issue Type: Bug >Reporter: Appy >Assignee: Appy > Attachments: HBASE-19457.master.001.patch, patch1, test-output.txt > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (HBASE-19457) Debugging flaky TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits
[ https://issues.apache.org/jira/browse/HBASE-19457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16290632#comment-16290632 ] Appy commented on HBASE-19457: -- After more debugging, i think i finally have fix (sorry for being slow, just beginning to understand AM). So the issue is, We delete table's state from meta (in step [TRUNCATE_TABLE_REMOVE_FROM_META |https://github.com/apache/hbase/blob/7466e64abb2c68c8a0f40f6051e4b5bf550e69bd/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/TruncateTableProcedure.java#L102]) On recovery, TableStateManager#fixTableStates assumes that missing state means enabled table is enabled. ([here|https://github.com/apache/hbase/blob/7466e64abb2c68c8a0f40f6051e4b5bf550e69bd/hbase-server/src/main/java/org/apache/hadoop/hbase/master/TableStateManager.java#L218]) Later we add regions to meta and crash after that. On recovery, AM sees these regions, looks for table state and finds it enabled, and starts assigning them and screws up. Simple fix here would be: Don't delete table state from meta, just let it remain DISABLED. --- But CreateTableProcedure also adds regions to meta and crashes. Why don't we see same issue there? It adds region row to meta, but does not add any row for the table. On recovery, when AM looks for table state corresponding to those regions, TSM#getTableState() throws TableNotFoundException, which get's caught [here|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/TableStateManager.java#L135]..etc etc End result being, it ignores those regions. Some bigger questions to ponder: 1) Should we really assume missing state column as enabled? Probably assuming disabled is more conservative and better choice? Won't screws up the cluster. (Only other place delete the state column is hbck) 2) Shouldn't new regions always be added with state closed? (dev thread: http://mail-archives.apache.org/mod_mbox/hbase-dev/201712.mbox/browser) > Debugging flaky > TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits > --- > > Key: HBASE-19457 > URL: https://issues.apache.org/jira/browse/HBASE-19457 > Project: HBase > Issue Type: Bug >Reporter: Appy >Assignee: Appy > Attachments: patch1, test-output.txt > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)