[ https://issues.apache.org/jira/browse/HBASE-19457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292901#comment-16292901 ]
stack commented on HBASE-19457:
-------------------------------

Loads of our Procedures are written as Procedures spawning subprocedures so we know it 'works'.

bq. But AM starts its own assign procs first (in recovery phase itself before joining cluster, ie. other procs can run), and they too get stuck somehow.

One note is that it is ok if two assigns are scheduled. The second will notice the successful first one and then finish.

bq. But AM starts its own assign procs first (in recovery phase itself before joining cluster, ie. other procs can run), and they too get stuck somehow.

We could look at a log together?

bq. AM only assigns offline regions if table is marked enabled.

You are right...

      } else if (regionNode.getState() == State.OFFLINE) {
        if (isTableEnabled(regionNode.getTable())) {
          offlineRegionsToAssign.add(regionNode.getRegionInfo());
      ...

bq. We can easily solve the issue here by marking table as disabled.

As a step in truncate before we create the new? Wonder why this needs it and CreateTable doesn't (I think you ask this above).

bq. We should probably change TSM to assume tables with empty state as disabled.

Hmm. Will complicate rolling upgrade.

I like your questions at the end. They are questions about how the state machine should work. There should be no fuzziness around states. Plainly there is, going by your work here. Let's fix. New issue?

> Debugging flaky TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits
> ---------------------------------------------------------------------------------------
>
>                 Key: HBASE-19457
>                 URL: https://issues.apache.org/jira/browse/HBASE-19457
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Appy
>            Assignee: Appy
>         Attachments: HBASE-19457.master.001.patch, patch1, test-output.txt
>
>
> Trying to explain the bug in a more general way where understanding of
> ProcedureV2 is not required.
> Truncating table operation:
> ....
> delete region states from meta
> delete table state from meta
> ....
> add new regions to meta with state null.
> ....crash
> ....recovery: TableStateManager treats table with null state as ENABLED. AM
> treats regions with null state as offline. Combined result - AM starts
> assigning the new regions from incomplete truncate operation.
> Fix: Mark table as disabled instead of deleting its state.
> ----
> *patch1*
> Just added some logging to help with debugging:
> - 60s was too short a timeout, so increased it
> - Added some useful log statements
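
The recovery interaction in the issue description can be modelled with a small, self-contained Java sketch. Everything here is a hypothetical stand-in, not actual HBase code: `Meta`, `crashMidTruncate`, the table name `t1` and the region names are invented for illustration, and `isTableEnabled` / `offlineRegionsToAssign` only borrow their names from the snippet quoted in the comment. It assumes, as the description states, that a missing table state is read as ENABLED and a null region state is read as offline.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified model of the bug; not the real HBase classes.
class TruncateRecoverySketch {
    enum TableState { ENABLED, DISABLED }
    enum RegionState { OFFLINE, OPEN }

    // Hypothetical stand-in for the relevant contents of meta.
    static final class Meta {
        final Map<String, TableState> tableStates = new HashMap<>();
        final Map<String, RegionState> regionStates = new HashMap<>(); // value may be null
        final Map<String, String> regionToTable = new HashMap<>();
    }

    // Assumption from the description: a table with no recorded state counts as ENABLED.
    static boolean isTableEnabled(Meta meta, String table) {
        TableState state = meta.tableStates.get(table);
        return state == null || state == TableState.ENABLED;
    }

    // Assumption from the description: a region with null state counts as offline,
    // and offline regions are assigned only if their table is enabled.
    static List<String> offlineRegionsToAssign(Meta meta) {
        List<String> toAssign = new ArrayList<>();
        for (Map.Entry<String, String> e : meta.regionToTable.entrySet()) {
            RegionState state = meta.regionStates.get(e.getKey());
            boolean offline = state == null || state == RegionState.OFFLINE;
            if (offline && isTableEnabled(meta, e.getValue())) {
                toAssign.add(e.getKey());
            }
        }
        return toAssign;
    }

    // Replays the truncate steps up to the crash point.
    // markDisabled=false deletes the table state (buggy path);
    // markDisabled=true marks the table DISABLED instead (proposed fix).
    static Meta crashMidTruncate(boolean markDisabled) {
        Meta meta = new Meta();
        meta.tableStates.put("t1", TableState.ENABLED);
        meta.regionToTable.put("old-r1", "t1");
        meta.regionStates.put("old-r1", RegionState.OPEN);

        // delete region states from meta
        meta.regionToTable.remove("old-r1");
        meta.regionStates.remove("old-r1");

        // delete table state from meta, or mark it disabled
        if (markDisabled) {
            meta.tableStates.put("t1", TableState.DISABLED);
        } else {
            meta.tableStates.remove("t1");
        }

        // add new regions to meta with state null
        meta.regionToTable.put("new-r1", "t1");
        meta.regionStates.put("new-r1", null);

        // ....crash: the remaining truncate steps never run
        return meta;
    }

    public static void main(String[] args) {
        System.out.println(offlineRegionsToAssign(crashMidTruncate(false))); // prints [new-r1]
        System.out.println(offlineRegionsToAssign(crashMidTruncate(true)));  // prints []
    }
}
```

With the table state deleted, recovery in this model picks up `new-r1` for assignment even though the truncate never finished. With the table marked disabled, nothing is assigned.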