[ 
https://issues.apache.org/jira/browse/HBASE-19457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292901#comment-16292901
 ] 

stack commented on HBASE-19457:
-------------------------------

Loads of our Procedures are written as Procedures spawning subprocedures so we 
know it 'works'.


bq. But AM starts its own assign procs first (in recovery phase itself before
joining cluster, i.e. other procs can run), and they too get stuck somehow.

One note: it is ok if two assigns are scheduled. The second will notice that
the first one succeeded and then finish.

bq. But AM starts its own assign procs first (in recovery phase itself before
joining cluster, i.e. other procs can run), and they too get stuck somehow.

We could look at a log together?

bq. AM only assigns offline regions if table is marked enabled.

You are right...

      } else if (regionNode.getState() == State.OFFLINE) {
        if (isTableEnabled(regionNode.getTable())) {
          offlineRegionsToAssign.add(regionNode.getRegionInfo());
...
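
To make the interaction concrete, here is a minimal, self-contained sketch of
it. It is not HBase code: RecoverySketch and every field and method in it are
hypothetical stand-ins, and it assumes only what this issue describes (a table
with no recorded state is treated as ENABLED, and a region with no recorded
state is treated as offline).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the recovery check; not HBase classes.
class RecoverySketch {
    enum TableState { ENABLED, DISABLED }
    enum RegionState { OFFLINE, OPEN }

    // Table name -> state. A missing entry models "table state deleted from meta".
    final Map<String, TableState> tableStates = new HashMap<>();
    // Region name -> owning table name.
    final Map<String, String> regionTable = new LinkedHashMap<>();
    // Region name -> state. A missing entry models "region added with state null".
    final Map<String, RegionState> regionStates = new HashMap<>();

    // Assumption from the issue: a table with no state is treated as ENABLED.
    boolean isTableEnabled(String table) {
        TableState state = tableStates.get(table);
        return state == null || state == TableState.ENABLED;
    }

    // Assumption from the issue: a region with no state is treated as offline.
    RegionState stateOf(String region) {
        RegionState state = regionStates.get(region);
        return state == null ? RegionState.OFFLINE : state;
    }

    // Mirrors the shape of the check quoted above: offline regions are
    // collected for assignment only when their table counts as enabled.
    List<String> offlineRegionsToAssign() {
        List<String> toAssign = new ArrayList<>();
        for (Map.Entry<String, String> entry : regionTable.entrySet()) {
            String region = entry.getKey();
            String table = entry.getValue();
            if (stateOf(region) == RegionState.OFFLINE && isTableEnabled(table)) {
                toAssign.add(region);
            }
        }
        return toAssign;
    }
}
```

In this model, a region r1 of table t with no table state recorded is picked
up by offlineRegionsToAssign(). Once t is marked DISABLED, nothing is picked
up, which is the effect the proposed fix relies on.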

bq. We can easily solve the issue here by marking table as disabled.

As a step in truncate before we create the new? I wonder why this needs it and
CreateTable doesn't (I think you ask this above).

bq. We should probably change TSM to assume tables with empty state as disabled.

Hmm. That will complicate rolling upgrades.

I like your questions at the end. They are questions about how the state
machine should work. There should be no fuzziness around states. Plainly there
is, going by your work here. Let's fix. New issue?




> Debugging flaky 
> TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits
> ---------------------------------------------------------------------------------------
>
>                 Key: HBASE-19457
>                 URL: https://issues.apache.org/jira/browse/HBASE-19457
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Appy
>            Assignee: Appy
>         Attachments: HBASE-19457.master.001.patch, patch1, test-output.txt
>
>
> Trying to explain the bug in a more general way where understanding of 
> ProcedureV2 is not required.
> Truncating table operation:
> ....
> delete region states from meta
> delete table state from meta
> ....
> add new regions to meta with state null.
> ....crash
> ....recovery: TableStateManager treats table with null state as ENABLED. AM 
> treats regions with null state as offline. Combined result - AM starts 
> assigning the new regions from incomplete truncate operation.
> Fix: Mark table as disabled instead of deleting its state.
> ----
> *patch1*
> Just added some logging to help with debugging:
> - 60s was too short, so increased the timeout
> - Added some useful log statements



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
