[ 
https://issues.apache.org/jira/browse/HBASE-19457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292170#comment-16292170
 ] 

Appy commented on HBASE-19457:
------------------------------

bq. Dang. Why is this Truncate Table not calling DeleteTable then CreateTable 
as subprocedures? Why is it dup'ing procedure body?
Had the same thought while debugging. Maybe the answer lies in whether Pv2 can 
handle trees of proc-subprocs. Given the design around rootProcid, I think that 
was the goal, but I'm not sure of its status.
At this point, digging into the Pv2 design to figure that out seemed like a 
waste of time, since:
- if it's complete, we still probably shouldn't change things this close to 
release
- if it's not complete, we can't invest the time to finish it before the release
- it's internal stuff, so it can be done in 2.1
- there are more important things than this :)

bq. If a crash puts us into a whack state such that on resumption we do the 
wrong thing, then the Procedure is not written properly.
It was not managing state correctly. I want to try this one-line patch 
because it should fix it.

bq. What is wrong about when it goes to assign? Is it that we have not finished 
editing/adding all regions to hbase:meta?
All regions are added to meta. But AM starts its own assign procs first (in 
the recovery phase itself, before joining the cluster, i.e. before other procs 
can run), and they too get stuck somehow.

AM only assigns offline regions if the table is marked enabled.
It's two assumptions together that lead to the wrong behavior here:
AM assumes regions with empty state are offline, and TableStateManager (TSM) 
assumes a table with empty state is enabled.
So when AM recovers, it starts assigning the new regions.
We can easily solve the issue here by marking the table as disabled.
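
To make the interaction concrete, here is a rough toy model of it. This is only a sketch: the class, enums, and maps below are hypothetical stand-ins for hbase:meta, TSM, and AM, not the real HBase classes or APIs. It only mirrors the two defaults described above and compares deleting the table state with marking the table disabled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: hypothetical stand-ins, not the real HBase classes.
class TruncateRecoverySketch {
    enum TableState { ENABLED, DISABLED }
    enum RegionState { OFFLINE, CLOSED }

    // Stand-ins for what is persisted in hbase:meta.
    // A missing key or a null value models "empty state".
    final Map<String, TableState> tableStates = new HashMap<>();
    final Map<String, RegionState> regionStates = new LinkedHashMap<>();
    final Map<String, String> regionToTable = new HashMap<>();

    // TSM-style assumption: a table with empty state counts as enabled.
    boolean isTableEnabled(String table) {
        TableState s = tableStates.get(table);
        return s == null || s == TableState.ENABLED;
    }

    // AM-style assumption: a region with empty state counts as offline.
    boolean isRegionOffline(String region) {
        RegionState s = regionStates.get(region);
        return s == null || s == RegionState.OFFLINE;
    }

    void addRegion(String table, String region, RegionState state) {
        regionToTable.put(region, table);
        regionStates.put(region, state); // null state is allowed here
    }

    // Recovery: assign every offline region whose table looks enabled.
    List<String> regionsAssignedOnRecovery() {
        List<String> assigned = new ArrayList<>();
        for (String region : regionStates.keySet()) {
            if (isTableEnabled(regionToTable.get(region)) && isRegionOffline(region)) {
                assigned.add(region);
            }
        }
        return assigned;
    }

    // Replays the truncate steps up to the crash, then runs recovery.
    static List<String> crashMidTruncate(boolean markTableDisabled) {
        TruncateRecoverySketch meta = new TruncateRecoverySketch();
        meta.tableStates.put("t1", TableState.ENABLED);
        meta.addRegion("t1", "old-region", RegionState.CLOSED);

        // Delete region states from meta.
        meta.regionStates.remove("old-region");
        meta.regionToTable.remove("old-region");

        // Either delete the table state (current behavior) or mark it disabled (the fix).
        if (markTableDisabled) {
            meta.tableStates.put("t1", TableState.DISABLED);
        } else {
            meta.tableStates.remove("t1");
        }

        // Add new regions to meta with state null, then "crash" before finishing.
        meta.addRegion("t1", "new-region-a", null);
        meta.addRegion("t1", "new-region-b", null);
        return meta.regionsAssignedOnRecovery();
    }

    public static void main(String[] args) {
        System.out.println(crashMidTruncate(false)); // [new-region-a, new-region-b]
        System.out.println(crashMidTruncate(true));  // []
    }
}
```

In this model, deleting the table state lets recovery pick up both half-created regions, while marking the table disabled leaves nothing for recovery to assign.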

In the end it's these three things:
- We should probably change TSM to treat tables with empty state as disabled.
- Always add new regions as CLOSED.
- And to tie up the last loose end, decide whether an empty (null) region state 
means offline or closed.



> Debugging flaky 
> TestTruncateTableProcedure#testRecoveryAndDoubleExecutionPreserveSplits
> ---------------------------------------------------------------------------------------
>
>                 Key: HBASE-19457
>                 URL: https://issues.apache.org/jira/browse/HBASE-19457
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Appy
>            Assignee: Appy
>         Attachments: HBASE-19457.master.001.patch, patch1, test-output.txt
>
>
> Trying to explain the bug in a more general way where understanding of 
> ProcedureV2 is not required.
> Truncating table operation:
> ....
> delete region states from meta
> delete table state from meta
> ....
> add new regions to meta with state null.
> ....crash
> ....recovery: TableStateManager treats table with null state as ENABLED. AM 
> treats regions with null state as offline. Combined result - AM starts 
> assigning the new regions from incomplete truncate operation.
> Fix: Mark table as disabled instead of deleting its state.
> ----
> *patch1*
> Just added some logging to help with debugging:
> - 60s was too short a timeout, so increased it
> - Added some useful log statements



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
