[
https://issues.apache.org/jira/browse/HBASE-5200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207785#comment-13207785
]
ramkrishna.s.vasudevan commented on HBASE-5200:
-----------------------------------------------
@Stack
First of all, thanks for the test case.
{code}
Would the fb trick of NOT processing callbacks during master failover help
here? At least for the scope of the AM.joinCluster?
{code}
I have not gone through this part yet, as I did not find the time.
{code}
. Isn't possible that in processRegionInTransition we may have done this
already?
{code}
The check is mutually exclusive: it runs either in handleRegion or in
processRegionInTransition, never in both.
{code}
It seems wrong that we are putting stuff into RIT in two places; in
processRegionsInTransition and in handlRegion if we happen to be fielding a
call back before failover has had a chance to run.
{code}
Though the RIT population exists in two places, only one of
processRegionInTransition or handleRegion will execute it for a given region,
so the population is needed in both to handle whichever flow is current.
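To make the "only one place populates RIT" idea concrete, here is a minimal sketch. It is not the patch and not HBase code: the class RitSketch, its State enum and its method bodies are hypothetical stand-ins, and it assumes the RIT bookkeeping can be modelled as a concurrent map keyed by region name. Whichever path reaches the region first creates the entry, and the other path finds it already present.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical model of regions-in-transition (RIT) bookkeeping.
// Names and states are assumptions for illustration, not HBase code.
class RitSketch {
    enum State { PENDING_CLOSE, CLOSING, CLOSED }

    private final ConcurrentMap<String, State> rit = new ConcurrentHashMap<>();

    // Failover path: seeds RIT from the state read from the znode, but never
    // overwrites an entry that the callback path has already created.
    // Returns true if this call created the entry.
    boolean processRegionInTransition(String region, State stateFromZk) {
        return rit.putIfAbsent(region, stateFromZk) == null;
    }

    // Callback path: if failover has not seeded the entry yet, create it here
    // instead of dropping the event because the state is null.
    // Returns true if this call created the entry.
    boolean handleRegion(String region, State reported) {
        boolean[] created = {false};
        // compute() is atomic per key on ConcurrentHashMap.
        rit.compute(region, (r, current) -> {
            created[0] = (current == null);
            return reported;
        });
        return created[0];
    }

    State stateOf(String region) {
        return rit.get(region);
    }

    public static void main(String[] args) throws InterruptedException {
        RitSketch am = new RitSketch();
        String region = "a66d281d231dfcaea97c270698b26b6f";
        boolean[] results = new boolean[2];
        Thread failover = new Thread(
            () -> results[0] = am.processRegionInTransition(region, State.CLOSING));
        Thread callback = new Thread(
            () -> results[1] = am.handleRegion(region, State.CLOSED));
        failover.start();
        callback.start();
        failover.join();
        callback.join();
        // Exactly one path created the entry, whichever thread ran first.
        System.out.println("created by exactly one path: " + (results[0] ^ results[1]));
        System.out.println("final state: " + am.stateOf(region));
    }
}
```

In this sketch the CLOSED event is never lost: if the callback arrives first it creates the entry itself, and if failover arrives first the callback simply updates the existing entry.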
{code}
applied to TRUNK.
However, TestAssignmentManager#testBalanceOnMasterFailover fails with or
without the patch.
{code}
The test case had a few problems:
-> The region was not transitioned after the CLOSED transition got a callback
for assigning it, so there was no RS to process the assign.
-> The gate variable was not getting reset.
-> We get a callback only after we do the ZKAssign.getDataAndWatch, but in
the test case we were getting a callback just after am.joinCluster.
So I have made some modifications.
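The watch-ordering point can be illustrated with a toy model. This is only a sketch: WatchSketch and its methods are hypothetical and do not use the ZooKeeper or HBase API. It assumes the standard ZooKeeper behaviour that a watch is a one-time trigger, so a node change reaches the listener only if a watch was armed beforehand.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of one-shot watches; not the ZooKeeper or HBase API.
class WatchSketch {
    private final Set<String> watched = new HashSet<>();
    final List<String> delivered = new ArrayList<>();

    // Stand-in for a get-data-and-watch call: reading the node arms a watch.
    void getDataAndWatch(String node) {
        watched.add(node);
    }

    // A node change is delivered only if a watch was armed, and delivering it
    // consumes the watch (one-shot). Returns true if a callback was delivered.
    boolean nodeChanged(String node) {
        if (!watched.remove(node)) {
            return false;
        }
        delivered.add(node);
        return true;
    }

    public static void main(String[] args) {
        WatchSketch zk = new WatchSketch();
        System.out.println(zk.nodeChanged("region-1")); // false: no watch armed yet
        zk.getDataAndWatch("region-1");
        System.out.println(zk.nodeChanged("region-1")); // true: watch was armed
        System.out.println(zk.nodeChanged("region-1")); // false: watch already consumed
    }
}
```

A test that fires the callback before the watch is armed exercises an ordering the real flow would not produce, which is why the test had to be adjusted.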
Once again thanks for the test case which helped to verify the scenarios.
Please provide your suggestions.
I need some time to check the FB approach and see whether to implement it here.
> AM.ProcessRegionInTransition() and AM.handleRegion() race thus leaving the
> region assignment inconsistent
> ---------------------------------------------------------------------------------------------------------
>
> Key: HBASE-5200
> URL: https://issues.apache.org/jira/browse/HBASE-5200
> Project: HBase
> Issue Type: Bug
> Affects Versions: 0.90.5
> Reporter: ramkrishna.s.vasudevan
> Assignee: ramkrishna.s.vasudevan
> Fix For: 0.94.0, 0.90.7, 0.92.1
>
> Attachments: 5200-test.txt, 5200-v2.txt, HBASE-5200.patch,
> HBASE-5200_1.patch, HBASE-5200_trunk_latest_with_test_2.patch,
> TEST-org.apache.hadoop.hbase.master.TestRestartCluster.xml,
> hbase-5200_90_latest.patch
>
>
> This is the scenario:
> Consider a case where the balancer is running and is trying to close regions
> on an RS.
> Before the close completes, a master switch happens.
> On master switch, the set of nodes that are in RIT is collected; we first
> get the data and start watching the node.
> After that, the node data is added into RIT.
> If, before the node is added to RIT, the RS on which close was called
> does a transition, AM.handleRegion() skips the handling because the RIT state
> is null.
> {code}
> 2012-01-13 10:50:46,358 WARN
> org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region
> a66d281d231dfcaea97c270698b26b6f from server
> HOST-192-168-47-205,20020,1326363111288 but region was in the state null and
> not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,358 WARN
> org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region
> c12e53bfd48ddc5eec507d66821c4d23 from server
> HOST-192-168-47-205,20020,1326363111288 but region was in the state null and
> not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,358 WARN
> org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region
> 59ae13de8c1eb325a0dd51f4902d2052 from server
> HOST-192-168-47-205,20020,1326363111288 but region was in the state null and
> not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,359 WARN
> org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region
> f45bc9614d7575f35244849af85aa078 from server
> HOST-192-168-47-205,20020,1326363111288 but region was in the state null and
> not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,359 WARN
> org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region
> cc3ecd7054fe6cd4a1159ed92fd62641 from server
> HOST-192-168-47-204,20020,1326342744518 but region was in the state null and
> not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,359 WARN
> org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region
> 3af40478a17fee96b4a192b22c90d5a2 from server
> HOST-192-168-47-205,20020,1326363111288 but region was in the state null and
> not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,359 WARN
> org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region
> e6096a8466e730463e10d3d61f809b92 from server
> HOST-192-168-47-204,20020,1326342744518 but region was in the state null and
> not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,359 WARN
> org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region
> 4806781a1a23066f7baed22b4d237e24 from server
> HOST-192-168-47-204,20020,1326342744518 but region was in the state null and
> not in expected PENDING_CLOSE or CLOSING states
> 2012-01-13 10:50:46,359 WARN
> org.apache.hadoop.hbase.master.AssignmentManager: Received CLOSED for region
> d69e104131accaefe21dcc01fddc7629 from server
> HOST-192-168-47-205,20020,1326363111288 but region was in the state null and
> not in expected PENDING_CLOSE or CLOSING states
> {code}
> In the branch, the CLOSING node is created by the RS, which leads to more
> inconsistency.