[ 
https://issues.apache.org/jira/browse/HBASE-5092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ramkrishna.s.vasudevan updated HBASE-5092:
------------------------------------------

    Comment: was deleted

(was: @Liu

Can we handle RIT exception also to retry the assignment? What do you think?)
    
> Two adjacent assignments lead region is in PENDING_OPEN state and block table 
> disable and enable actions.
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-5092
>                 URL: https://issues.apache.org/jira/browse/HBASE-5092
>             Project: HBase
>          Issue Type: Bug
>          Components: master, regionserver
>    Affects Versions: 0.92.0
>            Reporter: Liu Jia
>            Assignee: Liu Jia
>         Attachments: unhandled_PENDING_OPEN_lead_by_two_assignment.patch
>
>
>   
> A region is stuck in PENDING_OPEN state, and disable and enable are blocked.
> We occasionally find that two assignments issued within a short interval 
> leave a PENDING_OPEN state in the regionInTransition map, which blocks the 
> disable and enable table actions.
> We found that the second assignment sets the znode of this region to 
> M_ZK_REGION_OFFLINE, then sets the state in the AssignmentManager's 
> regionInTransition map to PENDING_OPEN, and then aborts its further operation 
> because a RegionAlreadyInTransitionException reports that the region is 
> already in transition on the regionserver.
> At the same time, the first assignment is in tickleOpening and finds that the 
> version of the znode has been changed by the second assignment, so the 
> OpenRegionHandler prints the following two lines:
> {noformat} 
> 2011-12-23 22:12:15,197 WARN  [RS_OPEN_REGION-data16,59892,1324649528415-0] 
> zookeeper.ZKAssign(788): regionserver:59892-0x1346b43b91e0002 Attempt to 
> transition the unassigned node for 15237599c632752b8cfd3d5a86349768 from 
> RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING failed, the node existed but was 
> version 2 not the expected version 1
> 2011-12-23 22:12:15,197 WARN  [RS_OPEN_REGION-data16,59892,1324649528415-0] 
> handler.OpenRegionHandler(403): Failed refreshing OPENING; 
> region=15237599c632752b8cfd3d5a86349768, context=post_region_open
> {noformat} 
> After that, it tries to change the state to FAILED_OPEN, but that also fails, 
> this time because the node is now in the M_ZK_REGION_OFFLINE state.
> This is the output:
> {noformat} 
> 2011-12-23 22:12:15,199 WARN  [RS_OPEN_REGION-data16,59892,1324649528415-0] 
> zookeeper.ZKAssign(812): regionserver:59892-0x1346b43b91e0002 Attempt to 
> transition the unassigned node for 15237599c632752b8cfd3d5a86349768 from 
> RS_ZK_REGION_OPENING to RS_ZK_REGION_FAILED_OPEN failed, the node existed but 
> was in the state M_ZK_REGION_OFFLINE set by the server 
> data16,59892,1324649528415
> 2011-12-23 22:12:15,199 WARN  [RS_OPEN_REGION-data16,59892,1324649528415-0] 
> handler.OpenRegionHandler(307): Unable to mark region {NAME => 
> 'table1,,1324649533045.15237599c632752b8cfd3d5a86349768.', STARTKEY => '', 
> ENDKEY => '', ENCODED => 15237599c632752b8cfd3d5a86349768,} as FAILED_OPEN. 
> It's likely that the master already timed out this open attempt, and thus 
> another RS already has the region.
> {noformat} 
> After all that, the PENDING_OPEN state is left in the AssignmentManager's 
> regionInTransition map and nothing deals with it further.
> The region stays in this state until the master times the transition out.
> The following is the test code:
> {code:title=test.java|borderStyle=solid}
> @Test
>   public void testDisableTables() throws IOException {
>     for (int i = 0; i < 20; i++) {
>       HTableDescriptor des = admin.getTableDescriptor(Bytes.toBytes(table1));
>       List<HRegionInfo> hris = TEST_UTIL.getHBaseCluster().getMaster()
>           .getAssignmentManager().getRegionsOfTable(Bytes.toBytes(table1));
>       TEST_UTIL.getHBaseCluster().getMaster()
>           .assign(hris.get(0).getRegionName());
>   
>       TEST_UTIL.getHBaseCluster().getMaster()
>           .assign(hris.get(0).getRegionName());
>   
>       admin.disableTable(Bytes.toBytes(table1));
>       admin.modifyTable(Bytes.toBytes(table1), des);
>       admin.enableTable(Bytes.toBytes(table1));
>     }
>   }
> {code}
> To fix this, we add a line to 
> public static int ZKAssign.transitionNode() so that a transition whose 
> endState is RS_ZK_REGION_FAILED_OPEN is allowed to pass.
> {code:title=ZKAssign.java|borderStyle=solid}
>     if ((!existingData.getEventType().equals(beginState))
>       // Add the following line to make the endState.RS_ZK_REGION_FAILED_OPEN
>       // transition pass.
>       && (!endState.equals(endState.RS_ZK_REGION_FAILED_OPEN))) {
>       LOG.warn(zkw.prefix("Attempt to transition the " +
>         "unassigned node for " + encoded +
>         " from " + beginState + " to " + endState + " failed, " +
>         "the node existed but was in the state " +
>         existingData.getEventType() +
>         " set by the server " + serverName));
>       return -1;
>     }
> {code}
> Running the test case again, we found that before the first assignment 
> transitions the state from offline to opening, the second assignment can set 
> the state to offline again and change the version of the znode.
> In OpenRegionHandler.process(), the following part then fails and makes 
> process() return.
> {code:title=OpenRegionHandler.java|borderStyle=solid}
>  if (!transitionZookeeperOfflineToOpening(encodedName,
>           versionOfOfflineNode)) {
>         LOG.warn("Region was hijacked? It no longer exists, encodedName=" +
>           encodedName);
>         return;
>       }
> {code}
> So we add the following call to this part so that the open region process 
> transitions to FAILED_OPEN.
> {code:title=OpenRegionHandler.java|borderStyle=solid}
>  if (!transitionZookeeperOfflineToOpening(encodedName,
>           versionOfOfflineNode)) {
>         LOG.warn("Region was hijacked? It no longer exists, encodedName=" +
>           encodedName);
>         tryTransitionToFailedOpen(regionInfo);
>         return;
>       }
> {code}
> After the two amendments, two adjacent assignments will not lead to an 
> unhandled PENDING_OPEN state.
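
The race and the first amendment can be replayed with a small in-memory sketch. This is a toy model, not HBase or ZooKeeper code: the class PendingOpenRaceSketch, its Node type, forceOffline(), the ANY_VERSION constant and the boolean "patched" flag are all hypothetical names introduced for illustration. It also assumes, based on the log output quoted above, that the FAILED_OPEN attempt is rejected by the state check rather than by the version check.

```java
/** Toy, in-memory model of a versioned znode; none of this is HBase or ZooKeeper API. */
class PendingOpenRaceSketch {

    enum EventType { M_ZK_REGION_OFFLINE, RS_ZK_REGION_OPENING, RS_ZK_REGION_FAILED_OPEN }

    /** Hypothetical marker meaning "do not check the node version". */
    static final int ANY_VERSION = -1;

    static final class Node {
        EventType state = EventType.M_ZK_REGION_OFFLINE;
        int version = 0;

        /** Master side: unconditionally reset the node to OFFLINE, bumping its version. */
        void forceOffline() {
            state = EventType.M_ZK_REGION_OFFLINE;
            version++;
        }

        /** Regionserver side: returns the new version, or -1 if the transition is refused. */
        int transition(EventType begin, EventType end, int expectedVersion, boolean patched) {
            if (expectedVersion != ANY_VERSION && expectedVersion != version) {
                return -1; // someone else modified the node in between
            }
            // The amendment: a transition to FAILED_OPEN skips the begin-state check.
            boolean failedOpenExempt = patched && end == EventType.RS_ZK_REGION_FAILED_OPEN;
            if (state != begin && !failedOpenExempt) {
                return -1; // node is not in the state the caller expected
            }
            state = end;
            return ++version;
        }
    }

    /** Replays the two adjacent assignments and returns the final node state. */
    static EventType simulate(boolean patched) {
        Node node = new Node(); // first assignment: OFFLINE, version 0
        int opening = node.transition(EventType.M_ZK_REGION_OFFLINE,
                EventType.RS_ZK_REGION_OPENING, 0, patched); // now version 1
        node.forceOffline(); // second assignment: OFFLINE again, version 2
        // The first assignment refreshes OPENING, expecting version 1, and is refused.
        int tickle = node.transition(EventType.RS_ZK_REGION_OPENING,
                EventType.RS_ZK_REGION_OPENING, opening, patched);
        if (tickle == -1) {
            // It then tries to mark the open attempt as failed.
            node.transition(EventType.RS_ZK_REGION_OPENING,
                    EventType.RS_ZK_REGION_FAILED_OPEN, ANY_VERSION, patched);
        }
        return node.state;
    }

    public static void main(String[] args) {
        System.out.println("unpatched: " + simulate(false));
        System.out.println("patched:   " + simulate(true));
    }
}
```

In this model the unpatched run ends with the node still in M_ZK_REGION_OFFLINE, so nothing signals that the open attempt failed, while the patched run ends in RS_ZK_REGION_FAILED_OPEN. The second amendment (calling tryTransitionToFailedOpen when the offline-to-opening transition fails) is not modeled here.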

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
