[ https://issues.apache.org/jira/browse/HBASE-17264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15855610#comment-15855610 ]
Hudson commented on HBASE-17264:
--------------------------------

FAILURE: Integrated in Jenkins build HBase-1.2-IT #590 (See [https://builds.apache.org/job/HBase-1.2-IT/590/])
HBASE-17264 Processing RIT with offline state will always fail to open (tedyu: rev 27303fdfb7180d8ba8d8241dc7217a35cc310994)
* (edit) hbase-server/src/main/java/org/apache/hadoop/hbase/master/AssignmentManager.java


> Processing RIT with offline state will always fail to open the first time
> -------------------------------------------------------------------------
>
>                 Key: HBASE-17264
>                 URL: https://issues.apache.org/jira/browse/HBASE-17264
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 1.1.7
>            Reporter: Allan Yang
>            Assignee: Allan Yang
>             Fix For: 1.4.0, 1.3.1, 1.2.5, 1.1.9
>
>         Attachments: HBASE-17264-branch-1.1.patch
>
>
> In AssignmentManager#processRegionsInTransition, when handling regions in the
> M_ZK_REGION_OFFLINE state, we use a handler to reassign the region. But when
> calling assign, we pass false for setOfflineInZK, so the zk node is not set:
> {code}
> case M_ZK_REGION_OFFLINE:
>   // Insert in RIT and resend to the regionserver
>   regionStates.updateRegionState(rt, State.PENDING_OPEN);
>   final RegionState rsOffline = regionStates.getRegionState(regionInfo);
>   this.executorService.submit(
>     new EventHandler(server, EventType.M_MASTER_RECOVERY) {
>       @Override
>       public void process() throws IOException {
>         ReentrantLock lock = locker.acquireLock(regionInfo.getEncodedName());
>         try {
>           RegionPlan plan = new RegionPlan(regionInfo, null, sn);
>           addPlan(encodedName, plan);
>           assign(rsOffline, false, false); // we decided not to setOfflineInZK
>         } finally {
>           lock.unlock();
>         }
>       }
>     });
>   break;
> {code}
> But when setOfflineInZK is false, we pass a zk node version of -1 to the
> regionserver, meaning the zk node does not exist. In fact the offline zk node
> does exist, with a different version, so the RegionServer reports a failure
> to open.
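The version mismatch described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: `VersionCheckSketch`, `Node`, and `transition` are hypothetical names, not HBase or ZooKeeper classes, and the model treats -1 as an ordinary expected version that simply does not match. The values 3 and -1 are chosen purely for illustration.

```java
// Simplified model of a versioned node with a compare-on-version transition.
// NOT HBase or ZooKeeper code; all names here are hypothetical.
class VersionCheckSketch {

    static final class Node {
        private String state;
        private int version;

        Node(String state, int version) {
            this.state = state;
            this.version = version;
        }

        String getState() {
            return state;
        }

        int getVersion() {
            return version;
        }

        /**
         * Moves the node from 'from' to 'to' only when both the current state
         * and the caller's expected version match. On success the version is
         * bumped, the way a versioned store increments on every write.
         */
        synchronized boolean transition(String from, String to, int expectedVersion) {
            if (!state.equals(from) || version != expectedVersion) {
                return false; // caller's view of the node is stale
            }
            state = to;
            version++;
            return true;
        }
    }

    public static void main(String[] args) {
        // The node already exists in the offline state at some version.
        Node node = new Node("M_ZK_REGION_OFFLINE", 3);

        // The opener was told to expect -1, so the check fails even though
        // the node is in the right state.
        boolean withStaleVersion =
            node.transition("M_ZK_REGION_OFFLINE", "RS_ZK_REGION_OPENING", -1);
        System.out.println("transition with expected version -1: " + withStaleVersion);

        // With the node's real version, the same transition goes through.
        boolean withRealVersion =
            node.transition("M_ZK_REGION_OFFLINE", "RS_ZK_REGION_OPENING", node.getVersion());
        System.out.println("transition with the real version: " + withRealVersion);
    }
}
```

In this model the first attempt fails purely because of the expected version, which is the shape of the failure the description attributes to passing -1 while the offline node already exists.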
> This actually happened in our test environment. The master does receive the
> FAILED_OPEN zk event and retries later, but because of another bug
> (HBASE-17265) the region remains in the closed state forever.
> Master assigns the region in RIT:
> {noformat}
> 2016-11-23 17:11:46,842 INFO [example.org:30001.activeMasterManager] master.AssignmentManager: Processing 57513956a7b671f4e8da1598c2e2970e in state: M_ZK_REGION_OFFLINE
> 2016-11-23 17:11:46,842 INFO [example.org:30001.activeMasterManager] master.RegionStates: Transition {57513956a7b671f4e8da1598c2e2970e state=OFFLINE, ts=1479892306738, server=example.org,30003,1475893095003} to {57513956a7b671f4e8da1598c2e2970e state=PENDING_OPEN, ts=1479892306842, server=example.org,30003,1479780976834}
> 2016-11-23 17:11:46,842 INFO [example.org:30001.activeMasterManager] master.AssignmentManager: Processed region 57513956a7b671f4e8da1598c2e2970e in state M_ZK_REGION_OFFLINE, on server: example.org,30003,1479780976834
> 2016-11-23 17:11:46,843 INFO [MASTER_SERVER_OPERATIONS-example.org:30001-0] master.AssignmentManager: Assigning test,QFO7M,1475986053104.57513956a7b671f4e8da1598c2e2970e. to example.org,30003,1479780976834
> {noformat}
> The RegionServer received the open region request and created an
> OpenRegionHandler to open the region, only to find that the RIT node's version
> was not the one it expected. In the end the RS transitioned the RIT ZK node to
> failed open:
> {noformat}
> 2016-11-23 17:11:46,860 WARN [RS_OPEN_REGION-example.org:30003-1] coordination.ZkOpenRegionCoordination: Failed transition from OFFLINE to OPENING for region=57513956a7b671f4e8da1598c2e2970e
> 2016-11-23 17:11:46,861 WARN [RS_OPEN_REGION-example.org:30003-1] handler.OpenRegionHandler: Region was hijacked? Opening cancelled for encodedName=57513956a7b671f4e8da1598c2e2970e
> 2016-11-23 17:11:46,860 WARN [RS_OPEN_REGION-example.org:30003-1] zookeeper.ZKAssign: regionserver:30003-0x15810b5f633015f, quorum=hbase4dev04.et2sqa:2181,hbase4dev05.et2sqa:2181,hbase4dev06.et2sqa:2181, baseZNode=/test-hbase11-func2 Attempt to transition the unassigned node for 57513956a7b671f4e8da1598c2e2970e from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, the node existed but was version 3 not the expected version -1
> {noformat}
> The master received this zk event and began handling RS_ZK_REGION_FAILED_OPEN:
> {noformat}
> 2016-11-23 17:11:46,944 DEBUG [AM.ZK.Worker-pool2-t1] master.AssignmentManager: Handling RS_ZK_REGION_FAILED_OPEN, server=example.org,30003,1479780976834, region=57513956a7b671f4e8da1598c2e2970e, current_state={57513956a7b671f4e8da1598c2e2970e state=PENDING_OPEN, ts=1479892306843, server=example.org,30003,1479780976834}
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)