[ https://issues.apache.org/jira/browse/HBASE-9721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13853667#comment-13853667 ]
Jimmy Xiang commented on HBASE-9721: ------------------------------------ I ran the test 50 times on my local box and got no issue. Do you run the test with the latest code? > RegionServer should not accept regionOpen RPC intended for another(previous) > server > ----------------------------------------------------------------------------------- > > Key: HBASE-9721 > URL: https://issues.apache.org/jira/browse/HBASE-9721 > Project: HBase > Issue Type: Bug > Affects Versions: 0.98.0, 0.96.0 > Reporter: Enis Soztutar > Assignee: Enis Soztutar > Attachments: hbase-9721_v0.patch, hbase-9721_v1.patch, > hbase-9721_v2.patch > > > On a test cluster, this following events happened with ITBLL and CM leading > to meta being unavailable until master is restarted. > An RS carrying meta died, and master assigned the region to one of the RSs. > {code} > 2013-10-03 23:30:06,611 INFO > [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1] > master.AssignmentManager: Assigning hbase:meta,,1.1588230740 to > gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820 > 2013-10-03 23:30:06,611 INFO > [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1] > master.RegionStates: Transitioned {1588230740 state=OFFLINE, > ts=1380843006601, server=null} to {1588230740 state=PENDING_OPEN, > ts=1380843006611, > server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820} > 2013-10-03 23:30:06,611 DEBUG > [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1] > master.ServerManager: New admin connection to > gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820 > {code} > At the same time, the RS that meta recently got assigned also died (due to > CM), and restarted: > {code} > 2013-10-03 23:30:07,636 DEBUG [RpcServer.handler=17,port=60000] > master.ServerManager: REPORT: Server > gs-hdp2-secure-1380781860-hbase-8.cs1cloud.internal,60020,1380843002494 came > back up, removed it from the dead servers list > 2013-10-03 23:30:08,769 INFO [RpcServer.handler=18,port=60000] > master.ServerManager: Triggering server recovery; existingServer > gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820 looks > stale, new > server:gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380843006362 > 2013-10-03 23:30:08,771 DEBUG [RpcServer.handler=18,port=60000] > master.AssignmentManager: Checking region=hbase:meta,,1.1588230740, zk > server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820 > > current=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820, > matches=true > 2013-10-03 23:30:08,771 DEBUG [RpcServer.handler=18,port=60000] > master.ServerManager: > Added=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820 > to dead servers, submitted shutdown handler to be executed meta=true > 2013-10-03 23:30:08,771 INFO [RpcServer.handler=18,port=60000] > master.ServerManager: Registering > server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380843006362 > 2013-10-03 23:30:08,772 INFO > [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2] > handler.MetaServerShutdownHandler: Splitting hbase:meta logs for > gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820 > {code} > AM/SSH sees that the RS that died was carrying meta, but the assignment RPC > request was still not sent: > {code} > 2013-10-03 23:30:08,791 DEBUG > [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2] > master.AssignmentManager: Checking region=hbase:meta,,1.1588230740, zk > server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820 > > current=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820, > matches=true > 2013-10-03 23:30:08,791 INFO > [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2] > handler.MetaServerShutdownHandler: Server > gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820 was > carrying META. Trying to assign. > 2013-10-03 23:30:08,791 DEBUG > [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2] > master.RegionStates: Offline 1588230740 with current state=PENDING_OPEN, > expected state=OFFLINE/SPLITTING/MERGING > 2013-10-03 23:30:08,791 INFO > [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2] > master.RegionStates: Transitioned {1588230740 state=PENDING_OPEN, > ts=1380843006611, > server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820} > to {1588230740 state=OFFLINE, ts=1380843008791, server=null} > 2013-10-03 23:30:09,809 INFO > [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2] > zookeeper.ZooKeeperNodeTracker: Unsetting hbase:meta region location in > ZooKeeper > {code} > Our first attempt at the assign rpc fails, because the new server is now > starting. The second attempt though succeeds: > {code} > 2013-10-03 23:30:10,621 INFO > [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1] > master.AssignmentManager: Assigning hbase:meta,,1.1588230740 to > gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820 > 2013-10-03 23:30:10,621 INFO > [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1] > master.RegionStates: Transitioned {1588230740 state=OFFLINE, > ts=1380843008791, server=null} to {1588230740 state=PENDING_OPEN, > ts=1380843010621, > server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820} > 2013-10-03 23:30:10,621 DEBUG > [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1] > master.ServerManager: New admin connection to > gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820 > 2013-10-03 23:30:10,622 DEBUG [RpcServer.handler=22,port=60000] > master.ServerManager: REPORT: Server > gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380843006362 came > back up, removed it from the dead servers list > 2013-10-03 23:30:10,934 DEBUG > [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2] > master.AssignmentManager: Skip assigning {ENCODED => 1588230740, NAME => > 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''}, it is already {1588230740 > state=PENDING_OPEN, ts=1380843010621, > server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820} > {code} > Note that the start time for the server does not match (1380842900820 is old > , 1380843006362 is new). > The region server got the rpc to open the region, but failed to change the zk > state, because ServerNames is not matching: > {code} > 2013-10-03 23:30:10,601 INFO [Priority.RpcServer.handler=0,port=60020] > regionserver.HRegionServer: Open hbase:meta,,1.1588230740 > 2013-10-03 23:30:10,897 DEBUG > [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] zookeeper.ZKAssign: > regionserver:60020-0x1417d489d9b0bd6 Transitioning 1588230740 from > M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING > 2013-10-03 23:30:10,918 WARN > [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] zookeeper.ZKAssign: > regionserver:60020-0x1417d489d9b0bd6 Attempt to transition the unassigned > node for 1588230740 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, > the server that tried to transition was > gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380843006362 not > the expected > gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820 > 2013-10-03 23:30:10,918 WARN > [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] > handler.OpenRegionHandler: Failed transition from OFFLINE to OPENING for > region=1588230740 > 2013-10-03 23:30:10,919 WARN > [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] > handler.OpenRegionHandler: Region was hijacked? Opening cancelled for > encodedName=1588230740 > 2013-10-03 23:30:10,919 INFO > [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] > handler.OpenRegionHandler: Opening of region {ENCODED => 1588230740, NAME => > 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''} failed, transitioning from > OFFLINE to FAILED_OPEN in ZK, expecting version 0 > 2013-10-03 23:30:10,919 DEBUG > [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] zookeeper.ZKAssign: > regionserver:60020-0x1417d489d9b0bd6 Transitioning 1588230740 from > M_ZK_REGION_OFFLINE to RS_ZK_REGION_FAILED_OPEN > 2013-10-03 23:30:10,921 WARN > [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] zookeeper.ZKAssign: > regionserver:60020-0x1417d489d9b0bd6 Attempt to transition the unassigned > node for 1588230740 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_FAILED_OPEN > failed, the server that tried to transition was > gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380843006362 not > the expected > gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820 > 2013-10-03 23:30:10,921 WARN > [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] > handler.OpenRegionHandler: Unable to mark region {ENCODED => 1588230740, NAME > => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''} as FAILED_OPEN. It's likely > that the master already timed out this open attempt, and thus another RS > already has the region. > {code} > It seems that the RS behaved correct by not being able to open the region by > transitioning the zk assignment node. However, the master fails to timeout > the assignment even though the meta region is reported in RIT: > {code} > 2013-10-04 00:14:50,658 DEBUG > [gs-hdp2-secure-1380781860-hbase-12.cs1cloud.internal,60000,1380842679864-BalancerChore] > master.HMaster: Not running balancer because 1 region(s) in transition: > {1588230740={1588230740 state=PENDING_OPEN, ts=1380843010621, > server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820}} > {code} -- This message was sent by Atlassian JIRA (v6.1.4#6159)