[ 
https://issues.apache.org/jira/browse/HBASE-9721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870157#comment-13870157
 ] 

Jimmy Xiang commented on HBASE-9721:
------------------------------------

Looks good. Just one thing, for closeRegion, in AM#unassign (line around 3504), 
we don't handle DoNotRetryIOException. So it will retry and move region to 
fail_to_close state. What should we do here if we closeRegion?

> RegionServer should not accept regionOpen RPC intended for another(previous) 
> server
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-9721
>                 URL: https://issues.apache.org/jira/browse/HBASE-9721
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 0.98.0, 0.96.0
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 0.98.0, 0.99.0
>
>         Attachments: hbase-9721_v0.patch, hbase-9721_v1.patch, 
> hbase-9721_v2.patch, hbase-9721_v3.patch
>
>
> On a test cluster, this following events happened with ITBLL and CM leading 
> to meta being unavailable until master is restarted. 
> An RS carrying meta died, and master assigned the region to one of the RSs. 
> {code}
> 2013-10-03 23:30:06,611 INFO  
> [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1] 
> master.AssignmentManager: Assigning hbase:meta,,1.1588230740 to 
> gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
> 2013-10-03 23:30:06,611 INFO  
> [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1] 
> master.RegionStates: Transitioned {1588230740 state=OFFLINE, 
> ts=1380843006601, server=null} to {1588230740 state=PENDING_OPEN, 
> ts=1380843006611, 
> server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820}
> 2013-10-03 23:30:06,611 DEBUG 
> [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1] 
> master.ServerManager: New admin connection to 
> gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
> {code}
> At the same time, the RS that meta recently got assigned also died (due to 
> CM), and restarted: 
> {code}
> 2013-10-03 23:30:07,636 DEBUG [RpcServer.handler=17,port=60000] 
> master.ServerManager: REPORT: Server 
> gs-hdp2-secure-1380781860-hbase-8.cs1cloud.internal,60020,1380843002494 came 
> back up, removed it from the dead servers list
> 2013-10-03 23:30:08,769 INFO  [RpcServer.handler=18,port=60000] 
> master.ServerManager: Triggering server recovery; existingServer 
> gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820 looks 
> stale, new 
> server:gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380843006362
> 2013-10-03 23:30:08,771 DEBUG [RpcServer.handler=18,port=60000] 
> master.AssignmentManager: Checking region=hbase:meta,,1.1588230740, zk 
> server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
>  
> current=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820,
>  matches=true
> 2013-10-03 23:30:08,771 DEBUG [RpcServer.handler=18,port=60000] 
> master.ServerManager: 
> Added=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820 
> to dead servers, submitted shutdown handler to be executed meta=true
> 2013-10-03 23:30:08,771 INFO  [RpcServer.handler=18,port=60000] 
> master.ServerManager: Registering 
> server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380843006362
> 2013-10-03 23:30:08,772 INFO  
> [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2] 
> handler.MetaServerShutdownHandler: Splitting hbase:meta logs for 
> gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
> {code}
> AM/SSH sees that the RS that died was carrying meta, but the assignment RPC 
> request was still not sent:
> {code}
> 2013-10-03 23:30:08,791 DEBUG 
> [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2] 
> master.AssignmentManager: Checking region=hbase:meta,,1.1588230740, zk 
> server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
>  
> current=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820,
>  matches=true
> 2013-10-03 23:30:08,791 INFO  
> [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2] 
> handler.MetaServerShutdownHandler: Server 
> gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820 was 
> carrying META. Trying to assign.
> 2013-10-03 23:30:08,791 DEBUG 
> [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2] 
> master.RegionStates: Offline 1588230740 with current state=PENDING_OPEN, 
> expected state=OFFLINE/SPLITTING/MERGING
> 2013-10-03 23:30:08,791 INFO  
> [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2] 
> master.RegionStates: Transitioned {1588230740 state=PENDING_OPEN, 
> ts=1380843006611, 
> server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820}
>  to {1588230740 state=OFFLINE, ts=1380843008791, server=null}
> 2013-10-03 23:30:09,809 INFO  
> [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2] 
> zookeeper.ZooKeeperNodeTracker: Unsetting hbase:meta region location in 
> ZooKeeper
> {code}
> Our first attempt at the assign rpc fails, because the new server is now 
> starting. The second attempt though succeeds: 
> {code}
> 2013-10-03 23:30:10,621 INFO  
> [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1] 
> master.AssignmentManager: Assigning hbase:meta,,1.1588230740 to 
> gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
> 2013-10-03 23:30:10,621 INFO  
> [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1] 
> master.RegionStates: Transitioned {1588230740 state=OFFLINE, 
> ts=1380843008791, server=null} to {1588230740 state=PENDING_OPEN, 
> ts=1380843010621, 
> server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820}
> 2013-10-03 23:30:10,621 DEBUG 
> [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-1] 
> master.ServerManager: New admin connection to 
> gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
> 2013-10-03 23:30:10,622 DEBUG [RpcServer.handler=22,port=60000] 
> master.ServerManager: REPORT: Server 
> gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380843006362 came 
> back up, removed it from the dead servers list
> 2013-10-03 23:30:10,934 DEBUG 
> [MASTER_META_SERVER_OPERATIONS-gs-hdp2-secure-1380781860-hbase-12:60000-2] 
> master.AssignmentManager: Skip assigning {ENCODED => 1588230740, NAME => 
> 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''}, it is already {1588230740 
> state=PENDING_OPEN, ts=1380843010621, 
> server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820}
> {code}
> Note that the start time for the server does not match (1380842900820 is old 
> , 1380843006362 is new). 
> The region server got the rpc to open the region, but failed to change the zk 
> state, because ServerNames is not matching: 
> {code}
> 2013-10-03 23:30:10,601 INFO  [Priority.RpcServer.handler=0,port=60020] 
> regionserver.HRegionServer: Open hbase:meta,,1.1588230740
> 2013-10-03 23:30:10,897 DEBUG 
> [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] zookeeper.ZKAssign: 
> regionserver:60020-0x1417d489d9b0bd6 Transitioning 1588230740 from 
> M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
> 2013-10-03 23:30:10,918 WARN  
> [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] zookeeper.ZKAssign: 
> regionserver:60020-0x1417d489d9b0bd6 Attempt to transition the unassigned 
> node for 1588230740 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING failed, 
> the server that tried to transition was 
> gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380843006362 not 
> the expected 
> gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
> 2013-10-03 23:30:10,918 WARN  
> [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] 
> handler.OpenRegionHandler: Failed transition from OFFLINE to OPENING for 
> region=1588230740
> 2013-10-03 23:30:10,919 WARN  
> [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] 
> handler.OpenRegionHandler: Region was hijacked? Opening cancelled for 
> encodedName=1588230740
> 2013-10-03 23:30:10,919 INFO  
> [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] 
> handler.OpenRegionHandler: Opening of region {ENCODED => 1588230740, NAME => 
> 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''} failed, transitioning from 
> OFFLINE to FAILED_OPEN in ZK, expecting version 0
> 2013-10-03 23:30:10,919 DEBUG 
> [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] zookeeper.ZKAssign: 
> regionserver:60020-0x1417d489d9b0bd6 Transitioning 1588230740 from 
> M_ZK_REGION_OFFLINE to RS_ZK_REGION_FAILED_OPEN
> 2013-10-03 23:30:10,921 WARN  
> [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] zookeeper.ZKAssign: 
> regionserver:60020-0x1417d489d9b0bd6 Attempt to transition the unassigned 
> node for 1588230740 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_FAILED_OPEN 
> failed, the server that tried to transition was 
> gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380843006362 not 
> the expected 
> gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820
> 2013-10-03 23:30:10,921 WARN  
> [RS_OPEN_META-gs-hdp2-secure-1380781860-hbase-5:60020-0] 
> handler.OpenRegionHandler: Unable to mark region {ENCODED => 1588230740, NAME 
> => 'hbase:meta,,1', STARTKEY => '', ENDKEY => ''} as FAILED_OPEN. It's likely 
> that the master already timed out this open attempt, and thus another RS 
> already has the region.
> {code}
> It seems that the RS behaved correct by not being able to open the region by 
> transitioning the zk assignment node. However, the master fails to timeout 
> the assignment even though the meta region is reported in RIT: 
> {code}
> 2013-10-04 00:14:50,658 DEBUG 
> [gs-hdp2-secure-1380781860-hbase-12.cs1cloud.internal,60000,1380842679864-BalancerChore]
>  master.HMaster: Not running balancer because 1 region(s) in transition: 
> {1588230740={1588230740 state=PENDING_OPEN, ts=1380843010621, 
> server=gs-hdp2-secure-1380781860-hbase-5.cs1cloud.internal,60020,1380842900820}}
> {code}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

Reply via email to