Sergey Shelukhin created HBASE-10210:
----------------------------------------

             Summary: during master startup, RS can be you-are-dead-ed by 
master in error
                 Key: HBASE-10210
                 URL: https://issues.apache.org/jira/browse/HBASE-10210
             Project: HBase
          Issue Type: Bug
    Affects Versions: 0.96.1.1
            Reporter: Sergey Shelukhin


Not sure of the root cause yet; I am at the "how did this ever work" stage.
We see this problem in 0.96.1, but did not see it in 0.96.0 + some patches.

It looks like RS information arriving from 2 sources, ZK and the server itself, 
can conflict. Master doesn't handle such cases (matching timestamps), and anyway, 
technically, timestamps can collide for two separate servers.

So master YouAreDead-s the RS that is already recorded and is now reporting in, 
and adds it again as well. Then it discovers that the new server has died with 
a fatal error!
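
To make the suspected sequence concrete, here is a minimal sketch in Java. It is not the actual ServerManager code: the class and method names (RegistrationRaceSketch, registerFromZk, reportNaive, reportGuarded, wouldBeRejected) are hypothetical, the model is single-threaded and only reproduces the ordering of the two registration paths, and the "guarded" variant is just one possible way to treat a matching start code, not a proposed patch.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the master's server bookkeeping.
// These are NOT the real HBase classes or method names.
final class RegistrationRaceSketch {
    // host:port -> start code (startup timestamp) of the server considered online
    private final Map<String, Long> online = new HashMap<>();
    // "host:port,startCode" entries the master is processing as dead
    private final Set<String> dead = new HashSet<>();

    private static String key(String hostPort, long startCode) {
        return hostPort + "," + startCode;
    }

    // Path 1 (master initialization): server found in ZK, not yet reported in.
    void registerFromZk(String hostPort, long startCode) {
        online.putIfAbsent(hostPort, startCode);
    }

    // Path 2 (RPC handler), naive: any existing entry for the same host:port
    // is treated as stale and expired, even if it is the very same instance.
    void reportNaive(String hostPort, long startCode) {
        Long existing = online.get(hostPort);
        if (existing != null) {
            dead.add(key(hostPort, existing));
        }
        online.put(hostPort, startCode);
    }

    // Path 2, guarded (an assumption, one possible handling): expire the
    // existing entry only when its start code is strictly older.
    void reportGuarded(String hostPort, long startCode) {
        Long existing = online.get(hostPort);
        if (existing != null && existing < startCode) {
            dead.add(key(hostPort, existing));
        }
        if (existing == null || existing <= startCode) {
            online.put(hostPort, startCode);
        }
    }

    // A later report from a server on the dead list would be rejected
    // (the YouAreDeadException case).
    boolean wouldBeRejected(String hostPort, long startCode) {
        return dead.contains(key(hostPort, startCode));
    }

    public static void main(String[] args) {
        String rs = "hbase-8:60020";   // shortened, illustrative host:port
        long startCode = 1387451803800L;

        RegistrationRaceSketch naive = new RegistrationRaceSketch();
        naive.registerFromZk(rs, startCode);
        naive.reportNaive(rs, startCode);
        System.out.println("naive rejects live RS: " + naive.wouldBeRejected(rs, startCode));

        RegistrationRaceSketch guarded = new RegistrationRaceSketch();
        guarded.registerFromZk(rs, startCode);
        guarded.reportGuarded(rs, startCode);
        System.out.println("guarded rejects live RS: " + guarded.wouldBeRejected(rs, startCode));
    }
}
```

In the naive path the same host, port and start code end up both online and on the dead list, so the next report from the live RS is rejected. Note that the guarded variant still cannot tell apart two genuinely separate servers whose timestamps collide.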

Note the threads: addition is called both from master initialization and from 
an RPC handler.
{noformat}
2013-12-19 11:16:45,290 INFO  [master:h2-ubuntu12-sec-1387431063-hbase-10:60000] master.ServerManager: Finished waiting for region servers count to settle; checked in 2, slept for 18262 ms, expecting minimum of 1, maximum of 2147483647, master is running.
2013-12-19 11:16:45,290 INFO  [master:h2-ubuntu12-sec-1387431063-hbase-10:60000] master.ServerManager: Registering server=h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020,1387451803800
2013-12-19 11:16:45,290 INFO  [master:h2-ubuntu12-sec-1387431063-hbase-10:60000] master.HMaster: Registered server found up in zk but who has not yet reported in: h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020,1387451803800
2013-12-19 11:16:45,380 INFO  [RpcServer.handler=4,port=60000] master.ServerManager: Triggering server recovery; existingServer h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020,1387451803800 looks stale, new server:h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020,1387451803800
2013-12-19 11:16:45,380 INFO  [RpcServer.handler=4,port=60000] master.ServerManager: Master doesn't enable ServerShutdownHandler during initialization, delay expiring server h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020,1387451803800
...
2013-12-19 11:16:46,925 ERROR [RpcServer.handler=7,port=60000] master.HMaster: Region server h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020,1387451803800 reported a fatal error:
ABORTING region server h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020,1387451803800: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing h2-ubuntu12-sec-1387431063-hbase-8.cs1cloud.internal,60020,1387451803800 as dead server

{noformat}

Presumably some of the recent ZK listener related changes b



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
