[ 
https://issues.apache.org/jira/browse/HBASE-13635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14530978#comment-14530978
 ] 

Elliott Clark commented on HBASE-13635:
---------------------------------------

This cluster is write-dominated, so the RPC queues are set up as follows:
'hbase.ipc.server.callqueue.handler.factor': 0.7
'hbase.ipc.server.callqueue.read.ratio': 0.3
'hbase.ipc.server.callqueue.scan.ratio': 0.2

It looks like the scheduler assumes that any request that isn't a mutate is a 
read request, so all of those requests go to the very small set of read 
handlers.
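
To make the imbalance concrete, here is a rough sketch of how the three 
settings above could carve up the call queues, and of the routing rule as 
described. The handler count of 100, the rounding rules, and the names 
split_queues and route are assumptions for illustration only; this is not the 
actual HBase scheduler code.

```python
def split_queues(handler_count, handler_factor, read_ratio, scan_ratio):
    """Roughly split call queues into write, read (get) and scan groups."""
    # Number of call queues derived from the handler factor.
    total = max(1, round(handler_count * handler_factor))
    # Share of queues reserved for reads; the rest go to writes.
    read_total = max(1, round(total * read_ratio))
    write = max(1, total - read_total)
    # Scans take a slice out of the read share (rounded down).
    scan = int(read_total * scan_ratio)
    return {"write": write, "read": read_total - scan, "scan": scan}


def route(call_kind):
    """Routing rule as described: anything that is not a mutate is a read."""
    return "write" if call_kind == "mutate" else "read"


HANDLER_COUNT = 100  # hypothetical; the real value is not given in this issue
queues = split_queues(HANDLER_COUNT, 0.7, 0.3, 0.2)
print(queues)
print(route("mutate"), route("get"), route("anything-else"))
```

Under these assumptions roughly 70% of the queues are reserved for writes, 
while every non-mutate call competes for the remaining read share.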

The read handlers are all stuck waiting on mutations to hbase:meta.

The call queue is full, so anything going in will block. That is why the 
master is considered dead.
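
The effect of a full call queue can be shown with a generic bounded queue. 
This is only a sketch using Python's standard queue module; the capacity of 2, 
the timeout, and the name try_enqueue are made up for the example and do not 
reflect HBase's real queue implementation.

```python
import queue

# Tiny bounded queue standing in for a call queue whose handlers are stuck.
call_queue = queue.Queue(maxsize=2)  # hypothetical capacity
call_queue.put("call-1")
call_queue.put("call-2")  # queue is now full and nothing is draining it


def try_enqueue(q, call, timeout_s=0.1):
    """Try to enqueue a call; give up if the queue stays full past the timeout."""
    try:
        q.put(call, timeout=timeout_s)
        return True
    except queue.Full:
        return False


# A new call blocks until the timeout expires, then fails.
accepted = try_enqueue(call_queue, "new-call")
print(accepted)
```

From the caller's side a call that never gets queued looks the same as a 
server that is not responding.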

> Regions stuck in transition because master is incorrectly assumed dead
> ----------------------------------------------------------------------
>
>                 Key: HBASE-13635
>                 URL: https://issues.apache.org/jira/browse/HBASE-13635
>             Project: HBase
>          Issue Type: Bug
>          Components: master, regionserver
>    Affects Versions: 1.0.0
>            Reporter: Elliott Clark
>            Assignee: Elliott Clark
>
> On master I see:
> {code}
> 15/05/05 20:56:38 INFO master.HMaster: balance hri=hbase:meta,,1.1588230740, 
> src=hbase1375.prn2.facebook.com,16020,1430858968368, 
> dest=hbase1377.prn2.facebook.com,16020,1430884264554
> 15/05/05 20:56:38 INFO master.RegionStates: Transition {1588230740 
> state=OPEN, ts=1430876450098, 
> server=hbase1375.prn2.facebook.com,16020,1430858968368} to {1588230740 
> state=PENDING_CLOSE, ts=1430884598277, 
> server=hbase1375.prn2.facebook.com,16020,1430858968368}
> Tue May 05 21:01:54 PDT 2015, null, java.net.SocketTimeoutException: 
> callTimeout=60000, callDuration=60724: row '' on table 'hbase:meta' at 
> region=hbase:meta,,1.1588230740, 
> hostname=hbase1375.prn2.facebook.com,16020,1430858968368, seqNum=0
> Caused by: java.net.SocketTimeoutException: callTimeout=60000, 
> callDuration=60724: row '' on table 'hbase:meta' at 
> region=hbase:meta,,1.1588230740, 
> hostname=hbase1375.prn2.facebook.com,16020,1430858968368, seqNum=0
> {code}
> On the regionserver I see the following log spew:
> {code}
> 15/05/06 09:30:11 INFO regionserver.HRegionServer: Failed to report region 
> transition, will retry
> org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the 
> failed servers list: hbasectrl054.prn2.facebook.com/10.104.157.28:16020
>       at 
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:694)
>       at 
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:880)
>       at 
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:849)
>       at 
> org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1173)
>       at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
>       at 
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
>       at 
> org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.reportRegionStateTransition(RegionServerStatusProtos.java:8325)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.reportRegionStateTransition(HRegionServer.java:1863)
>       at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.reportRegionStateTransition(HRegionServer.java:1837)
>       at 
> org.apache.hadoop.hbase.regionserver.handler.CloseRegionHandler.process(CloseRegionHandler.java:157)
>       at 
> org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:128)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
