My HBase replication has stopped

I am on HBase version 1.0.0-cdh5.4.8 (the Cloudera build).

I have two clusters in two different datacenters; one is the master and the other is the slave.



I see the following errors in the log:



2016-04-13 22:32:50,217 WARN org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint: Can't replicate because of a local or network error:
java.io.IOException: Call to hadoop2-private.sjc03.infra.com/10.160.22.99:60020 failed on local exception: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=1014, waitTime=1200001, operationTimeout=1200000 expired.
        at org.apache.hadoop.hbase.ipc.RpcClientImpl.wrapException(RpcClientImpl.java:1255)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1223)
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
        at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.replicateWALEntry(AdminProtos.java:21783)
        at org.apache.hadoop.hbase.protobuf.ReplicationProtbufUtil.replicateWALEntry(ReplicationProtbufUtil.java:65)
        at org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.replicate(HBaseInterClusterReplicationEndpoint.java:161)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:696)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:410)
Caused by: org.apache.hadoop.hbase.ipc.CallTimeoutException: Call id=1014, waitTime=1200001, operationTimeout=1200000 expired.
        at org.apache.hadoop.hbase.ipc.Call.checkAndSetTimeout(Call.java:70)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1197)
        ... 7 more





This in turn fills the call queue on the remote cluster, and I then get:

2016-04-13 22:35:19,555 WARN org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint: Can't replicate because of an error on the remote cluster:
org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.ipc.RpcServer$CallQueueTooBigException): Call queue is full on /0.0.0.0:60020, is hbase.ipc.server.max.callqueue.size too small?
        at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1219)
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:216)
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:300)
        at org.apache.hadoop.hbase.protobuf.generated.AdminProtos$AdminService$BlockingStub.replicateWALEntry(AdminProtos.java:21783)
        at org.apache.hadoop.hbase.protobuf.ReplicationProtbufUtil.replicateWALEntry(ReplicationProtbufUtil.java:65)
        at org.apache.hadoop.hbase.replication.regionserver.HBaseInterClusterReplicationEndpoint.replicate(HBaseInterClusterReplicationEndpoint.java:161)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.shipEdits(ReplicationSource.java:696)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:410)
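
Since the message itself asks whether hbase.ipc.server.max.callqueue.size is too small, one thing I'm looking at is the call queue size on the slave (sink) regionservers. This is a minimal check of what they are currently running with, assuming the stock CDH client config path (if the property isn't set, HBase falls back to its built-in default):

# on a slave-cluster regionserver
grep -A1 'hbase.ipc.server.max.callqueue.size' /etc/hbase/conf/hbase-site.xml
# if it needs raising, I'd change it via Cloudera Manager and rolling-restart
# the slave regionservers -- I have not settled on a value yet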


My peers look good, and replication was working until Mar 27.

We did have an inadvertent outage, but I was able to restore all cluster services.
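
For reference, this is how I have been verifying the peer from the hbase shell on the master cluster (peer id 1 is the slave):

# hbase shell, on the master cluster
list_peers              # shows the peer id, the slave cluster key, and its state (ENABLED in my case)
status 'replication'    # per-regionserver source/sink metrics, pasted below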



status 'replication'
version 1.0.0-cdh5.4.8
5 live servers
    hadoop5-private.wdc01.infra.com:
       SOURCE: PeerID=1, AgeOfLastShippedOp=1538240180,
SizeOfLogQueue=2135, TimeStampsOfLastShippedOp=Sun Mar 27 04:00:42
GMT+00:00 2016, Replication Lag=1539342209
       SINK  : AgeOfLastAppliedOp=0, TimeStampsOfLastAppliedOp=Tue Mar
22 10:09:39 GMT+00:00 2016
    hadoop2-private.wdc01.infra.com:
       SOURCE: PeerID=1, AgeOfLastShippedOp=810222876,
SizeOfLogQueue=1302, TimeStampsOfLastShippedOp=Mon Apr 04 14:31:37
GMT+00:00 2016, Replication Lag=810287122
       SINK  : AgeOfLastAppliedOp=0, TimeStampsOfLastAppliedOp=Fri Mar
25 21:20:59 GMT+00:00 2016
    hadoop4-private.wdc01.infra.com:
       SOURCE: PeerID=1, AgeOfLastShippedOp=602417946,
SizeOfLogQueue=190, TimeStampsOfLastShippedOp=Thu Apr 07 00:06:38
GMT+00:00 2016, Replication Lag=602983605
       SINK  : AgeOfLastAppliedOp=0, TimeStampsOfLastAppliedOp=Mon Apr
04 14:35:56 GMT+00:00 2016
    hadoop1-private.wdc01.infra.com:
       SOURCE: PeerID=1, AgeOfLastShippedOp=602574285,
SizeOfLogQueue=183, TimeStampsOfLastShippedOp=Thu Apr 07 00:10:29
GMT+00:00 2016, Replication Lag=602753383
       SINK  : AgeOfLastAppliedOp=0, TimeStampsOfLastAppliedOp=Thu Apr
07 00:10:23 GMT+00:00 2016
    hadoop3-private.wdc01.infra.com:
       SOURCE: PeerID=1, AgeOfLastShippedOp=602002192,
SizeOfLogQueue=1148, TimeStampsOfLastShippedOp=Thu Apr 07 00:06:52
GMT+00:00 2016, Replication Lag=602971172
       SINK  : AgeOfLastAppliedOp=0, TimeStampsOfLastAppliedOp=Thu Apr
07 00:06:50 GMT+00:00 2016



I can curl the quorum hosts I configured, so I don't think it's a network issue.
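
To be concrete, this is roughly what I ran from a node on the master cluster to convince myself the slave side is reachable (host and ports taken from the logs above):

# slave ZooKeeper answers four-letter-word probes
echo ruok | nc hadoop2-private.sjc03.infra.com 2181     # returns "imok"
# slave regionserver RPC port is open
nc -zv hadoop2-private.sjc03.infra.com 60020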



What can I do to troubleshoot?



I tried to run the following:

hbase org.apache.hadoop.hbase.replication.regionserver.ReplicationSyncUp 100000

and got the following response:

16/04/13 23:37:17 INFO zookeeper.ClientCnxn: Socket connection established, initiating session, client: /10.125.122.237:50784, server: hadoop2-private.sjc03.infra.com/10.160.22.99:2181
16/04/13 23:37:17 INFO zookeeper.ClientCnxn: Session establishment complete on server hadoop2-private.sjc03.infra.com/10.160.22.99:2181, sessionid = 0x252f1a90269f5d6, negotiated timeout = 150000
16/04/13 23:37:17 INFO regionserver.ReplicationSource: Replicating de6643f5-2a36-413e-b55f-8840b26395b1 -> 06a68811-0e50-4802-a478-d199df96bf85
16/04/13 23:37:27 INFO regionserver.ReplicationSource: Closing source 1 because: Region server is closing
16/04/13 23:37:27 WARN regionserver.ReplicationSource: Interrupted while reading edits
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2095)
        at java.util.concurrent.PriorityBlockingQueue.poll(PriorityBlockingQueue.java:553)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.getNextPath(ReplicationSource.java:489)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:308)
16/04/13 23:37:27 INFO zookeeper.ZooKeeper: Session: 0x252f1a90269f5d6 closed
16/04/13 23:37:27 INFO zookeeper.ClientCnxn: EventThread shut down
16/04/13 23:37:27 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x152f1a8ff4ef600
16/04/13 23:37:27 INFO zookeeper.ZooKeeper: Session: 0x152f1a8ff4ef600 closed
16/04/13 23:37:27 INFO zookeeper.ClientCnxn: EventThread shut down
16/04/13 23:37:31 INFO zookeeper.ZooKeeper: Session: 0x153ee0d274c3c6a closed
16/04/13 23:37:31 INFO zookeeper.ClientCnxn: EventThread shut down


I am willing to lose the queue if there is a way to flush it and reset the sync process, because I can distcp the relevant data and manually load my tables to catch up.
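
If dropping the backlog is a sane approach, my understanding (please correct me if I'm wrong) is that removing the peer discards its queued WALs, so something along these lines is what I had in mind; the cluster key below is a placeholder, not my real slave quorum:

# hbase shell, on the master cluster
disable_peer '1'
remove_peer '1'      # as I understand it, this drops the WAL queue for peer 1
add_peer '1', 'zk1.slave.example,zk2.slave.example,zk3.slave.example:2181:/hbase'
enable_peer '1'
# then backfill the gap with CopyTable or a distcp'd export / bulk load

The replicated tables keep REPLICATION_SCOPE => 1 on their column families, so new edits should start flowing again once the peer is back.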


Or, if there are other things I should try in order to find where the log jam is, I'd appreciate pointers.
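
For the record, here is what I'm planning to check next, assuming the default zookeeper.znode.parent of /hbase and the usual CDH log locations:

# WAL backlog per regionserver in the master cluster's ZooKeeper
hbase zkcli ls /hbase/replication/rs
hbase zkcli ls /hbase/replication/peers
# look for slow calls or GC pauses on the slave regionservers around the timeouts
grep -iE 'replicateWALEntry|responseTooSlow|JvmPauseMonitor' /var/log/hbase/*regionserver*.log*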
