Hi all,

We have a production cluster where we're seeing periodic client RPC timeouts,
as well as responseTooSlow warnings, scanner lease expirations, and HDFS read
timeouts on some region servers. We've been trying to diagnose this for a
couple of weeks, but still haven't found a root cause. Any similar past
experiences, or thoughts on other things we could look for, would be very much
appreciated.

When the issues first showed up, it seemed to be isolated to a single region 
server so we suspected hardware issues. However, after dropping it out of the 
cluster, two other servers started showing similar problems. We've run hbck and 
hdfs fsck and they come up clean.

Suspecting the culprit might be long GC pauses in the region servers, we
enabled GC logging, but that didn't show anything too crazy (occasional
promotion failures causing a 4-5 second pause, but even those don't seem to
line up with the errors and warnings in the region server logs).
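For reference, the kind of GC logging we enabled, via HBASE_OPTS in
hbase-env.sh (flags and log path from memory, JDK 7/8 syntax, so treat this as
approximate):

```shell
# GC logging flags appended to HBASE_OPTS (HotSpot JDK 7/8 syntax).
# The log path is illustrative; point it wherever your logs live.
export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime \
  -Xloggc:/var/log/hbase/gc-regionserver.log"
```

-XX:+PrintGCApplicationStoppedTime is the one that surfaced the 4-5 second
pauses, since it records total stop-the-world time rather than just collection
time.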

Could this simply be a matter of too much load causing I/O to block for long
periods of time? We've been trying to correlate the problems in the region
server logs with anything in our environment that might cause huge spikes in
read or write load, but so far no smoking gun. We've also tried playing with
the OS's disk write-buffer settings (vm.dirty_background_ratio and
vm.dirty_ratio), but no luck. The cluster is certainly under moderate read and
write load, but nothing I would have expected to cause problems like the
60-second HDFS read timeouts. Here is one example of those timeouts from the
log:

2016-08-05 20:53:19,194 WARN  [regionserver60020.replicationSource,a2] hdfs.BlockReaderFactory: I/O error constructing remote block reader.
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=/10.55.30.235:50010]
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
        at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3444)
        at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:777)
        at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:694)
        at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:355)
        at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:618)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:844)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:896)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:697)
        at java.io.DataInputStream.readInt(Unknown Source)
        at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.setTrailerIfPresent(ProtobufLogReader.java:186)
        at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.initInternal(ProtobufLogReader.java:155)
        at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.initReader(ProtobufLogReader.java:106)
        at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:69)
        at org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:126)
        at org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:89)
        at org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:77)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:68)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:503)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:309)

Those timeouts seem to occur virtually anywhere, not just while replicating the 
WAL to our other cluster. And they aren't limited to a single region or even a 
single table.
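For concreteness, the vm.dirty tweaks mentioned above look like this (the
values here are just examples of the sort of thing we tried, not a
recommendation):

```shell
# Lower the dirty-page writeback thresholds so background flushing starts
# earlier and writers block in smaller bursts. Example values only; requires
# root, and goes in /etc/sysctl.conf to persist across reboots.
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10
```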

So, any thoughts on where we could look next? Has anybody seen this before and
attributed it to anything other than spiky load? Is there a good way to
identify abnormal load spikes?
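For what it's worth, the crude approach we've been using so far is just
bucketing the responseTooSlow warnings per minute with grep/awk to see whether
they cluster (it assumes the standard timestamp format shown in the log
excerpt above):

```shell
# Crude spike detection: count responseTooSlow warnings per minute.
# Assumes log lines start with "YYYY-MM-DD HH:MM:SS,mmm" as in the excerpt.
bucket_slow_warnings() {
  # stdin: a region server log; stdout: lines of "YYYY-MM-DD HH:MM count"
  grep 'responseTooSlow' |
    awk '{ print $1, substr($2, 1, 5) }' |
    sort | uniq -c |
    awk '{ print $2, $3, $1 }'
}
```

Then something like `bucket_slow_warnings < regionserver.log | sort -k3 -rn |
head` shows the worst minutes first, which we then try to line up against
compaction activity and client traffic.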

Thanks,

--Jacob
