Hi all,

We have a production cluster where we are seeing periodic client RPC timeouts, as well as responseTooSlow warnings, scanner lease expirations, and HDFS read timeouts on some region servers. We've been trying to diagnose this for a couple of weeks, but so far no luck finding a root cause. Input from anybody with similar past experiences, or thoughts on other things we could look for, would be very much appreciated.
When the issues first showed up, they seemed to be isolated to a single region server, so we suspected hardware problems. However, after dropping that server out of the cluster, two other servers started showing similar problems. We've run hbck and hdfs fsck and they come up clean.

Suspecting the culprit might be long GC pauses in the region server, we enabled GC logging, but that didn't show anything too crazy (occasional promotion failures causing a 4-5 second pause, but even those don't seem to line up with the errors and warnings in the region server logs).

Could this simply be a matter of too much load causing I/O to block for long periods of time? We've been trying to correlate the problems in the region server logs with anything in our environment that might cause huge spikes in read or write load, but so far no smoking gun. We've also tried playing with the OS's disk write buffer settings (vm.dirty_background_ratio and vm.dirty_ratio), but no luck. Our cluster is certainly under moderate read and write load, but nothing that I would have thought would cause problems like 60 second HDFS read timeouts.

Here is one example of those timeouts from the log:

2016-08-05 20:53:19,194 WARN [regionserver60020.replicationSource,a2] hdfs.BlockReaderFactory: I/O error constructing remote block reader.
org.apache.hadoop.net.ConnectTimeoutException: 60000 millis timeout while waiting for channel to be ready for connect.
ch : java.nio.channels.SocketChannel[connection-pending remote=/10.55.30.235:50010]
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
    at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3444)
    at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:777)
    at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:694)
    at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:355)
    at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:618)
    at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:844)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:896)
    at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:697)
    at java.io.DataInputStream.readInt(Unknown Source)
    at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.setTrailerIfPresent(ProtobufLogReader.java:186)
    at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.initInternal(ProtobufLogReader.java:155)
    at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.initReader(ProtobufLogReader.java:106)
    at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.init(ReaderBase.java:69)
    at org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:126)
    at org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:89)
    at org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createReader(HLogFactory.java:77)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationHLogReaderManager.openReader(ReplicationHLogReaderManager.java:68)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.openReader(ReplicationSource.java:503)
    at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource.run(ReplicationSource.java:309)

Those timeouts seem to occur virtually anywhere, not just while replicating the WAL to our other cluster.
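For what it's worth, this is roughly the kind of script we've been using to try to line up long GC pauses with the WARN entries in the region server log. It's a minimal sketch: the log line formats, the 3-second pause threshold, and the 30-second correlation window are all assumptions based on our setup, so adjust to taste.

```python
import re
from datetime import datetime, timedelta

# Matches CMS GC log lines produced with -XX:+PrintGCDateStamps, e.g.
# "2016-08-05T20:53:10.123+0000: 12345.678: [Full GC ... , 4.52 secs]"
# (format is an assumption based on our GC flags).
GC_LINE = re.compile(
    r'^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})\.\d+[+-]\d{4}:.*?([\d.]+) secs\]')

# Matches region server WARN lines, e.g.
# "2016-08-05 20:53:19,194 WARN [regionserver60020...] ..."
RS_LINE = re.compile(r'^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ WARN')

def gc_pauses(lines, min_secs=3.0):
    """Timestamps of GC pauses longer than min_secs."""
    out = []
    for line in lines:
        m = GC_LINE.match(line)
        if m and float(m.group(2)) >= min_secs:
            out.append(datetime.strptime(m.group(1), '%Y-%m-%dT%H:%M:%S'))
    return out

def rs_warnings(lines):
    """Timestamps of WARN entries in the region server log."""
    return [datetime.strptime(m.group(1), '%Y-%m-%d %H:%M:%S')
            for m in (RS_LINE.match(line) for line in lines) if m]

def correlated(pauses, warnings, window_secs=30):
    """Warnings that fall within window_secs of a long GC pause."""
    window = timedelta(seconds=window_secs)
    return [t for t in warnings if any(abs(t - p) <= window for p in pauses)]
```

Running this over our logs is what convinced us the promotion-failure pauses don't explain most of the warnings, since very few WARN timestamps land near a long pause.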
They aren't limited to a single region or even a single table, either.

So, any thoughts on where we could look next? Has anybody seen this before and attributed it to anything other than spiky load? Is there a good way to identify abnormal load spikes?

Thanks,
--Jacob