Hi Serega,

The iostat data shows a very sharp spike in await time (I know it is outside
the time range of the logs) and utilization is high, but it is hard to tell
whether the drives in the DNs are continuously saturated, since this looks
like an averaged metric. Is this some kind of virtualized environment? Are
you using a NAS for the data volumes? Looking at the logs in context, it
seems there is one bad DataNode that is causing trouble in the HDFS write
pipeline.
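
One thing that would help is raw per-device numbers instead of the averaged
chart. A rough sketch (assuming sysstat is installed on the DNs; adjust the
interval and duration to cover the 12:30-13:00 window):

  # extended per-device stats, KB units, every 5 seconds for ~1 hour
  iostat -dxk 5 720

Watching the await, avgqu-sz and %util columns for the disks that back the
HDFS data directories should show whether individual drives are pegged
continuously or only spiking.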

cheers,
esteban.



--
Cloudera, Inc.


On Thu, Apr 23, 2015 at 1:37 PM, Serega Sheypak <serega.shey...@gmail.com>
wrote:

> Hi, is there any input here? What should we monitor?
>
> 2015-04-22 20:55 GMT+02:00 Serega Sheypak <serega.shey...@gmail.com>:
>
> > Here are the disk stats. The sadness appeared at 12.30 - 13.00
> >
> >
> https://www.dropbox.com/s/lj4r8o10buv1n2o/Screenshot%202015-04-22%2020.48.18.png?dl=0
> >
> > 2015-04-22 20:41 GMT+02:00 Serega Sheypak <serega.shey...@gmail.com>:
> >
> >> Here is an image
> >>
> >> 2015-04-22 20:40 GMT+02:00 Serega Sheypak <serega.shey...@gmail.com>:
> >>
> >>> Here are datanode logs from 5.9.41.237 <http://5.9.41.237:50010/>,
> >>> regionserver logs were from 5.9.41.237 <http://5.9.41.237:50010/> also
> >>>
> >>> EQUEST_SHORT_CIRCUIT_FDS, blockid: 1078130838, srvID:
> >>> 659e6be2-8d98-458b-94bc-3bcdbb517508, success: true
> >>> 2015-04-22 12:46:17,154 INFO
> >>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> >>> 127.0.0.1, dest: 127.0.0.1, op: RELEASE_SHORT_CIRCUIT_FDS, shmId:
> >>> bd37825f7a445e2da6796940ebb754d6, slotIdx: 24, srvID:
> >>> 659e6be2-8d98-458b-94bc-3bcdbb517508, success: true
> >>> 2015-04-22 12:46:17,204 INFO
> >>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> >>> 127.0.0.1, dest: 127.0.0.1, op: RELEASE_SHORT_CIRCUIT_FDS, shmId:
> >>> aea53fb897c383f3dec304ed618db0df, slotIdx: 4, srvID:
> >>> 659e6be2-8d98-458b-94bc-3bcdbb517508, success: true
> >>> 2015-04-22 12:46:17,219 INFO
> >>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> >>> 127.0.0.1, dest: 127.0.0.1, op: RELEASE_SHORT_CIRCUIT_FDS, shmId:
> >>> 0178cc0245fcdc9e1dd75c5f8c6da1eb, slotIdx: 0, srvID:
> >>> 659e6be2-8d98-458b-94bc-3bcdbb517508, success: true
> >>> 2015-04-22 12:46:17,236 INFO
> >>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src:
> >>> 127.0.0.1, dest: 127.0.0.1, op: RELEASE_SHORT_CIRCUIT_FDS, shmId:
> >>> aea53fb897c383f3dec304ed618db0df, slotIdx: 102, srvID:
> >>> 659e6be2-8d98-458b-94bc-3bcdbb517508, success: true
> >>> 2015-04-22 12:46:17,573 INFO
> >>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> >>> 5.9.41.237:50010, dest: /5.9.65.143:36281, bytes: 4608, op: HDFS_READ,
> >>> cliID: DFSClient_hb_rs_domain06.myhost.ru
> ,60020,1426776843636_-1084357126_33,
> >>> offset: 28435456, srvID: 659e6be2-8d98-458b-94bc-3bcdbb517508, blockid:
> >>> BP-927943268-5.9.77.105-1414682145673:blk_1078147715_4407483, duration:
> >>> 12396486
> >>> 2015-04-22 12:46:17,596 INFO
> >>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> >>> 5.9.41.237:50010, dest: /78.46.48.37:41539, bytes: 4608, op:
> HDFS_READ,
> >>> cliID: DFSClient_hb_rs_domain13.myhost.ru
> ,60020,1429640630559_-531755738_33,
> >>> offset: 56052736, srvID: 659e6be2-8d98-458b-94bc-3bcdbb517508, blockid:
> >>> BP-927943268-5.9.77.105-1414682145673:blk_1078106901_4366668, duration:
> >>> 37455427
> >>> 2015-04-22 12:46:17,821 INFO
> >>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> >>> 5.9.41.237:50010, dest: /5.9.67.100:58718, bytes: 5120, op: HDFS_READ,
> >>> cliID: DFSClient_hb_rs_domain03.myhost.ru
> ,60020,1426877064271_-1246826933_33,
> >>> offset: 77630464, srvID: 659e6be2-8d98-458b-94bc-3bcdbb517508, blockid:
> >>> BP-927943268-5.9.77.105-1414682145673:blk_1078119757_4379524, duration:
> >>> 16386940
> >>> 2015-04-22 12:46:18,769 INFO
> >>> org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to send data:
> >>> java.net.SocketTimeoutException: 480000 millis timeout while waiting
> for
> >>> channel to be ready for write. ch :
> >>> java.nio.channels.SocketChannel[connected local=/5.9.41.237:50010
> >>> remote=/78.46.48.37:40503]
> >>> 2015-04-22 12:46:18,769 INFO
> >>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> >>> 5.9.41.237:50010, dest: /78.46.48.37:40503, bytes: 393216, op:
> >>> HDFS_READ, cliID: DFSClient_hb_rs_domain13.myhost.ru
> ,60020,1429640630559_-531755738_33,
> >>> offset: 3584, srvID: 659e6be2-8d98-458b-94bc-3bcdbb517508, blockid:
> >>> BP-927943268-5.9.77.105-1414682145673:blk_1078155386_4415154, duration:
> >>> 480345113893
> >>> 2015-04-22 12:46:18,769 WARN
> >>> org.apache.hadoop.hdfs.server.datanode.DataNode:
> >>> DatanodeRegistration(5.9.41.237,
> >>> datanodeUuid=659e6be2-8d98-458b-94bc-3bcdbb517508, infoPort=50075,
> >>> ipcPort=50020, storageInfo=lv=-56;cid=cluster11;nsid=527111981;c=0):Got
> >>> exception while serving
> >>> BP-927943268-5.9.77.105-1414682145673:blk_1078155386_4415154 to /
> >>> 78.46.48.37:40503
> >>> java.net.SocketTimeoutException: 480000 millis timeout while waiting
> for
> >>> channel to be ready for write. ch :
> >>> java.nio.channels.SocketChannel[connected local=/5.9.41.237:50010
> >>> remote=/78.46.48.37:40503]
> >>> at
> >>>
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
> >>> at
> >>>
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
> >>> at
> >>>
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:547)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:716)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:486)
> >>> at
> >>>
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:111)
> >>> at
> >>>
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:69)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:225)
> >>> at java.lang.Thread.run(Thread.java:745)
> >>> 2015-04-22 12:46:18,770 ERROR
> >>> org.apache.hadoop.hdfs.server.datanode.DataNode:
> domain02.myhost.ru:50010:DataXceiver
> >>> error processing READ_BLOCK operation  src: /78.46.48.37:40503 dst: /
> >>> 5.9.41.237:50010
> >>> java.net.SocketTimeoutException: 480000 millis timeout while waiting
> for
> >>> channel to be ready for write. ch :
> >>> java.nio.channels.SocketChannel[connected local=/5.9.41.237:50010
> >>> remote=/78.46.48.37:40503]
> >>> at
> >>>
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
> >>> at
> >>>
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
> >>> at
> >>>
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:547)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:716)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:486)
> >>> at
> >>>
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:111)
> >>> at
> >>>
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:69)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:225)
> >>> at java.lang.Thread.run(Thread.java:745)
> >>> 2015-04-22 12:46:18,897 INFO
> >>> org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to send data:
> >>> java.net.SocketTimeoutException: 480000 millis timeout while waiting
> for
> >>> channel to be ready for write. ch :
> >>> java.nio.channels.SocketChannel[connected local=/5.9.41.237:50010
> >>> remote=/5.9.74.13:34323]
> >>> 2015-04-22 12:46:18,897 INFO
> >>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> >>> 5.9.41.237:50010, dest: /5.9.74.13:34323, bytes: 393216, op:
> HDFS_READ,
> >>> cliID: DFSClient_hb_rs_domain09.myhost.ru
> ,60020,1426772397450_1095315710_33,
> >>> offset: 3584, srvID: 659e6be2-8d98-458b-94bc-3bcdbb517508, blockid:
> >>> BP-927943268-5.9.77.105-1414682145673:blk_1078159305_4419073, duration:
> >>> 480265217202
> >>> 2015-04-22 12:46:18,897 WARN
> >>> org.apache.hadoop.hdfs.server.datanode.DataNode:
> >>> DatanodeRegistration(5.9.41.237,
> >>> datanodeUuid=659e6be2-8d98-458b-94bc-3bcdbb517508, infoPort=50075,
> >>> ipcPort=50020, storageInfo=lv=-56;cid=cluster11;nsid=527111981;c=0):Got
> >>> exception while serving
> >>> BP-927943268-5.9.77.105-1414682145673:blk_1078159305_4419073 to /
> >>> 5.9.74.13:34323
> >>> java.net.SocketTimeoutException: 480000 millis timeout while waiting
> for
> >>> channel to be ready for write. ch :
> >>> java.nio.channels.SocketChannel[connected local=/5.9.41.237:50010
> >>> remote=/5.9.74.13:34323]
> >>> at
> >>>
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
> >>> at
> >>>
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
> >>> at
> >>>
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:547)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:716)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:486)
> >>> at
> >>>
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:111)
> >>> at
> >>>
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:69)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:225)
> >>> at java.lang.Thread.run(Thread.java:745)
> >>> 2015-04-22 12:46:18,898 ERROR
> >>> org.apache.hadoop.hdfs.server.datanode.DataNode:
> domain02.myhost.ru:50010:DataXceiver
> >>> error processing READ_BLOCK operation  src: /5.9.74.13:34323 dst: /
> >>> 5.9.41.237:50010
> >>> java.net.SocketTimeoutException: 480000 millis timeout while waiting
> for
> >>> channel to be ready for write. ch :
> >>> java.nio.channels.SocketChannel[connected local=/5.9.41.237:50010
> >>> remote=/5.9.74.13:34323]
> >>> at
> >>>
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
> >>> at
> >>>
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
> >>> at
> >>>
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:547)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:716)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:486)
> >>> at
> >>>
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:111)
> >>> at
> >>>
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:69)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:225)
> >>> at java.lang.Thread.run(Thread.java:745)
> >>> 2015-04-22 12:46:18,933 INFO
> >>> org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to send data:
> >>> java.net.SocketTimeoutException: 480000 millis timeout while waiting
> for
> >>> channel to be ready for write. ch :
> >>> java.nio.channels.SocketChannel[connected local=/5.9.41.237:50010
> >>> remote=/5.9.37.17:48803]
> >>> 2015-04-22 12:46:18,933 INFO
> >>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> >>> 5.9.41.237:50010, dest: /5.9.37.17:48803, bytes: 393216, op:
> HDFS_READ,
> >>> cliID: DFSClient_hb_rs_domain07.myhost.ru
> ,60020,1426776403757_1987133929_33,
> >>> offset: 3584, srvID: 659e6be2-8d98-458b-94bc-3bcdbb517508, blockid:
> >>> BP-927943268-5.9.77.105-1414682145673:blk_1078158534_4418302, duration:
> >>> 480375396714
> >>> 2015-04-22 12:46:18,933 WARN
> >>> org.apache.hadoop.hdfs.server.datanode.DataNode:
> >>> DatanodeRegistration(5.9.41.237,
> >>> datanodeUuid=659e6be2-8d98-458b-94bc-3bcdbb517508, infoPort=50075,
> >>> ipcPort=50020, storageInfo=lv=-56;cid=cluster11;nsid=527111981;c=0):Got
> >>> exception while serving
> >>> BP-927943268-5.9.77.105-1414682145673:blk_1078158534_4418302 to /
> >>> 5.9.37.17:48803
> >>> java.net.SocketTimeoutException: 480000 millis timeout while waiting
> for
> >>> channel to be ready for write. ch :
> >>> java.nio.channels.SocketChannel[connected local=/5.9.41.237:50010
> >>> remote=/5.9.37.17:48803]
> >>> at
> >>>
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
> >>> at
> >>>
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
> >>> at
> >>>
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:547)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:716)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:486)
> >>> at
> >>>
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:111)
> >>> at
> >>>
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:69)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:225)
> >>> at java.lang.Thread.run(Thread.java:745)
> >>> 2015-04-22 12:46:18,934 ERROR
> >>> org.apache.hadoop.hdfs.server.datanode.DataNode:
> domain02.myhost.ru:50010:DataXceiver
> >>> error processing READ_BLOCK operation  src: /5.9.37.17:48803 dst: /
> >>> 5.9.41.237:50010
> >>> java.net.SocketTimeoutException: 480000 millis timeout while waiting
> for
> >>> channel to be ready for write. ch :
> >>> java.nio.channels.SocketChannel[connected local=/5.9.41.237:50010
> >>> remote=/5.9.37.17:48803]
> >>> at
> >>>
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
> >>> at
> >>>
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
> >>> at
> >>>
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:547)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:716)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:486)
> >>> at
> >>>
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:111)
> >>> at
> >>>
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:69)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:225)
> >>> at java.lang.Thread.run(Thread.java:745)
> >>> 2015-04-22 12:46:20,478 INFO
> >>> org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to send data:
> >>> java.net.SocketTimeoutException: 480000 millis timeout while waiting
> for
> >>> channel to be ready for write. ch :
> >>> java.nio.channels.SocketChannel[connected local=/5.9.41.237:50010
> >>> remote=/5.9.77.105:36871]
> >>> 2015-04-22 12:46:20,478 INFO
> >>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> >>> 5.9.41.237:50010, dest: /5.9.77.105:36871, bytes: 327680, op:
> >>> HDFS_READ, cliID: DFSClient_hb_rs_domain01.myhost.ru
> ,60020,1427209265829_704706779_33,
> >>> offset: 3584, srvID: 659e6be2-8d98-458b-94bc-3bcdbb517508, blockid:
> >>> BP-927943268-5.9.77.105-1414682145673:blk_1078159312_4419080, duration:
> >>> 480489227857
> >>> 2015-04-22 12:46:20,478 WARN
> >>> org.apache.hadoop.hdfs.server.datanode.DataNode:
> >>> DatanodeRegistration(5.9.41.237,
> >>> datanodeUuid=659e6be2-8d98-458b-94bc-3bcdbb517508, infoPort=50075,
> >>> ipcPort=50020, storageInfo=lv=-56;cid=cluster11;nsid=527111981;c=0):Got
> >>> exception while serving
> >>> BP-927943268-5.9.77.105-1414682145673:blk_1078159312_4419080 to /
> >>> 5.9.77.105:36871
> >>> java.net.SocketTimeoutException: 480000 millis timeout while waiting
> for
> >>> channel to be ready for write. ch :
> >>> java.nio.channels.SocketChannel[connected local=/5.9.41.237:50010
> >>> remote=/5.9.77.105:36871]
> >>> at
> >>>
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
> >>> at
> >>>
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
> >>> at
> >>>
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:547)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:716)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:486)
> >>> at
> >>>
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:111)
> >>> at
> >>>
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:69)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:225)
> >>> at java.lang.Thread.run(Thread.java:745)
> >>> 2015-04-22 12:46:20,478 ERROR
> >>> org.apache.hadoop.hdfs.server.datanode.DataNode:
> domain02.myhost.ru:50010:DataXceiver
> >>> error processing READ_BLOCK operation  src: /5.9.77.105:36871 dst: /
> >>> 5.9.41.237:50010
> >>> java.net.SocketTimeoutException: 480000 millis timeout while waiting
> for
> >>> channel to be ready for write. ch :
> >>> java.nio.channels.SocketChannel[connected local=/5.9.41.237:50010
> >>> remote=/5.9.77.105:36871]
> >>> at
> >>>
> org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
> >>> at
> >>>
> org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:172)
> >>> at
> >>>
> org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:220)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:547)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:716)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:486)
> >>> at
> >>>
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:111)
> >>> at
> >>>
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:69)
> >>> at
> >>>
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:225)
> >>> at java.lang.Thread.run(Thread.java:745)
> >>> 2015-04-22 12:46:20,677 INFO
> >>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> >>> 5.9.41.237:50010, dest: /5.9.65.143:36286, bytes: 5120, op: HDFS_READ,
> >>> cliID: DFSClient_hb_rs_domain06.myhost.ru
> ,60020,1426776843636_-1084357126_33,
> >>> offset: 86653440, srvID: 659e6be2-8d98-458b-94bc-3bcdbb517508, blockid:
> >>> BP-927943268-5.9.77.105-1414682145673:blk_1078128115_4387882, duration:
> >>> 17623602
> >>> 2015-04-22 12:46:20,690 INFO
> >>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> >>> 5.9.41.237:50010, dest: /78.46.48.38:57022, bytes: 5120, op:
> HDFS_READ,
> >>> cliID: DFSClient_hb_rs_domain14.myhost.ru
> ,60020,1429640630771_156324173_33,
> >>> offset: 103135232, srvID: 659e6be2-8d98-458b-94bc-3bcdbb517508,
> blockid:
> >>> BP-927943268-5.9.77.105-1414682145673:blk_1078120513_4380280, duration:
> >>> 24431169
> >>> 2015-04-22 12:46:20,747 INFO
> >>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
> >>> 5.9.41.237:50010, dest: /5.9.40.136:34501, bytes: 5120, op: HDFS_READ,
> >>> cliID: DFSClient_hb_rs_domain04.myhost.ru
> ,60020,1426774142736_-1119788258_33,
> >>> offset: 78469632, srvID: 659e6be2-8d98-458b-94bc-3bcdbb517508, blockid:
> >>> BP-927943268-5.9.77.105-1414682145673:blk_1078108223_4367990, duration:
> >>> 76055489
> >>> 2015-04-22 12:46:20,758 INFO
> >>> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving
> >>> BP-927943268-5.9.77.105-1414682145673:blk_1078160469_4420237 src: /
> >>> 5.9.74.13:35440 dest: /5.9.41.237:50010
> >>>
> >>>
> >>> 2015-04-22 20:30 GMT+02:00 Jean-Marc Spaggiari <
> jean-m...@spaggiari.org>
> >>> :
> >>>
> >>>> Did you send the logs from one of those 3 servers? [5.9.41.237:50010,
> >>>> 5.9.77.105:50010, 5.9.73.19:50010]
> >>>>
> >>>> Sounds like something is slowing down everything. Can you extract DN
> logs
> >>>> for the same time?
> >>>>
> >>>> Do you have any tool monitoring the disks and network latency over
> time?
> >>>>
> >>>> If not, can you run iostat and try to reproduce the issue?
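
(If nothing is collecting this today, a low-tech sketch is to leave a
timestamped iostat running on each DN while reproducing -- the log path and
30s interval below are arbitrary:

  # -t stamps every report; note the first report is the since-boot average
  nohup iostat -dxkt 30 >> /var/tmp/iostat-$(hostname).log 2>&1 &

plus something like ping -i 5 <peer-datanode> in parallel for network
latency, then line the timestamps up with the DN/RS logs.)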
> >>>>
> >>>> JM
> >>>>
> >>>> 2015-04-22 14:23 GMT-04:00 Serega Sheypak <serega.shey...@gmail.com>:
> >>>>
> >>>> > > major compaction runs daily.
> >>>> >
> >>>> > >What else do you see in the RS logs?
> >>>> > no errors, only *Slow sync cost*
> >>>> >
> >>>> > >What does iostat look like?
> >>>> > please see the image. 12.00 - 12.30 is when reading/writing stopped
> >>>> > [image: Inline image 1]
> >>>> >
> >>>> >
> >>>> > >Can you share the logs around the time this occurs?
> >>>> > 2015-04-22 12:53:09,996 WARN org.apache.hadoop.ipc.RpcServer:
> >>>> > (responseTooSlow):
> >>>> >
> >>>>
> {"processingtimems":20128,"call":"Multi(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MultiRequest)","client":"
> >>>> > 5.9.75.155:58344
> >>>> >
> >>>>
> ","starttimems":1429692769868,"queuetimems":28034,"class":"HRegionServer","responsesize":8,"method":"Multi"}
> >>>> > 2015-04-22 12:53:09,996 WARN org.apache.hadoop.ipc.RpcServer:
> >>>> > (responseTooSlow):
> >>>> >
> >>>>
> {"processingtimems":21434,"call":"Mutate(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MutateRequest)","client":"
> >>>> > 46.4.0.110:60149
> >>>> >
> >>>>
> ","starttimems":1429692768562,"queuetimems":44263,"class":"HRegionServer","responsesize":2,"method":"Mutate"}
> >>>> > 2015-04-22 12:53:10,093 WARN org.apache.hadoop.ipc.RpcServer:
> >>>> > (responseTooSlow):
> >>>> >
> >>>>
> {"processingtimems":17997,"call":"Mutate(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MutateRequest)","client":"46.4.0.
> >>>> >
> >>>> > 2015-04-22 12:53:10,270 WARN org.apache.hadoop.ipc.RpcServer:
> >>>> > (responseTooSlow):
> >>>> >
> >>>>
> {"processingtimems":18175,"call":"Mutate(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$MutateRequest)","client":"
> >>>> > 144.76.218.107:48620
> >>>> >
> >>>>
> ","starttimems":1429692772095,"queuetimems":49253,"class":"HRegionServer","responsesize":2,"method":"Mutate"}
> >>>> > 2015-04-22 12:53:10,315 INFO
> >>>> > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost: 319
> >>>> ms,
> >>>> > current pipeline: [5.9.41.237:50010, 5.9.77.105:50010,
> >>>> 5.9.73.19:50010]
> >>>> > 2015-04-22 12:53:10,315 INFO
> >>>> > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost: 319
> >>>> ms,
> >>>> > current pipeline: [5.9.41.237:50010, 5.9.77.105:50010,
> >>>> 5.9.73.19:50010]
> >>>> > 2015-04-22 12:53:10,315 INFO
> >>>> > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost: 319
> >>>> ms,
> >>>> > current pipeline: [5.9.41.237:50010, 5.9.77.105:50010,
> >>>> 5.9.73.19:50010]
> >>>> > 2015-04-22 12:53:10,315 INFO
> >>>> > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost: 319
> >>>> ms,
> >>>> > current pipeline: [5.9.41.237:50010, 5.9.77.105:50010,
> >>>> 5.9.73.19:50010]
> >>>> > 2015-04-22 12:53:10,315 WARN org.apache.hadoop.ipc.RpcServer:
> >>>> > (responseTooSlow):
> >>>> > {"processingtimems":20231,"call":"Mutate(org.apache.hadoop.hbase.
> >>>> >
> >>>> > 2015-04-22 12:53:11,726 INFO
> >>>> org.apache.hadoop.hbase.regionserver.HStore:
> >>>> > Added
> >>>> >
> >>>>
> hdfs://nameservice1/hbase/data/default/cross_id_to_rtb_user/d2e01ed1de47dd92edb5c1de37277d4a/c/a93405e51088418ba95241cf53456011,
> >>>> > entries=926, sequenceid=43403085, filesize=258.5 K
> >>>> > 2015-04-22 12:53:11,726 INFO
> >>>> org.apache.hadoop.hbase.regionserver.HRegion:
> >>>> > Finished memstore flush of ~3.8 M/3985504, currentsize=12.6 K/12912
> >>>> for
> >>>> > region
> >>>> >
> >>>>
> cross_id_to_rtb_user,aadf0929,1429256709528.d2e01ed1de47dd92edb5c1de37277d4a.
> >>>> > in 857ms, sequenceid=43403085, compaction requested=true
> >>>> > 2015-04-22 12:53:11,726 INFO
> >>>> org.apache.hadoop.hbase.regionserver.HRegion:
> >>>> > Starting compaction on c in region
> >>>> >
> >>>>
> cross_id_to_rtb_user,aadf0929,1429256709528.d2e01ed1de47dd92edb5c1de37277d4a.
> >>>> > 2015-04-22 12:53:11,726 INFO
> >>>> org.apache.hadoop.hbase.regionserver.HStore:
> >>>> > Starting compaction of 3 file(s) in c of
> >>>> >
> >>>>
> cross_id_to_rtb_user,aadf0929,1429256709528.d2e01ed1de47dd92edb5c1de37277d4a.
> >>>> > into
> >>>> >
> >>>>
> tmpdir=hdfs://nameservice1/hbase/data/default/cross_id_to_rtb_user/d2e01ed1de47dd92edb5c1de37277d4a/.tmp,
> >>>> > totalSize=8.0 M
> >>>> > 2015-04-22 12:53:11,917 INFO
> >>>> > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost: 417
> >>>> ms,
> >>>> > current pipeline: [5.9.41.237:50010, 5.9.77.105:50010,
> >>>> 5.9.73.19:50010]
> >>>> > 2015-04-22 12:53:11,939 INFO
> >>>> > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost: 340
> >>>> ms,
> >>>> > current pipeline: [5.9.41.237:50010, 5.9.77.105:50010,
> >>>> 5.9.73.19:50010]
> >>>> > 2015-04-22 12:53:11,939 INFO
> >>>> > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost: 340
> >>>> ms,
> >>>> > current pipeline: [5.9.41.237:50010, 5.9.77.105:50010,
> >>>> 5.9.73.19:50010]
> >>>> > 2015-04-22 12:53:11,939 INFO
> >>>> > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost: 340
> >>>> ms,
> >>>> > current pipeline: [5.9.41.237:50010, 5.9.77.105:50010,
> >>>> 5.9.73.19:50010]
> >>>> > 2015-04-22 12:53:11,939 INFO
> >>>> > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost: 340
> >>>> ms,
> >>>> > current pipeline: [5.9.41.237:50010, 5.9.77.105:50010,
> >>>> 5.9.73.19:50010]
> >>>> > 2015-04-22 12:53:12,199 INFO
> >>>> > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost: 282
> >>>> ms,
> >>>> > current pipeline: [5.9.41.237:50010, 5.9.77.105:50010,
> >>>> 5.9.73.19:50010]
> >>>> > 2015-04-22 12:53:13,132 INFO
> >>>> > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost:
> 1193
> >>>> ms,
> >>>> > current pipeline: [5.9.41.237:50010, 5.9.77.105:50010,
> >>>> 5.9.73.19:50010]
> >>>> > 2015-04-22 12:53:13,132 INFO
> >>>> > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost:
> 1193
> >>>> ms,
> >>>> > current pipeline: [5.9.41.237:50010, 5.9.77.105:50010,
> >>>> 5.9.73.19:50010]
> >>>> > 2015-04-22 12:53:13,132 INFO
> >>>> > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost:
> 1193
> >>>> ms,
> >>>> > current pipeline: [5.9.41.237:50010, 5.9.77.105:50010,
> >>>> 5.9.73.19:50010]
> >>>> > 2015-04-22 12:53:13,132 INFO
> >>>> > org.apache.hadoop.hbase.regionserver.wal.FSHLog: Slow sync cost:
> 1193
> >>>> ms,
> >>>> > current pipeline: [5.9.41.237:50010, 5.9.77.105:50010,
> >>>> 5.9.73.19:50010]
> >>>> > 2015-04-22 12:53:13,247 INFO
> >>>> > org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher: Flushed,
> >>>> > sequenceid=41975630, memsize=3.6 M, hasBloomFilter=true, into tmp
> file
> >>>> >
> >>>>
> hdfs://nameservice1/hbase/data/default/rtb_user_to_cross_id/c54d51dd7f20377ea34f05824abbaf76/.tmp/3495a2b487fa448bb9e0b50050b11c3a
> >>>> > 2015-04-22 12:53:13,301 INFO
> >>>> org.apache.hadoop.hbase.regionserver.HStore:
> >>>> > Added
> >>>> >
> >>>>
> hdfs://nameservice1/hbase/data/default/rtb_user_to_cross_id/c54d51dd7f20377ea34f05824abbaf76/c/3495a2b487fa448bb9e0b50050b11c3a,
> >>>> > entries=886, sequenceid=41975630, filesize=272.9 K
> >>>> >
> >>>> > 2015-04-22 19:54 GMT+02:00 Ted Yu <yuzhih...@gmail.com>:
> >>>> >
> >>>> >> Serega:
> >>>> >> How often is major compaction run in your cluster ?
> >>>> >>
> >>>> >> Have you configured offpeak compaction ?
> >>>> >> See related parameters in:
> >>>> >> http://hbase.apache.org/book.html#compaction.parameters
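
(For reference, the offpeak knobs behind that link are roughly these in
hbase-site.xml -- the hours and ratio below are only illustrative, pick your
own low-traffic window:

  hbase.offpeak.start.hour = 2
  hbase.offpeak.end.hour = 6
  hbase.hstore.compaction.ratio.offpeak = 5.0

With those set, larger compactions are preferred inside the 02:00-06:00
window; the daily major compaction can be scheduled into the same window, or
automatic majors disabled via hbase.hregion.majorcompaction = 0 and triggered
from the shell with major_compact instead.)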
> >>>> >>
> >>>> >> Cheers
> >>>> >>
> >>>> >> On Wed, Apr 22, 2015 at 10:39 AM, Serega Sheypak <
> >>>> >> serega.shey...@gmail.com>
> >>>> >> wrote:
> >>>> >>
> >>>> >> > Hi, we have a 10-node cluster running HBase 0.98 (CDH 5.2.1).
> >>>> >> > Sometimes HBase gets stuck.
> >>>> >> > We have several apps constantly reading/writing data from it.
> >>>> >> > Sometimes we see that app response times increase dramatically:
> >>>> >> > the app spends seconds reading from/writing to HBase, while 99%
> >>>> >> > of the time it takes 20 ms.
> >>>> >> >
> >>>> >> > I suppose that compactions/major compactions could be the root
> >>>> >> > cause. I see that compactions start at the same time we have
> >>>> >> > problems with the app. Could it be so? HBase can't write to the
> >>>> >> > WAL because compactions consume all the IO, so the apps stop
> >>>> >> > writing data?
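
(One quick way to test that theory -- a rough sketch, and the log path is an
assumption since CDH/Cloudera Manager keeps RegionServer logs in its own
location on your nodes -- is to bucket compaction starts and slow WAL syncs
by minute and see whether they line up with the 12:30-13:00 window:

  grep -h 'Starting compaction' /var/log/hbase/*regionserver*.log* | cut -c1-16 | sort | uniq -c
  grep -h 'Slow sync cost' /var/log/hbase/*regionserver*.log* | cut -c1-16 | sort | uniq -c

If the slow syncs only cluster around compaction activity, the IO-starvation
theory fits; if they also show up with no compaction running, the single
misbehaving DataNode is the more likely culprit.)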
> >>>> >> >
> >>>> >>
> >>>> >
> >>>> >
> >>>>
> >>>
> >>>
> >>
> >
>
