Hello,

The following problem has occurred three times for us in the past two weeks, 
twice in the last 24 hours. 

Our current setup is a single server which runs "everything": namenode, 
datanode and client. We're using JNI in the client as an interface to HDFS.

It seems as if, after a while of operation, the client is unable to write any 
more data to the datanode and throws an exception. The symptoms seem to be 
similar to http://issues.apache.org/jira/browse/HDFS-693 .

Here are the relevant logs from the client:


-----

09/11/11 10:58:21 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor 
exception  for block 
blk_-5973015382562856275_32855java.net.SocketTimeoutException: 63000 millis 
timeout while waiting for channel to be ready for read. ch : 
java.nio.channels.SocketChannel[connected local=/127.0.0.1:52549 
remote=/127.0.0.1:50010]
        at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
        at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
        at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at java.io.DataInputStream.readLong(DataInputStream.java:399)
        at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2367)

09/11/11 10:58:22 WARN hdfs.DFSClient: Error Recovery for block 
blk_-5973015382562856275_32855 bad datanode[0] 127.0.0.1:50010

Exception in thread "Thread-9" java.io.IOException: All datanodes 
127.0.0.1:50010 are bad. Aborting...
        at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2493)
        at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2046)
        at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2209)
Call to org.apache.hadoop.fs.FSDataOutputStream::write failed!

-----

I'm not entirely sure what to make of this: it seems like some sort of hiccup, 
since the datanode is available again shortly afterwards. As a matter of fact, I am 
unable to reproduce this error in a controlled way -- I've only noticed the 
application crash while I wasn't watching.

Here are the relevant logs from the DataNode:

-----

2009-11-11 10:56:30,429 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Receiving block blk_6654737502435554218_32905 src: /127.0.0.1:33103 dest: 
/127.0.0.1:50010
2009-11-11 10:56:30,854 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/127.0.0.1:33103, dest: /127.0.0.1:50010, bytes: 488, op: HDFS_WRITE, cliID: 
DFSClient_attempt_200910192055_3176_r_000000_0, srvID: DS-1076946
65-62.129.149.242-50010-1254143880412, blockid: blk_6654737502435554218_32905
2009-11-11 10:56:30,855 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
PacketResponder 0 for block blk_6654737502435554218_32905 terminating
2009-11-11 10:56:36,184 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/127.0.0.1:50010, dest: /127.0.0.1:33161, bytes: 222249, op: HDFS_READ, cliID: 
DFSClient_attempt_200910192055_3178_m_000001_0, srvID: DS-10769
4665-62.129.149.242-50010-1254143880412, blockid: blk_-5388231964351796801_32881
2009-11-11 10:56:41,440 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Receiving block blk_698694794042576935_32906 src: /127.0.0.1:33208 dest: 
/127.0.0.1:50010
2009-11-11 10:56:41,463 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/127.0.0.1:33192, dest: /127.0.0.1:50010, bytes: 246338, op: HDFS_WRITE, cliID: 
DFSClient_-648412294, srvID: DS-107694665-62.129.149.242-50010
-1254143880412, blockid: blk_-5491664625164378664_32900
2009-11-11 10:56:41,473 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
PacketResponder 0 for block blk_-5491664625164378664_32900 terminating
2009-11-11 10:56:41,737 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/127.0.0.1:33208, dest: /127.0.0.1:50010, bytes: 1757, op: HDFS_WRITE, cliID: 
DFSClient_attempt_200910192055_3177_r_000000_0, srvID: DS-107694
665-62.129.149.242-50010-1254143880412, blockid: blk_698694794042576935_32906
2009-11-11 10:56:41,746 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
PacketResponder 0 for block blk_698694794042576935_32906 terminating
2009-11-11 10:58:27,516 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Exception in receiveBlock for block blk_-5973015382562856275_32855 
java.io.EOFException: while trying to read 65557 bytes
2009-11-11 10:58:27,517 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
PacketResponder 0 for block blk_-5973015382562856275_32855 Interrupted.
2009-11-11 10:58:27,517 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
PacketResponder 0 for block blk_-5973015382562856275_32855 terminating
2009-11-11 10:58:27,518 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Receiving block blk_-6041681763208648701_32907 src: /127.0.0.1:33655 dest: 
/127.0.0.1:50010
2009-11-11 10:58:27,518 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Receiving block blk_9134662523757703658_32907 src: /127.0.0.1:33688 dest: 
/127.0.0.1:50010
2009-11-11 10:58:27,519 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/127.0.0.1:50010, dest: /127.0.0.1:33533, bytes: 0, op: HDFS_READ, cliID: 
DFSClient_-1667266750, srvID: DS-107694665-62.129.149.242-50010-1254143880412, 
blockid: blk_6654737502435554218_32905
2009-11-11 10:58:27,519 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
writeBlock blk_-5973015382562856275_32855 received exception 
java.io.EOFException: while trying to read 65557 bytes
2009-11-11 10:58:27,531 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeRegistration(127.0.0.1:50010, 
storageID=DS-107694665-62.129.149.242-50010-1254143880412, infoPort=50075, 
ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 65557 bytes
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:264)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:308)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:372)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:524)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:357)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
        at java.lang.Thread.run(Thread.java:619)
2009-11-11 10:58:30,442 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/127.0.0.1:33655, dest: /127.0.0.1:50010, bytes: 9833, op: HDFS_WRITE, cliID: 
DFSClient_-648412294, srvID: DS-107694665-62.129.149.242-50010-1254143880412, 
blockid: blk_-6041681763208648701_32907

-----

Again, not much useful information: the exception shown in these logs has a later 
timestamp than the actual error that occurred in the client.

Now, I'm wondering about a few things:

 - What could be causing these problems? A network error seems out of the 
question, since all communication goes through localhost (127.0.0.1), but I'm 
having trouble figuring out which steps to take next to investigate the problem.

 - Would this problem (or this class of problems) be related to the fact that I'm 
only running a single datanode? In other words, would the client still throw an 
exception if it had multiple datanodes to choose from and only a single one 
failed to respond?

 - Given that the DataNode seems to return to normal operation shortly 
after the error occurs, would increasing the timeout from 63 
seconds to, say, 300 seconds be a wise idea? Are these kinds of "hiccups" to be 
expected in normal operation?
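For completeness, if raising the timeout does turn out to be the way to go, these are the properties I believe control it in our Hadoop version -- I'm not certain the names are right for every release, so please treat this as a sketch and correct me if I'm wrong:

```xml
<!-- hdfs-site.xml, client side; property names as I understand them
     for our (0.20-era) Hadoop version -->
<property>
  <!-- read side of the write pipeline; the 63000 ms in the client log
       appears to be this 60 s default plus a small per-datanode extension -->
  <name>dfs.socket.timeout</name>
  <value>300000</value>
</property>
<property>
  <!-- write side of the pipeline -->
  <name>dfs.datanode.socket.write.timeout</name>
  <value>300000</value>
</property>
```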


Thanks in advance!

Regards,

- Leon Mergen
