Hi Brian, I agree, and thanks for the answer.
Actually, in my situation I also tried telnet to the datanode on 50010 at the time it threw the timeout exception, and the telnet worked. I then tried setting the timeouts higher to see what would happen. I also see a lot of TIME_WAIT connections when I run "netstat -an". I added the following properties to my hadoop-site.xml:

<property>
  <name>dfs.socket.timeout</name>
  <value>180000</value>
  <description>dfs socket timeout</description>
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>3600000</value>
  <description>datanode write timeout</description>
</property>

On Mon, Dec 14, 2009 at 7:29 AM, Brian Bockelman <bbock...@cse.unl.edu> wrote:
>
> On Dec 14, 2009, at 5:21 AM, javateck javateck wrote:
>
> > I have exactly the same issue there.
> >
> > Sometimes I really feel helpless; maybe very few people use Hadoop as a
> > filesystem at all. I think this is also why people stop using it: there
> > are so many issues, and so few people who can help or have the
> > experience.
>
> <soapbox>
> These are the joys of working on a young software project, right? I would
> point out that many folks answer many questions every day on the mailing
> lists. If you want every question solved every time, you have the option
> of buying (excellent) support.
>
> As far as distributed file systems go, I've got a lot of experience
> running ones that have more issues and are used by even fewer folks. It's
> not pleasant. If you just need a 30-40TB filesystem (i.e., not a data
> processing system), I'd agree that you can probably find more mature
> systems. If you use HDFS as a file system only and don't see clear
> benefits over Lustre, then perhaps you should be using Lustre in the
> first place.
> </soapbox>
>
> With regard to the error below, I'd guess that it is caused by a network
> partition - i.e., it appears that the client couldn't open a socket
> connection to 10.1.75.125 from 10.1.75.11. I'd check for routing issues
> on both nodes.
> Does the error happen intermittently for any two nodes, or, if you look
> through the past incidents, does it always involve the same node?
>
> Brian
>
> > On Wed, Nov 25, 2009 at 11:27 AM, David J. O'Dell <dod...@videoegg.com> wrote:
> >
> >> I have 2 clusters:
> >> 30 nodes running 0.18.3
> >> and
> >> 36 nodes running 0.20.1
> >>
> >> I've intermittently seen the following errors on both of my clusters;
> >> it happens when writing files.
> >> I was hoping this would go away with the new version, but I see the
> >> same behavior on both versions.
> >> The namenode logs don't show any problems; it's always on the client
> >> and datanodes.
> >>
> >> Below is an example from this morning; unfortunately I haven't found a
> >> bug or config that specifically addresses this issue.
> >>
> >> Any insight would be greatly appreciated.
> >>
> >> Client log:
> >> 09/11/25 10:54:15 INFO hdfs.DFSClient: Exception in createBlockOutputStream
> >> java.net.SocketTimeoutException: 69000 millis timeout while waiting for
> >> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
> >> local=/10.1.75.11:37852 remote=/10.1.75.125:50010]
> >> 09/11/25 10:54:15 INFO hdfs.DFSClient: Abandoning block
> >> blk_-105422935413230449_22608
> >> 09/11/25 10:54:15 INFO hdfs.DFSClient: Waiting to find target node:
> >> 10.1.75.125:50010
> >>
> >> Datanode log:
> >> 2009-11-25 10:54:51,170 ERROR
> >> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> >> 10.1.75.125:50010,
> >> storageID=DS-1401408597-10.1.75.125-50010-1258737830230, infoPort=50075,
> >> ipcPort=50020):DataXceiver
> >> java.net.SocketTimeoutException: 120000 millis timeout while waiting for
> >> channel to be ready for connect. ch :
> >> java.nio.channels.SocketChannel[connection-pending remote=/10.1.75.104:50010]
> >>     at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:213)
> >>     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
> >>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:282)
> >>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
> >>     at java.lang.Thread.run(Thread.java:619)
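P.S. The manual telnet check described above can be scripted, which makes it easier to run against every datanode whenever the exception fires. A minimal sketch (the address in the example comes from the client log; the helper name `port_reachable` is mine, and this only tests a bounded-timeout TCP connect, not the HDFS protocol itself):

```python
import socket

def port_reachable(host, port, timeout_s=5.0):
    """TCP connect with a bounded wait -- a scriptable stand-in for
    'telnet host port'. Returns True if the port accepts a connection
    within timeout_s seconds, False on refusal or timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # covers ConnectionRefusedError and socket.timeout
        return False

# Example: the datanode address from the client log above
# port_reachable("10.1.75.125", 50010)
```

Note that a successful connect only rules out a hard routing or firewall problem at that moment; it won't catch an intermittent partition like the one Brian suspects, so it's worth looping this from both endpoints while the write workload runs.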