Hi Brian,

  Agreed, and thanks for the answer.

  Actually, in my situation I also tried telnet to the datanode on port 50010
at the time it threw the timeout exception, and the telnet connection worked.
So I tried setting the timeouts higher to see what happens. I also see a lot
of connections in TIME_WAIT when I run "netstat -an".
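
  (For anyone who wants to check the same thing, the socket states can be
tallied with something like:

  netstat -an | awk '/^tcp/ {print $6}' | sort | uniq -c | sort -rn

which makes the TIME_WAIT pile-up easy to spot.)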

  I added the following properties to my hadoop-site.xml:

<property>
 <name>dfs.socket.timeout</name>
 <value>180000</value>
 <description>dfs socket timeout</description>
</property>

<property>
 <name>dfs.datanode.socket.write.timeout</name>
 <value>3600000</value>
 <description>datanode write timeout</description>
</property>
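
  In case it helps anyone else: as far as I understand, the same values can
also be raised per-client through the Configuration API instead of
cluster-wide. A minimal sketch against the 0.20 client API (the class name is
just for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class TimeoutDemo {
  public static void main(String[] args) throws Exception {
    // Loads hadoop-site.xml from the classpath, then overrides the timeouts.
    Configuration conf = new Configuration();
    // Values are in milliseconds, matching the XML properties above.
    conf.set("dfs.socket.timeout", "180000");                 // read timeout: 3 min
    conf.set("dfs.datanode.socket.write.timeout", "3600000"); // write timeout: 1 hour
    FileSystem fs = FileSystem.get(conf);
    // ... write files through fs as usual ...
    fs.close();
  }
}

(I've also seen dfs.datanode.socket.write.timeout=0 mentioned as a way to
disable the write timeout entirely, though I haven't tried that myself.)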


On Mon, Dec 14, 2009 at 7:29 AM, Brian Bockelman <bbock...@cse.unl.edu> wrote:

>
> On Dec 14, 2009, at 5:21 AM, javateck javateck wrote:
>
> > I have exactly the same issue there.
> >
> > Sometimes I really feel helpless; maybe very few people use Hadoop as a
> > FS at all. I think this is also why people stop using it: there are so
> > many issues, and so few people who can help or who have the experience.
>
> <soapbox>
> These are the joys of working on a young software project, right?  I would
> point out that many folks answer many questions every day on the mailing
> lists.  If you want every question solved every time, you have the option of
> buying (excellent) support.
>
> As far as distributed file systems go, I've got a lot of experience running
> ones that have more issues and are used by even fewer folks.  It's not
> pleasant.  If you just need a 30-40TB filesystem (i.e., not a data
> processing system), I'd agree that you can probably find more mature
> systems.  If you use HDFS as a file system only and don't see clear benefits
> over Lustre, then perhaps you should be using Lustre in the first place.
> </soapbox>
>
> With regards to the error below, I'd guess that it is caused by a network
> partition - i.e., it appears that the client couldn't open a socket
> connection to 10.1.75.125 from 10.1.75.11.  I'd check for routing issues on
> both nodes.  Does the error happen intermittently for any two nodes, or, if
> you look through past incidents, does it always involve the same node?
>
> Brian
>
> >
> >
> > On Wed, Nov 25, 2009 at 11:27 AM, David J. O'Dell <dod...@videoegg.com>
> > wrote:
> >
> >> I have 2 clusters:
> >> 30 nodes running 0.18.3
> >> and
> >> 36 nodes running 0.20.1
> >>
> >> I've intermittently seen the following errors on both of my clusters;
> >> it happens when writing files.
> >> I was hoping this would go away with the new version, but I see the same
> >> behavior on both versions.
> >> The namenode logs don't show any problems; it's always on the client and
> >> the datanodes.
> >>
> >> Below is an example from this morning; unfortunately I haven't found a
> >> bug or config option that specifically addresses this issue.
> >>
> >> Any insight would be greatly appreciated.
> >>
> >> Client log:
> >> 09/11/25 10:54:15 INFO hdfs.DFSClient: Exception in createBlockOutputStream
> >> java.net.SocketTimeoutException: 69000 millis timeout while waiting for
> >> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
> >> local=/10.1.75.11:37852 remote=/10.1.75.125:50010]
> >> 09/11/25 10:54:15 INFO hdfs.DFSClient: Abandoning block
> >> blk_-105422935413230449_22608
> >> 09/11/25 10:54:15 INFO hdfs.DFSClient: Waiting to find target node:
> >> 10.1.75.125:50010
> >>
> >> Datanode log:
> >> 2009-11-25 10:54:51,170 ERROR
> >> org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
> >> 10.1.75.125:50010,
> >> storageID=DS-1401408597-10.1.75.125-50010-1258737830230, infoPort=50075,
> >> ipcPort=50020):DataXceiver
> >> java.net.SocketTimeoutException: 120000 millis timeout while waiting for
> >> channel to be ready for connect. ch :
> >> java.nio.channels.SocketChannel[connection-pending remote=/10.1.75.104:50010]
> >>      at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:213)
> >>      at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:404)
> >>      at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:282)
> >>      at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
> >>      at java.lang.Thread.run(Thread.java:619)
> >>
> >>
>
>
