The DataNodes use multiple threads with locking, and one of the assumptions
is that the block report (once per hour by default) takes little time. The
DataNode pauses while the block report is running, and if the report happens
to take a while, odd things start to happen.
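
To make the contention concrete, here is a minimal sketch of the pattern
(not the actual DataNode code; the class, field, and path names are made
up): the block-report scan and the per-block write path synchronize on the
same dataset-wide lock, so a slow scan can stall every writer long enough
for the client's ResponseProcessor to hit its socket timeout.

import java.io.File;
import java.util.ArrayList;
import java.util.List;

class FsDatasetSketch {
    private final Object datasetLock = new Object();         // one coarse lock
    private final File dataDir = new File("/tmp/dfs/data");  // hypothetical storage dir

    // Runs roughly once per hour; walks the whole storage directory
    // while holding the lock.
    List<String> getBlockReport() {
        synchronized (datasetLock) {
            List<String> blocks = new ArrayList<String>();
            File[] files = dataDir.listFiles();
            if (files != null) {
                for (File f : files) {        // slow when there are many block files
                    blocks.add(f.getName());
                }
            }
            return blocks;
        }
    }

    // Called on the write path for every incoming block; waits here while a
    // report is running, so the writing client can time out waiting for an ack.
    File createBlockFile(String blockId) {
        synchronized (datasetLock) {
            return new File(dataDir, blockId);
        }
    }
}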


On Fri, Jan 30, 2009 at 8:59 AM, Karl Kleinpaste <k...@conviva.com> wrote:

> We have a small test cluster, a double master (NameNode+JobTracker) plus
> 2 slaves, running 0.18.1.  We are seeing an intermittent problem where
> our application logs failures out of DFSClient, thus:
>
> 2009-01-30 01:59:42,072 WARN org.apache.hadoop.dfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_7603130349014268849_2349933
> java.net.SocketTimeoutException: 66000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.10.102:54700 remote=/10.0.10.108:50010]
>        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:162)
>        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:150)
>        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:123)
>        at java.io.DataInputStream.readFully(DataInputStream.java:176)
>        at java.io.DataInputStream.readLong(DataInputStream.java:380)
>        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2044)
>
> (Apologies for paste formatting.  I hate Evolution.)
>
> Our application here is our "JobConsole," which is responsible for
> taking notifications from an external data-generating application: The
> external app scribbles files into DFS and then tells JobConsole about
> them.  JobConsole submits jobs to crunch that data in response to the
> external app's notifications of data availability.  JobConsole runs on
> the master node.
>
> Chasing that block identifier through our JobConsole log plus the
> DataNode logs on the slaves, we have an odd timeline, which is this:
> 01:58:32        slave (.108, above): receiving blk from master (.102)
> 01:58:35        other slave (.107): receiving blk from .108
> 01:58:36        .107: received blk
> 01:58:38        .107: terminate PacketResponder
> 01:59:42        JobConsole (.102): 66 s timeout + Error Recovery (above)
> 01:59:42        .107: invoke recoverBlock on that blk
> 02:01:15        .108: received blk + terminate PacketResponder
> 03:03:24        .108: deleting blk, from Linux pathname in DFS storage
>
> What's clear from this is that .108 got the block quickly, because it
> was immediately in a position to send a copy to .107, which responded
> promptly to say that it had the block.  But .108's DataNode sat on the
> block for nearly three minutes before announcing what appears to have
> been ordinary completion and responder termination.  After the first
> minute-plus of that long period, JobConsole gave up and did a recovery
> operation, which appears to have worked.  If .108's DataNode sent a
> notification when it finally logged completed reception, no doubt
> nobody was listening for it any more.
>
> What's particularly of interest to us is that the NameNode log shows us
> that the data being moved is job.jar:
>
> 2009-01-30 01:58:32,353 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
> NameSystem.allocateBlock:
> /usr/local/rinera/hadoop/hadoop-runtime/0/mapred/system/job_200901291752_3021/job.jar.
> blk_7603130349014268849_2349933
>
> Note block name and timestamp.
>
> Does anyone else have knowledge of or history with such glitches?
> We've recently begun seeing a number of problems in communication
> between task-management processes and DFS that we had not seen before,
> and we're trying to nail down where they're coming from, so far
> without success.
>
>
