Thanks. I spent a lot of time looking at the logs, and there is nothing from
the reducers until they start complaining about 'could not complete'.

Found this in the jobtracker log file:

2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception  for block blk_3829493505250917008_9959810
java.io.IOException: Bad response 1 for block blk_3829493505250917008_9959810 from datanode 10.120.41.103:50010
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2454)
2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_3829493505250917008_9959810 bad datanode[2] 10.120.41.103:50010
2011-03-16 02:38:47,881 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_3829493505250917008_9959810 in pipeline 10.120.41.105:50010, 10.120.41.102:50010, 10.120.41.103:50010: bad datanode 10.120.41.103:50010
2011-03-16 02:38:53,133 INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /var/hadoop/tmp/2_20110316_pmta_pipe_2_20_50351_2503122/_logs/history/hadnn01.atlis1_1299879680612_job_201103111641_0312_deliv_2_20110316_pmta_pipe*2_20110316_%5B%281%2F3%29+...QUEUED_T retrying...
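
For what it's worth, the 'bad datanode[2]' is an index into the write
pipeline, i.e. the third node (10.120.41.103) in the 105 -> 102 -> 103
pipeline listed above, which matches the last message. A quick sanity check
of the datanodes and blocks while this is going on is roughly the following
(standard 0.20 commands; /var/hadoop/tmp is just the working path from the
log above):

  hadoop dfsadmin -report
  hadoop fsck /var/hadoop/tmp -files -blocks -locations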

Looking at the logs from the various times this happens, the 'from datanode'
in the first message varies across all of the data nodes (each fails a
roughly equal number of times), so I don't think one specific node is having
problems.
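
To tally it, a one-liner along these lines is enough (the log file names
will vary with your setup):

  grep 'from datanode' hadoop-*-jobtracker-*.log \
    | grep -oE '[0-9.]+:50010' | sort | uniq -c | sort -rn
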
Any other ideas?

Thanks,

Chris
On Sun, Mar 13, 2011 at 3:45 AM, icebergs <hkm...@gmail.com> wrote:

> You should check the bad reducers' logs carefully. There may be more
> information about it.
>
> 2011/3/10 Chris Curtin <curtin.ch...@gmail.com>
>
> > Hi,
> >
> > The last couple of days we have been seeing tens of thousands of these
> > errors in the logs:
> >
> >  INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file
> > /offline/working/3/aat/_temporary/_attempt_201103100812_0024_r_000003_0/4129371_172307245/part-00003
> > retrying...
> >
> > When this is going on, the reducer in question is always the last
> > reducer in a job.
> >
> > Sometimes the reducer recovers. Sometimes hadoop kills that reducer,
> > runs another and it succeeds. Sometimes hadoop kills the reducer and
> > the new one also fails, so it gets killed and the cluster goes into a
> > loop of kill/launch/kill.
> >
> > At first we thought it was related to the size of the data being
> > evaluated (4+ GB), but we've seen it several times today on < 100 MB.
> >
> > Searching here or online doesn't show a lot about what this error
> > means and how to fix it.
> >
> > We are running Hadoop 0.20.2, r911707.
> >
> > Any suggestions?
> >
> >
> > Thanks,
> >
> > Chris
> >
>
