Hi,

For the last couple of days we have been seeing tens of thousands of these
errors in the logs:

  INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file
  /offline/working/3/aat/_temporary/_attempt_201103100812_0024_r_000003_0/4129371_172307245/part-00003
  retrying...

When this is going on, the reducer in question is always the last reducer in
the job.
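
In case numbers help, the retries can be tallied per reduce attempt with
something like this (a sketch only; the sample line below stands in for a
real TaskTracker syslog, whose path will vary by install):

```shell
# Sketch: tally "Could not complete file" retries per reduce attempt.
# In practice you would point this at the TaskTracker userlogs; here a
# sample line from the error above stands in so the script is
# self-contained.
log=$(mktemp)
cat > "$log" <<'EOF'
INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /offline/working/3/aat/_temporary/_attempt_201103100812_0024_r_000003_0/4129371_172307245/part-00003 retrying...
EOF

# Pull the attempt id out of each retry line and count per attempt; the
# attempt with the runaway count is the one stuck in the retry loop.
grep 'Could not complete file' "$log" \
  | grep -o 'attempt_[0-9]\{12\}_[0-9]*_r_[0-9]*_[0-9]*' \
  | sort | uniq -c | sort -rn

rm -f "$log"
```

Against real logs, the first `grep` would take the syslog paths instead of
the temporary file.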

Sometimes the reducer recovers. Sometimes Hadoop kills that reducer, launches
another, and it succeeds. Sometimes Hadoop kills the reducer and the new one
also fails, so it too gets killed and the cluster goes into a
kill/launch/kill loop.

At first we thought it was related to the size of the data being evaluated
(4+ GB), but we've seen it several times today on datasets under 100 MB.

Searching the archives and online doesn't turn up much about what this error
means or how to fix it.

We are running Hadoop 0.20.2, r911707.

Any suggestions?


Thanks,

Chris
