Hi

I am using version 1.0.1, and the so-called reduce hang problem turned out to 
be my own mistake in the cluster configuration, which I have since fixed, or so 
I think. However, it raised some other questions, hence this email.

- I have a bunch of MR jobs that run daily, and I noticed that one of them (not 
always the same one) would hang. In the mapred admin console, the map phase 
showed 100% complete while the reduce was stuck at some percentage in the copy 
phase. After some digging around in the task tracker logs, I found that the 
reduce task could not copy the map outputs. Here is the exception: 
2012-06-06 08:17:15,404 WARN org.apache.hadoop.mapred.ReduceTask: 
java.net.UnknownHostException:

- So clearly my slave nodes could not see each other. Every time a reduce task 
was scheduled on one slave node and one of the map tasks had run on another 
slave node, this problem would occur, because not every slave was listed in the 
/etc/hosts file on the slave boxes. I fixed that and all is well. That said, my 
question is: shouldn't the reduce task time out after a while when it cannot 
copy the map output? It just seems to retry continuously. I even let the MR job 
sit there for up to 8 hours (it usually completes in 10 or 15 minutes) to see 
if it would time out and fail the job.
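
As a sanity check for this kind of setup problem, something like the Python 
sketch below could be run on each slave to confirm that every other slave's 
host name resolves from that box. It is only an illustration: the function name 
unresolved_hosts and the example host names are made up, and it assumes a plain 
name lookup (which consults /etc/hosts on a typical Linux setup) is a good 
enough stand-in for what the reduce task does when it fetches map output.

```python
import socket


def unresolved_hosts(hostnames, resolve=socket.gethostbyname):
    """Return the host names that cannot be resolved from this machine.

    `resolve` is injectable so the check can be exercised without real DNS;
    by default it uses the system resolver.
    """
    failed = []
    for name in hostnames:
        try:
            resolve(name)
        except OSError:
            # socket.gaierror (lookup failure) is a subclass of OSError.
            failed.append(name)
    return failed


# Hypothetical usage, with placeholder names standing in for the real slaves:
#   for name in unresolved_hosts(["slave1.example.com", "slave2.example.com"]):
#       print("cannot resolve:", name)
```

An empty result on every slave means each node can at least look up all the 
others by name; any name that is printed is missing from that box's view.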

-Giri

