Hmm, it looks like the reducer hasn't even started, and I do not have an external resource at all.
simon

On Wed, Dec 9, 2009 at 7:39 PM, Fuad Efendi &lt;[email protected]&gt; wrote:
> I had similar issues before; Reduce can do some 'additional' processing and
> have a dependency on external services.
>
> For instance, in my case it sends documents to SOLR (Lucene based), so I
> have a dependency on Lucene in a reducer. And Lucene sometimes needs 30-50
> minutes to perform an index merge. If your timeout settings are not high
> enough, the job will be killed. You may also execute reporter.keepAlive()
> every 60 seconds (in a separate thread spawned by your reducer).
>
> See <name>mapred.task.timeout</name>, and use reporter.progress(),
> keepAlive(), etc.
>
> Fuad Efendi
> +1 416-993-2060
> http://www.tokenizer.ca
> Data Mining, Vertical Search
>
>> -----Original Message-----
>> From: Simon Willnauer [mailto:[email protected]]
>> Sent: December-09-09 1:31 PM
>> To: [email protected]
>> Subject: Copyphase dies on timeout
>>
>> Hi folks,
>>
>> I have a question regarding a map-reduce job that gets killed during the
>> copy phase due to a timeout. I don't fully understand the output, so I'm
>> first looking for some help to understand what is going on.
>> There are about 50 mappers running, and each mapper succeeds. The job is
>> configured with a single reducer (I assume this could be closely related
>> to the problem?!).
>> This is the output:
>>
>> 2009-12-09 15:01:45,103 INFO org.apache.hadoop.mapred.Merger: Merging 5 sorted segments
>> 2009-12-09 15:01:45,103 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 5 segments left of total size: 216092233 bytes
>> 2009-12-09 15:01:50,727 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200912081402_0042_r_000000_0 Merge of the 5 files in-memory complete. Local file is /data/hadoop-data/mapred/local/taskTracker/jobcache/job_200912081402_0042/attempt_200912081402_0042_r_000000_0/output/map_275.out of size 216092229
>> 2009-12-09 15:01:50,728 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200912081402_0042_r_000000_0 Thread waiting: Thread for merging on-disk files
>> 2009-12-09 15:01:58,109 INFO org.apache.hadoop.mapred.ReduceTask: Read 53281055 bytes from map-output for attempt_200912081402_0042_m_000253_0
>> 2009-12-09 15:01:58,148 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from attempt_200912081402_0042_m_000253_0 -> (71, 103) from hadoop32
>> 2009-12-09 15:01:58,273 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200912081402_0042_r_000000_0 Need another 44 map output(s) where 3 is already in progress
>> 2009-12-09 15:01:58,273 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200912081402_0042_r_000000_0 Scheduled 0 outputs (0 slow hosts and 41 dup hosts)
>> 2009-12-09 15:02:40,174 INFO org.apache.hadoop.mapred.ReduceTask: Read 81297732 bytes from map-output for attempt_200912081402_0042_m_000112_0
>> 2009-12-09 15:02:40,175 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from attempt_200912081402_0042_m_000112_0 -> (129, 153) from hadoop31
>> 2009-12-09 15:02:40,175 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200912081402_0042_r_000000_0 Scheduled 1 outputs (0 slow hosts and 16 dup hosts)
>> 2009-12-09 15:02:40,223 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 82364774 bytes (82364778 raw bytes) into RAM from attempt_200912081402_0042_m_000121_0
>> 2009-12-09 15:02:45,044 INFO org.apache.hadoop.mapred.ReduceTask: Read 82722679 bytes from map-output for attempt_200912081402_0042_m_000196_0
>> 2009-12-09 15:02:45,044 INFO org.apache.hadoop.mapred.ReduceTask: Rec #1 from attempt_200912081402_0042_m_000196_0 -> (114, 137) from hadoop21
>> 2009-12-09 15:02:45,044 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200912081402_0042_r_000000_0 Scheduled 1 outputs (0 slow hosts and 32 dup hosts)
>> 2009-12-09 15:02:45,143 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 82639206 bytes (82639210 raw bytes) into RAM from attempt_200912081402_0042_m_000195_0
>> 2009-12-09 15:02:58,274 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200912081402_0042_r_000000_0 Need another 42 map output(s) where 3 is already in progress
>> 2009-12-09 15:02:58,274 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200912081402_0042_r_000000_0 Scheduled 0 outputs (0 slow hosts and 39 dup hosts)
>> 2009-12-09 15:03:58,275 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200912081402_0042_r_000000_0 Need another 42 map output(s) where 3 is already in progress
>> 2009-12-09 15:03:58,275 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200912081402_0042_r_000000_0 Scheduled 0 outputs (0 slow hosts and 39 dup hosts)
>> ......
>>
>> Task attempt_200912081402_0042_r_000000_0 failed to report status for 902 seconds. Killing!
>>
>> My first guess would be some IO problem, but I don't have any indicator
>> in those logs. If somebody can clarify or give me a hint, that would be
>> great. Even if it is something like: use more reducers because of ...
>> <-- this is the interesting part at the moment!
>> btw: this is hadoop-0.20
>>
>> simon
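[Editor's note: the keep-alive pattern Fuad suggests can be sketched as below. This is a minimal, self-contained sketch, not the Hadoop API itself: ProgressReporter is a hypothetical stand-in for Hadoop's old-API Reporter (whose progress() method resets the mapred.task.timeout clock; a keepAlive() method is not part of that interface as far as I can tell), and the millisecond intervals are only so the sketch runs quickly — a real reducer would ping roughly every 60 seconds while a long blocking call such as an index merge is in flight.]

```java
import java.util.concurrent.atomic.AtomicInteger;

public class KeepAlive {

    /** Hypothetical stand-in for Hadoop's old-API Reporter; only progress() matters here. */
    interface ProgressReporter {
        void progress();
    }

    /** Starts a daemon thread that pings the reporter until it is interrupted. */
    static Thread startKeepAlive(ProgressReporter reporter, long intervalMs) {
        Thread t = new Thread(() -> {
            try {
                while (true) {
                    reporter.progress();   // tell the framework we are still alive
                    Thread.sleep(intervalMs);
                }
            } catch (InterruptedException e) {
                // the long-running call finished; exit quietly
            }
        });
        t.setDaemon(true);                 // never block JVM shutdown on its own
        t.start();
        return t;
    }

    /** Simulates a long blocking call (e.g. a Lucene index merge) guarded by a keep-alive thread. */
    static int runWithKeepAlive(long workMs, long pingIntervalMs) throws InterruptedException {
        AtomicInteger pings = new AtomicInteger();
        Thread keepAlive = startKeepAlive(pings::incrementAndGet, pingIntervalMs);
        Thread.sleep(workMs);              // the "slow" work the reducer is doing
        keepAlive.interrupt();             // stop pinging once the work returns
        keepAlive.join();
        return pings.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("pings sent: " + runWithKeepAlive(300, 50));
    }
}
```

Alternatively (or additionally), <name>mapred.task.timeout</name> — the value is in milliseconds — can simply be raised in the job configuration, and calling reporter.progress() or reporter.setStatus() from inside the reduce loop itself has the same liveness effect whenever the loop is actually making progress.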
